The Analyses view lists the existing analyses available on the server, each ordered under a Project.
The view shows information about each analysis, such as Creation time, Status, number of Inputs, and number of Concepts / Associations.
At the top of the Analyses view is the Analysis toolbar, which allows the user to run or re-run analyses or to take other actions on a selected analysis.
The available options are (left-to-right):
- Start a new analysis
- Re-run the selected analysis
- Regenerate the associations of the selected analysis from the concepts
- Generate categories automatically
- View the source used
- View the settings used
- Click to see the analysis statistics
- Rename this analysis
- Delete this analysis
- Rename this project
- View this analysis as a mosaic
- View this analysis as a pie chart
- View this analysis as a table
- View this analysis as a table with its out-of-vocabulary words
Create an analysis
To create an analysis in Teneo Discovery, the Analyses view needs to be opened. To open it, simply click the Analyses button in the top left corner of the window.
Now, in the Analyses toolbar, click the Start a new analysis button, as illustrated in the image below.
The Choose data source view opens; select which data source to use:
- Local files: browse to a file stored locally (supported file formats are .txt (encoded with UTF-8), .csv, .tsv, .lwl and .lwl.gz)
- Remote raw data sets: select an already uploaded unprocessed data set
- Remote processed data sets: select an already uploaded and processed data set.
It is possible to multi-select several files or data sets by using the Control key.
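Before uploading local files, it can be handy to check them against the supported formats listed above. The following is a minimal sketch of such a pre-flight check; the helper names `is_supported` and `is_valid_utf8` are hypothetical and not part of Teneo Discovery.

```python
from pathlib import Path

# Extensions listed in the documentation for local files.
SUPPORTED_SUFFIXES = (".txt", ".csv", ".tsv", ".lwl", ".lwl.gz")


def is_supported(path: str) -> bool:
    """Return True if the file name ends with one of the supported extensions."""
    return Path(path).name.lower().endswith(SUPPORTED_SUFFIXES)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (required for .txt files)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For a .txt file, `is_valid_utf8(Path(name).read_bytes())` can be used to confirm the encoding before selecting the file in the Choose data source view.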
When the data set is selected, the New Analysis wizard opens.
In the New Analysis Wizard, the following options are available:
- Analysis name: write the name of the analysis
- Project name: select one of the existing projects on the server or write the name of a new project-folder
- Analysis language: by default, the language used is the language selected under global settings. To change the language simply click the button and select the correct language.
- Create associations
- Named Entity Recognition: by default, Teneo Discovery runs Named Entity Recognition (NER) and annotates concepts with named entity types, when this is applicable for the analysis language. Click the Entities to merge option to alter the list of entities to merge automatically.
- Create categories automatically: to help with the organization, the option Topic Modelling provides a machine learning algorithm to automatically propose categorization of associations. Simply select this option by clicking the dropdown menu. For more information please see this topic.
When clicking Start analysis, the file will be uploaded to the Discovery server and Teneo Discovery will process and mine the data to create the analysis. Learn more about the Analysis types.
Create categories automatically
To help with the organization, navigation, and overview in analyses, categories can be created automatically during the analysis of the data.
Categories can be created manually, or Teneo Discovery can create them automatically: either natively, using unsupervised machine learning in the form of topic modelling, or, if the user uploads a customized classification model or a Teneo solution with Classes, by using the uploaded model to categorize the discoveries in a supervised way.
The categories are represented visually as a folder structure and each category folder contains the associations and concepts related to it.
In the New Analysis Wizard, under the option Create categories automatically, the user can choose between Topic Modelling, Custom ML Model, Template or None.
When Topic Modelling is selected, the user can specify the Number of categories and the Minimum Topic confidence score required for a category to be created. Note that the number of categories also depends on the size of the data set: if the data set is small, fewer categories than specified might be created.
The more topics/categories are requested, the more fine-grained the categories will be. The topic confidence threshold is used to discard low-confidence topics. High-confidence topics are in theory better, but depending on the size of the data source used, they might be quite small/low in content.
Custom ML Model
The user can also create categories automatically using the machine learning model of a Teneo solution (it needs to have Classes) or by uploading an MLeap model.
To run an ML model on the data, select this option under Create categories automatically and then select the model to use. The following two types of files can be uploaded:
- .solution files: a solution that contains Classes, exported from Teneo Studio.
- .zip files: a zip file containing an MLeap format model.
Once the file has been selected, the user needs to set two thresholds: the minimum confidence for a class to match an input, and the minimum percentage of a concept's or association's inputs that must match a particular class for the concept/association to be assigned to that category.
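The interplay of the two thresholds can be illustrated with a small sketch. This is one possible reading of the settings, not Teneo Discovery's actual implementation: the function name `assign_categories` and the input format (one `(class, confidence)` pair per input of a single concept or association) are assumptions made for illustration.

```python
from collections import Counter


def assign_categories(predictions, min_confidence, min_percentage):
    """Return the classes that qualify as categories for one concept/association.

    predictions: list of (class_name, confidence) pairs, one per input
    (hypothetical format, assumed for this sketch).
    """
    if not predictions:
        return []
    # An input only counts towards a class if the model is confident enough.
    matched = Counter(
        cls for cls, confidence in predictions if confidence >= min_confidence
    )
    total = len(predictions)
    # A class becomes a category if enough of the inputs matched it.
    return sorted(
        cls for cls, count in matched.items()
        if 100.0 * count / total >= min_percentage
    )
```

With four inputs of which two match a class confidently, that class covers 50% of the inputs and is assigned only if the minimum percentage is 50 or lower.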
Categories can also be created by using a template. With this option, the selected category structure will be applied to the analysis, but no concepts or associations will be added automatically to any of the categories.
When clicking Show advanced options in the New Analysis wizard, the user can set the following parameters:
- Spelling corrector (options: LUCENE, NONE, TENEO)
- Min. confidence
- Simplification (options: Engine, DISCOVERY)
- Split sentences
- Concepts - Minimum occurrence
- Associations - Minimum occurrence
- Use stop words (default: Standard Language)
- Use stemming
- Use anti-stem words
- Auto-merge Concepts (Default merging list (language code))
- Keep concepts to be merged
- Keep numbers
Note that the different options are explained in subsections below the image.
Spelling corrector
Teneo Discovery is able to find misspellings and map them to the correct concept; misspelled and corrected word forms are tagged as typo wherever displayed.
The default option is to use Lucene's misspelling detection with a confidence level of 80%. Users can alter the confidence threshold and can also choose to use the Teneo misspelling detection instead.
For the Teneo option, Discovery makes use of the Teneo StandardAutoCorrection Input Processor and the Teneo StandardSimilarityMatchCorrection Input Processor.
The Lucene algorithm is much faster, but not quite as accurate as the Teneo option. Also see the Input Processors section.
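The role of the confidence threshold can be illustrated with a small sketch. Note that this uses Python's standard `difflib` similarity ratio as a stand-in; it is neither the Lucene algorithm nor the Teneo Input Processors, and the function name `correct` and the vocabulary are hypothetical.

```python
import difflib


def correct(word, vocabulary, min_confidence=0.8):
    """Return the closest vocabulary word, or None if nothing is similar enough.

    The similarity score from difflib stands in for the corrector's
    confidence; 0.8 mirrors the documented default of 80%.
    """
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=min_confidence)
    return matches[0] if matches else None
```

Raising `min_confidence` makes the sketch more conservative: fewer words are mapped to a correction, but the mappings that remain are closer matches.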
Simplification
By default, Teneo Discovery uses the language-specific simplification rules from the input processing chains, although users can also select the Discovery simplification, which applies somewhat lighter simplification rules. Also see the Input Processors section.
Split sentences
Teneo Discovery, by default, treats each sentence as an input. With this setting, it is possible to turn off sentence splitting. This can be useful for some data files with shorter inputs and allows Teneo Discovery's association miner algorithm to find associations over sentence boundaries.
Concepts/Associations Minimum occurrence
This setting regulates the minimum number of times a word or sequence of words must appear in the input file to be taken into account for the analysis.
The default number for concepts is 6, meaning that a word or sequence of words needs to appear at least 6 times in the input data file for a concept to be created based on it. Words occurring fewer times than this will be disregarded, and will end up in the Uncategorized user inputs concept. The default number for associations is 8.
The higher the minimum occurrence number, the fewer concepts will be created, but also the faster the processing will be. Teneo Discovery creates a maximum of 150 000 concepts in an analysis, regardless of how big the input file is.
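The effect of the minimum occurrence setting can be sketched as a simple frequency filter. This is an illustration only: the function name `select_concepts` is hypothetical, and applying the 150 000 limit by keeping the most frequent candidates first is an assumption, as the documentation does not state how the limit is enforced.

```python
from collections import Counter

MAX_CONCEPTS = 150_000  # upper limit stated in the documentation


def select_concepts(tokens, min_occurrence=6, max_concepts=MAX_CONCEPTS):
    """Return {token: count} for tokens that occur at least min_occurrence times."""
    counts = Counter(tokens)
    # most_common() sorts by descending count; keeping the most frequent
    # candidates when the limit is hit is an assumption of this sketch.
    frequent = [(tok, n) for tok, n in counts.most_common() if n >= min_occurrence]
    return dict(frequent[:max_concepts])
```

With the default of 6, a word that appears five times is dropped, while lowering `min_occurrence` to 5 keeps it at the cost of more candidates to process.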
Use stop words
A stop word list works as a filter on the data source: when a stop word list is applied to an analysis, the words in the list are filtered out, meaning no concepts will be created based on those words.
Stop words are typically function words, filler words, profanity and other words that do not carry meaning. A language-specific stop word list is applied by default, but users can also select their own stop word lists or run analyses without any stop words.
Use stemming
By default, Teneo Discovery's language-specific stemming rules are applied and words that originate from the same stem are grouped together under one concept. With this setting, users can turn the stemming off if desired.
Use anti-stem words
Under this setting, users can load a text file with words to be excluded from the stemming rules. It is also possible to create an anti-stem word list directly in the interface.
Auto-merge Concepts
Teneo Discovery allows for concept merging already during the analysis creation phase.
A list of concepts to merge is used by default, grouping countries into a countries.list concept, colors into a color.list concept, synonyms for answer/reply/etc. into an answer.syn concept, etc.
Users can add new items to a copy of the merging lists directly while viewing an analysis or under the Global settings; alternatively, a custom merging list can be added under Advanced options when starting a new analysis.
To do so, in the New Analysis wizard, simply click the Default merging list (language) link.
Keep concepts to be merged
By default, words that are part of a merged concept do not yield their own concepts. If this setting is enabled, concepts that are part of merged concepts are also included in the analysis as individual concepts.
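Concept merging and the Keep concepts to be merged setting can be pictured with the following sketch. The merging list contents and the function name `merge_concepts` are hypothetical; the actual default lists ship with Teneo Discovery and only the concept names (countries.list, color.list, answer.syn) are taken from the documentation.

```python
# Illustrative merging list; the member words are made up for this sketch.
MERGE_LIST = {
    "countries.list": {"france", "spain", "sweden"},
    "color.list": {"red", "green", "blue"},
    "answer.syn": {"answer", "reply"},
}


def merge_concepts(tokens, merge_list, keep_merged=False):
    """Replace words by their merged concept name.

    With keep_merged=True the original word is kept as well, so it can
    still yield its own individual concept.
    """
    lookup = {word: name for name, words in merge_list.items() for word in words}
    result = []
    for token in tokens:
        merged = lookup.get(token)
        if merged is None:
            result.append(token)       # not part of any merging list
        else:
            result.append(merged)      # counted under the merged concept
            if keep_merged:
                result.append(token)   # and also as an individual concept
    return result
```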
Keep numbers
By default, all numerical values are grouped into the concept ANY_NUMBER. If this setting is checked, however, numerical values will appear in their original form in the analysis. To exclude all numerical values from an analysis altogether, users can add the string ANY_NUMBER to the stop word list used in the analysis.
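The three behaviours described above (grouping, keeping, and excluding numbers) can be sketched as follows. What exactly Teneo Discovery treats as a numerical value is not specified here, so the regular expression and the function names are assumptions made for illustration.

```python
import re

# Assumed definition of a numerical value: digits with optional
# "." or "," separated groups, e.g. "42", "3.50", "1,000".
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def normalize_numbers(tokens, keep_numbers=False):
    """Group numeric tokens under ANY_NUMBER unless keep_numbers is set."""
    if keep_numbers:
        return list(tokens)
    return ["ANY_NUMBER" if NUMBER.fullmatch(tok) else tok for tok in tokens]


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word list."""
    return [tok for tok in tokens if tok not in stop_words]
```

Adding `"ANY_NUMBER"` to the stop word set passed to `remove_stop_words` removes all grouped numbers, which mirrors the exclusion tip above.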