The Class Performance view (Teneo backstage: Solution tab > Optimization > Class Performance) gives the Teneo Studio user a way of checking the performance of the Machine Learning (ML) model used in a solution and of analyzing which classes conflict with one another.
The evaluation of the model is performed using a Cross Validation (CV) process: all the training data examples of the solution's classes are split into K folds and, for each fold, a Machine Learning model is trained on the remaining K-1 folds. Once training is completed, the performance of that model is evaluated against the held-out fold; the results of all K evaluations are averaged to obtain an estimate of the performance of the model.
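The procedure above can be sketched in plain Python. The `train_fn` and `eval_fn` hooks below are hypothetical stand-ins for the real training and evaluation steps, not Teneo APIs:

```python
import random

def cross_validate(examples, labels, train_fn, eval_fn, k=5, seed=42):
    """Estimate model performance with K-fold Cross Validation.

    train_fn(X, y) returns a trained model and eval_fn(model, X, y)
    returns an accuracy-like score; both are illustrative placeholders.
    """
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)       # random split -> stochastic results
    folds = [indices[i::k] for i in range(k)]  # K roughly equal folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        model = train_fn([examples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(eval_fn(model,
                              [examples[i] for i in held_out],
                              [labels[i] for i in held_out]))
    return sum(scores) / len(scores)           # average over the K evaluations
```

Because the shuffle is random, re-running with a different seed can produce a different estimate, which is why repeated CV executions may disagree.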
It is important to remember that Cross Validation only provides an estimate of the performance of the model, which can only be measured directly with a separate test dataset, and that it is stochastic in nature. This means that different executions of the Cross Validation process may give different results, because the training data is randomly split into folds. Results are likely to be stable for homogeneous classes with a high number of training data examples, while high variance can be a symptom of excessive heterogeneity among the training data of some classes.
The following metrics are used to evaluate the performance. They are standard metrics for measuring the performance of Machine Learning models; an in-depth description can be found on Wikipedia.
- Precision measures the percentage of the predictions marked as positive that really were positive matches; i.e. it measures to which degree one can rely on the classifier having marked as positive only training data that is positive.
- Recall measures the percentage of the actual positive examples that the classifier detected; i.e. it measures how sure one can be that the classifier retrieved all the existing positives in the dataset.
- F1 is the harmonic mean of Precision and Recall. It is usually used as a general measure of classifier performance.
Old executions of Cross Validation are kept for comparison purposes, but historic data has some size and time limitations:
- Failed Cross Validation executions are kept for one week, after which they are removed from the server. If the Studio backend service is restarted while a CV process is running, the process is stopped and marked as failed; in that case, the user has to start the process again.
- There is no time limitation for succeeded CV executions, but the number of executions stored in the server is limited; by default, only the last 20 are kept, but this configuration may be changed in the server.
Confidence Threshold Graph
Whenever a new input arrives at the conversational AI application, the Machine Learning model analyzes that input and generates a set of class predictions, i.e. for each class in the model, it assigns a probability for the input to belong to that class. If the probability value of the most probable class exceeds a solution-wide confidence threshold, an annotation is created for that class and the Intent Trigger depending on that class is triggered.
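The decision described above can be sketched as follows (the function and class names are illustrative, not part of Teneo's API):

```python
def annotate(prediction_scores, threshold):
    """Return the class to annotate, or None if no prediction is confident enough.

    prediction_scores maps each class name to the model's probability
    that the input belongs to it.
    """
    best_class = max(prediction_scores, key=prediction_scores.get)
    if prediction_scores[best_class] >= threshold:
        return best_class  # annotation created; the dependent Intent Trigger may fire
    return None            # below the solution-wide threshold: no annotation
```

With a threshold of 0.45, an input scored `{"GREETING": 0.92, "GOODBYE": 0.05}` would be annotated as GREETING, while `{"GREETING": 0.30, "GOODBYE": 0.30}` would produce no annotation.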
The solution threshold is set up under the assumption that predictions with a very low degree of confidence will most probably be wrong. So the thresholding process can be thought of as a binary classifier that determines whether the predictions of the Machine Learning model are reliable for a given input or not (based only on the prediction confidence). The purpose of this view is to provide a tool to analyze the estimated performance of the classes in the solution with regard to this threshold setting.
The view shows the values of the classification metrics of the thresholding process for each value of the confidence threshold in the [0, 1] range; i.e. it treats the threshold as a binary classifier whose positive examples are accepted predictions of the model (predictions with a confidence above the threshold) and whose negative examples are rejected predictions (predictions with a confidence below the threshold).
In this context, the performance metrics can be interpreted in the following way:
- Precision measures the percentage of the accepted inputs (classifications with a confidence over the threshold) that were rightfully accepted; i.e. they correspond to training data that was correctly classified by the model.
- Recall measures the percentage of the correct inputs (training data that was correctly classified by the model) that were accepted by the threshold.
- F1 has the usual meaning as the harmonic mean between the other two metrics.
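A sketch of how such a curve could be computed from cross-validated predictions (the input format here is an assumption for illustration, not how Teneo stores results):

```python
def threshold_metrics(predictions, thresholds):
    """Precision/recall/F1 of the accept/reject decision at each threshold.

    predictions is a list of (confidence, was_correct) pairs, one per
    evaluated example: the top prediction's confidence and whether that
    prediction matched the example's true class.
    """
    rows = []
    for t in thresholds:
        accepted = [ok for conf, ok in predictions if conf >= t]
        n_correct = sum(1 for conf, ok in predictions if ok)
        tp = sum(accepted)  # accepted AND correct
        precision = tp / len(accepted) if accepted else 1.0
        recall = tp / n_correct if n_correct else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append((t, precision, recall, f1))
    return rows
```

Raising the threshold typically moves precision up and recall down, which is exactly the trade-off the graph visualizes.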
The values on this graph can be used to decide where to set the solution confidence threshold. There is no golden rule to set this value, but when taking the decision, consider the following pieces of advice:
- A high threshold value will reject dubious predictions, so if the solution contains Intent Triggers which depend on the classes, it will be highly improbable for a flow to be triggered mistakenly based on Machine Learning predictions. On the other hand, many correct predictions will be discarded due to low confidence. A high threshold is probably wanted when the consequences of marking a wrong input as positive are worse than those of marking it as negative (a typical example is spam classifiers).
- A low threshold value will make the solution accept more predictions, so one can be confident that, most of the times a flow should be triggered by a Machine Learning prediction, it will be. Conversely, this increases the probability of triggering a flow with an incorrect prediction. One probably wants a low threshold when the consequences of losing one message are worse than those of processing an incorrect one (emergency services would be a typical example).
Setting the threshold implies a trade-off between these two situations; the appropriate value will depend on the particular use case and project.
Class Performance Table
The Class Performance Table shows the performance metrics for each of the classes in the solution, including how many errors correspond to false positives (FP), i.e. predictions where the classifier assigned that class when it should have assigned another, and how many to false negatives (FN), i.e. predictions where the classifier assigned another class when it should have assigned the analyzed one.
The table displays one row for each class and a single row for the average values for all classes. For each row, the following columns are displayed:
- Class name: the name of the class.
- Precision, Recall, F1: the binary classification metrics for the row's class, i.e. predictions assigning the row's class are considered positives and any other predictions negatives.
- Examples: the number of training data examples of that class at the moment the Cross Validation was executed.
- Conflicting classes: the number of mistaken predictions of the model. Those predictions can be either false positives (FP) or false negatives (FN); the arrow at the end of the column unfolds a list of rows inside the cell, each specifying one of the classes that were confused with the row's class, the kind of error, and the percentage of classified training data that suffered from that kind of error for that particular class.
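Per-class FP and FN counts of this kind can be derived from a confusion matrix. A minimal sketch, using illustrative class names:

```python
def class_errors(confusion):
    """Per-class false positives and false negatives from a confusion matrix.

    confusion[true][pred] counts examples of class `true` that the model
    classified as `pred`.
    """
    classes = confusion.keys()
    errors = {}
    for c in classes:
        # FP: examples of other classes wrongly assigned to c
        fp = sum(confusion[other].get(c, 0) for other in classes if other != c)
        # FN: examples of c assigned to some other class
        fn = sum(n for pred, n in confusion[c].items() if pred != c)
        errors[c] = {"FP": fp, "FN": fn}
    return errors
```

Note that every false negative of one class is a false positive of another, which is why two classes that the model confuses show up in each other's conflicting-classes lists.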
All the numeric columns and the class name are sortable, and classes which appear in the current execution but did not exist in the historic ones are marked with a star.
If the user has selected an old Cross Validation to compare with, differences from the current run are displayed as deltas on all the numerical values, with a green background if the metric has improved since the older execution and red otherwise.