Teneo Input Processors
Support for 48 additional languages
The Teneo 6.0 Platform introduces input processing for 48 additional languages, meaning that the Teneo Platform now supports a total of 86 languages.
All the languages listed in the table below are supported using the Standard Input Processor chain, except Korean, which comes with a custom Input Processor chain incorporating a bespoke Morphological Analyzer along with standard splitting, annotation, number recognition, language detection and, of course, Predict.
| Ewe | Kirundi (Rundi) | Oromo | Swahili (Kiswahili) | Zulu (isiZulu) |
Language Detector Input Processor
Teneo 6.0 comes with an updated and improved Machine Learning model for the Language Detector Input Processor. The new model is more accurate, but predicts a language at a lower confidence score compared to the previous one.
Previously, the Language Detector Input Processor only created annotations for predictions with a confidence score above 0.25. Since the new model is somewhat less confident but more accurate, this threshold has been lowered: the Language Detector Input Processor now creates annotations when the prediction confidence is above 0.20 for all languages, with the exception of Arabic (AR), Bengali (BN), Greek (EL), Hebrew (HE), Hindi (HI), Japanese (JA), Korean (KO), Tamil (TA), Telugu (TE), Thai (TH), Chinese (ZH), Vietnamese (VI), Persian (FA) and Urdu (UR). For these languages, annotations are always created, even for predictions below 0.20, since the new model is usually accurate when predicting them, albeit at an even lower confidence score.
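The annotation rule described above can be sketched as follows. This is a minimal illustration of the threshold logic only; the function name and data structures are hypothetical and not part of the actual Input Processor API.

```python
# Languages annotated regardless of confidence, because the new model is
# accurate for them even at low scores.
ALWAYS_ANNOTATE = {"AR", "BN", "EL", "HE", "HI", "JA", "KO",
                   "TA", "TE", "TH", "ZH", "VI", "FA", "UR"}

CONFIDENCE_THRESHOLD = 0.20

def should_annotate(language: str, confidence: float) -> bool:
    """Return True if a LANGUAGE annotation should be created."""
    if language in ALWAYS_ANNOTATE:
        return True
    return confidence > CONFIDENCE_THRESHOLD
```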
For users of existing versions of the Teneo Dialogue Resource, it might be a good idea to remove the predicate script restriction from the condition of the Language Detector Flow (as well as from any other syntax condition using the LANGUAGE.ANNOT Language Object) so that languages are also recognized at lower confidence scores.
In the new version of the Teneo Dialogue Resources released with Teneo 6.0, this is adjusted accordingly.
Standard Simplifier Spanish and Catalan
The letter ñ/Ñ (in both upper and lower case) has been added to the excludeFromCanonicalSimplify property of the Standard Simplifier for the Spanish and Catalan languages to ensure a correct distinction between the letters n/N and ñ/Ñ in solutions in these languages.
Please note that users might need to update conditions if they depend on the former behavior of treating the letters as one.
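The effect of the exclusion can be sketched as follows. This is a hypothetical model of canonical simplification (diacritics folded away except for excluded characters); the real Standard Simplifier's behavior and configuration format may differ.

```python
import unicodedata

# Characters exempt from diacritic folding, mirroring the
# excludeFromCanonicalSimplify property described above.
EXCLUDE_FROM_CANONICAL_SIMPLIFY = {"ñ", "Ñ"}

def simplify(text: str) -> str:
    """Lowercase and strip diacritics, preserving excluded characters."""
    out = []
    for ch in text:
        if ch in EXCLUDE_FROM_CANONICAL_SIMPLIFY:
            out.append(ch.lower())
            continue
        # Decompose the character and drop combining marks (á -> a).
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(base.lower())
    return "".join(out)
```

With this exclusion, `simplify("Año")` keeps the ñ, so conditions can still distinguish "ano" from "año".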
Chinese and Japanese IP improvements
Chinese Tokenizer and Number Recognizer IPs
- Add support for fractions and formal Kanji numbers to the Chinese Number Recognizer.
- Fix tokenization of "impossible" numbers, i.e. those that have factors after the decimal point.
- Add TIME.DATETIME annotation for special numbers that could be time expressions too, e.g. 12点30.
- Add DATE.DATETIME annotation for fractions that could be date expressions too, e.g. 1/2.
- Replace a.m./p.m. with am/pm in input text and separate from numbers.
- Split 零 in the context 5点零5分 because they can never represent a number together. 零 means "zero" only together with other Kanji numbers; with Arabic numbers and in this context it means "and".
- Split expressions with numbers and more than one slash, e.g. 2/3/2020.
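The two new annotations above can be illustrated with a simplified sketch. The regexes and the annotation mechanism are illustrative stand-ins, not the real Input Processors:

```python
import re

# Simplified patterns for tokens that could also be time or date expressions.
TIME_PATTERN = re.compile(r"^\d{1,2}点\d{1,2}$")   # e.g. 12点30
FRACTION_PATTERN = re.compile(r"^\d+/\d+$")        # e.g. 1/2

def annotate_token(token: str) -> list[str]:
    """Return the datetime annotations a token would receive."""
    annotations = []
    if TIME_PATTERN.match(token):
        annotations.append("TIME.DATETIME")
    if FRACTION_PATTERN.match(token):
        annotations.append("DATE.DATETIME")
    return annotations
```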
Backwards compatibility issues
- 兆 used to be a factor with a large value, i.e. 1 trillion, but has been changed to work as a counter instead of a factor, which may cause backwards compatibility issues.
- Numbers with a factor other than 万 or 亿 after the decimal point are not valid numbers and are therefore now split instead of being concatenated. This changes previous behavior, but fixes the previously erroneous number tokenization.
Japanese Number Recognizer and Tokenizer
The Japanese Input Processor chain has been modified to use a bespoke Number Recognizer. The recognized number representations include: Arabic numbers, formal and colloquial Kanji numbers, Hiragana numbers, numbers with counters not split from the actual numeric expression, numbers with factors both larger and smaller than zero, decimal numbers, and fractions.
The Japanese Tokenizer has also been modified to split tokens that contain slashes, dashes, tildes, colons, commas, dots and interpuncts in certain contexts, as detailed below:
- split numbers from slashes and keep the slashes as separate tokens if there are at least two slashes between numbers, e.g. 25/04/2020; but not when there is a single slash between numbers, as that is a fraction the Number Recognizer needs to recognize, e.g. 2/3.
- split numbers from dashes, tildes and interpuncts and keep them as separate tokens, e.g. 25 - 04 - 2020, 25 ~ 04 ~ 2020, 25 ・ 04 ・ 2020.
- split numbers from dots and keep the dots as separate tokens, e.g. 25 . 04 . 2020; but not when there is a single dot between numbers, as in that case the dot could be a decimal marker, e.g. 1.5.
- split numbers from commas and keep the comma as a separate token, e.g. 2 , 3; but not when there are exactly three digits after the comma, as in that case the comma could be a thousands separator, e.g. 1,000.
- split numbers from colons and keep the colon as a separate token, e.g. 10 : 30.
- split special number tokens that Kuromoji doesn't split, e.g. ２人 or ２，３.
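A few of the splitting rules above can be sketched for a single token. This is a simplified illustration; the real JapaneseTokenizer operates on Kuromoji output and handles many more contexts:

```python
import re

def split_token(token: str) -> list[str]:
    """Apply simplified slash/comma/colon splitting rules to one token."""
    # Dates with two or more slashes: split, keeping slashes as tokens.
    if re.fullmatch(r"\d+(/\d+){2,}", token):
        return re.findall(r"\d+|/", token)
    # A single slash between numbers is a fraction: keep whole.
    if re.fullmatch(r"\d+/\d+", token):
        return [token]
    # A comma followed by groups of exactly three digits may be a
    # thousands separator: keep whole.
    if re.fullmatch(r"\d{1,3}(,\d{3})+", token):
        return [token]
    # Otherwise split numbers from commas and colons, keeping them.
    if re.fullmatch(r"\d+[,:]\d+", token):
        return re.findall(r"\d+|[,:]", token)
    return [token]
```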
Backward compatibility issues
The changes made to the JapaneseTokenizer IP break backwards compatibility in several ways:
- some tokens that were not split before are now split, e.g. ２人 or ２，３.
- colons used to be removed and now are kept as tokens, e.g. 10 : 30.
- slashes, dashes, tildes, dots and interpuncts that used to be kept in one token with the numbers are now split, e.g. 25 - 04 - 2020, 25 ~ 04 ~ 2020, 25 ・ 04 ・ 2020.
Adjustments have been made to the Japanese input processor chain to ensure the expected behavior when a carriage return appears in the input next to an emoji. To this end, the carriage return has been added to the list of non-word tokens in the Japanese Tokenizer's properties file.
A minor adjustment has also been made to an annotation property in the Japanese Annotator's properties file, setting its value to 95.
Input Processing for Czech
As part of the development of the new Czech Lexical Resource, described further below, several properties files of the input processing chain for Czech have been reviewed to ensure correct simplification.
Teneo Lexical Resources
Native Date and Time
For the 6.0 release, Teneo has productized, localized and made native the Date & Time functionality originally available for English solutions at Teneo.ai. This means that the Teneo Platform now natively includes understanding and interpretation of date and time expressions for the following languages: Chinese, Danish, Dutch, English, French, German, Italian, Japanese, Norwegian, Portuguese, Spanish, and Swedish.
This work entails taking the already functional Date & Time package available at Teneo.ai and making it available natively within the Platform. For users, this means there is no need to add any external resources to have Date & Time understanding within a solution.
Productizing the functionality also ensures that:
- Naming is consistent with the rest of the Platform
- Development and QA life-cycle is followed to the same extent as the rest of the Platform
- Maintenance and support can be provided for the functionality - including seamless upgrades and new languages.
The Date and Time handling in the Teneo Platform consists of three parts:
- Specific date and time Language Objects in the Teneo Lexical Resources (to understand date and time expressions)
- The new DateTime Recognizer Input Processor in the input processor chains of relevant languages (to understand and annotate date and time expressions that aren't easily detected by Language Objects)
- A post-processing handler for interpreting the results of the date time input processor within the solution.
SUPPORT Language Objects
A new Language Object suffix, SUPPORT, is introduced in the Teneo Lexical Resources as of the Teneo 6.0 release. Language Objects of this type are only ever meant to be used internally by the system (to support conditions of other Language Objects). As these SUPPORT Language Objects are not intended for use within the solution, they will not be included as suggestions when using the Auto-complete functionality or the Condition Building assistant in the condition editor. Users can still find them with the Search functionality.
The support Language Objects for the Sentiment, Abusive Language and Date and Time Language Objects have been renamed and now use this new suffix. For the Sentiment and Abusive Language Language Objects, the name change follows the pattern:
For updated names of Date and Time Language Objects, please refer to the separate Appendix - DateTime Language Object name map.
Czech Lexical Resource
A Teneo Lexical Resource (TLR) has been built for Czech. The Teneo Lexical Resources are Artificial Solutions’ proprietary resources containing off-the-shelf building blocks (Language Objects and Entities) to be used for the modelling of NLI solutions in Teneo Studio. Lexical Resources are a simple tick away in Teneo Studio and they give access to thousands of Language Objects and Entities ready to use as condition building blocks.
The Teneo 6.0 Platform release adds sentiment detection for Chinese. This means the Teneo Platform can now detect Sentiment and Intensity of user inputs for: Chinese, Dutch, English, German, Italian and Swedish.
Language Objects from the Sentiment Resources as well as Abusive Language Resources (available in English) are now moved into the main Lexical Resources.
This means that, when upgrading to the Teneo 6.0 Lexical Resource, any separate Sentiment Lexical Resource and Abusive Language Lexical Resource must be unassigned from the solution. For new solutions, only the main Lexical Resource needs to be assigned.
Deprecated Language Objects
The Teneo 6.0 release of Lexical Resources includes deprecated Language Objects in French, Portuguese, and Spanish.
Teneo Dialogue Resources
For this release, the Teneo Dialogue Resources have been re-worked and adapted to work seamlessly in a Master-Local setup. This means that Global elements, such as Order groups, Contexts, Variables and Emotions, have aligned document Ids in all Dialogue Resources. Thanks to this, a Teneo Dialogue Resource can safely be used in a Master solution, and a different Teneo Dialogue Resource can be used in a Local solution, without the risk of duplicating the global elements.
Two new Scripted Contexts have been added to all Teneo Dialogue Resources:
- Input Sentences: the Context is fulfilled if the number of sentences in the user input matches selected state(s)
- Input Words: the Context is fulfilled if the number of words in the user input matches selected state(s).
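The two Scripted Contexts can be sketched as follows. This is a hypothetical model of the counting logic only; the actual Contexts use the engine's own sentence splitting, and the function names are illustrative:

```python
import re

def count_sentences(user_input: str) -> int:
    # Naive split on terminal punctuation, for illustration.
    return len([s for s in re.split(r"[.!?]+", user_input) if s.strip()])

def count_words(user_input: str) -> int:
    return len(user_input.split())

def input_words_context(user_input: str, selected_states: set[int]) -> bool:
    """The Context is fulfilled if the word count matches a selected state."""
    return count_words(user_input) in selected_states
```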
All Classes in the Teneo Dialogue Resources now have a "TDR_" prefix added to their name.
The Language Detector Flows have been adapted to work with the new Language Detector model, now also recognizing languages at lower confidence levels.
The Timeout Flow has been removed; instead, a separate %$_TIMEOUT Trigger has been added to the Greeting message Flow in all Dialogue Resources.