Teneo Developers

Chinese Input Processors Chain

Introduction

An Input Processor (IP) pre-processes inputs so that the Teneo Engine can perform different processes on them, such as normalization and tokenization. Each language supported by the Teneo Platform has a chain of Input Processors that know how to process that particular language.

Input Processors Chain setup

The following graph displays the Input Processors chain for Chinese:

graph TD
  subgraph ips [ ]
    tokenizer[Chinese Tokenizer] --> annotator
    annotator[Chinese Annotator] --> number
    number[Chinese Numbers] --> annotation
    annotation[System Annotation] --> languagedetect
    languagedetect[Language Detector] --> predict
  end
  input([User Input]) --User Gives Input--> tokenizer
  predict[Predict] --Parsed Input--> parsed([To Dialog Processing])
  classDef ip_optional stroke-dasharray:5,5;
  classDef external fill:#00000000,stroke-dasharray:5,5;
  class solution,settings external;

The Input Processors are listed below, each with a short description of its functionality; the following sections go into further detail.

  • The Chinese Tokenizer IP first converts the user input to Simplified characters and then splits it into words and sentences.
  • The Chinese Annotator IP performs a morphological analysis on the user input sentences and words and annotates them to provide morphological information in addition to what the Tokenizer provides as words and their Part-of-Speech (POS) tags.
  • The Chinese Numbers IP identifies and annotates the numbers present in the user input to make it easier for the final user to write conditions that depend on numbers.
  • The System Annotation IP sets a number of annotations, based on properties of the user input text.
  • The Language Detector IP identifies the language of the input sentence and annotates it with the predicted language and the confidence score of the prediction.
  • The Predict IP classifies user input based on a machine learning model trained in Teneo Learn and annotates the user input with the predicted top intent classes and a confidence score.

Chinese Simplifier

The Chinese Simplifier is a special kind of processor that is used to normalize the user input by:

  • converting full width Latin letters and Arabic digits into their half width version, and
  • lowercasing the uppercased Latin letters.

This Simplifier is special because it is not run as part of the Input Processor chain, but rather by the Tokenizer when it puts the tokens into a Teneo data structure.

Additionally, the Simplifier is also run by the condition parser inside Teneo Engine, which normalizes the Language Object condition words before adding them to the internal Engine dictionary.
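The two normalization steps described above can be sketched in a few lines. This is a minimal illustration only, not the Engine's implementation; the function name `simplify` is hypothetical, and the sketch assumes that only full-width Latin letters and digits (not full-width punctuation) are folded.

```python
# Distance between a full-width ASCII variant (U+FF01..U+FF5E) and its half-width form.
FULLWIDTH_OFFSET = 0xFEE0


def simplify(text: str) -> str:
    """Fold full-width Latin letters and digits to half width, then lowercase Latin letters."""
    out = []
    for ch in text:
        code = ord(ch)
        is_fullwidth_digit = 0xFF10 <= code <= 0xFF19
        is_fullwidth_upper = 0xFF21 <= code <= 0xFF3A
        is_fullwidth_lower = 0xFF41 <= code <= 0xFF5A
        if is_fullwidth_digit or is_fullwidth_upper or is_fullwidth_lower:
            ch = chr(code - FULLWIDTH_OFFSET)
        if "A" <= ch <= "Z":
            ch = ch.lower()
        out.append(ch)
    return "".join(out)
```

Because the same normalization is applied to both the input tokens and the condition words, a condition written with half-width lowercase letters can match a full-width or uppercase input.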

Chinese Tokenizer IP

The Chinese Tokenizer Input Processor is the first of the input processors to be run on Chinese user inputs; it essentially does two things: first, it converts traditional Mandarin Chinese characters into simplified, and secondly, it tokenizes the converted user input and generates sentences based on the tokens.

Traditional-to-simplified conversion

The conversion of traditional characters into simplified characters is done via a one-to-one character mapping. This mapping is configured via two properties of the Chinese Tokenizer IP:

  • A list of characters: traditionalCharacters.file.name
  • The mappings of traditional characters to simplified characters: traditionalSimplifiedMappings.file.name.

After the conversion to simplified Mandarin Chinese, the user input is segmented into words and sentences.
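A one-to-one character mapping of this kind can be sketched as a translation table. The three-entry dictionary below is a hypothetical excerpt standing in for the configured mapping files, whose actual format is not described here, and `to_simplified` is an illustrative name.

```python
# Hypothetical excerpt of a traditional-to-simplified mapping (assumption: the
# real mapping is loaded from the files named in the Tokenizer properties).
TRADITIONAL_TO_SIMPLIFIED = {"萬": "万", "買": "买", "東": "东"}


def to_simplified(text: str, mapping=TRADITIONAL_TO_SIMPLIFIED) -> str:
    """Replace each mapped traditional character; leave all other characters unchanged."""
    return text.translate(str.maketrans(mapping))
```

Since every character is replaced independently of its neighbors, the conversion never changes the length of the input.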

Chinese tokenization

The Chinese Tokenizer splits the user input words via a statistical model and a user dictionary. The words specified in the user dictionary are guaranteed to be segmented as such by the Tokenizer.

The user dictionary has a static component, which is specified as a configuration file via the property dictionary, and a dynamic component, which is collected from the language objects defined in a user solution that have conditions of type DICTEXT_word_POStag.

The Chinese Tokenizer passes the Part-of-Speech (POS) tags it generates to the user as annotations, using a configuration file that maps Panda-generated POS tags to annotations, e.g. NN=NN.POS.

The Tokenizer also uses a configuration property called nonWordTokens to specify which characters should not be output as tokens, e.g. punctuation, brackets, etc.

Name | Type | Required | Default
nonWordTokens | string | no | ""“” 『』'「」()[]{}()〔〕[]{}〈〉《》!!??…,,、。.;;::.

The last step in the tokenization process is the splitting of the user input tokens into sentences. For this, the Tokenizer uses another configuration property called sentenceDelimiters to know which characters mark sentence boundaries.

Name | Type | Required | Default
sentenceDelimiters | string | no | 。!?…!?..
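The interplay of the two properties can be sketched as follows. This is an illustration under assumptions: the character sets are small subsets of the defaults, the function name `split_sentences` is hypothetical, and the order in which the real Tokenizer filters and splits is not specified here.

```python
# Assumed subsets of the nonWordTokens and sentenceDelimiters defaults.
NON_WORD_TOKENS = {"。", "!", "?", "…", ",", "、"}
SENTENCE_DELIMITERS = {"。", "!", "?", "…"}


def split_sentences(tokens):
    """Group tokens into sentences and drop tokens that should not be output as words."""
    sentences, current = [], []
    for token in tokens:
        if token not in NON_WORD_TOKENS:
            current.append(token)
        # A delimiter closes the current sentence, if it has any words.
        if token in SENTENCE_DELIMITERS and current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

In this sketch a delimiter closes a sentence even though it is itself removed from the output, and a trailing run of tokens without a final delimiter still forms a sentence.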

The Chinese Tokenizer does not split decimal numbers around the decimal marker; instead, it concatenates the split tokens into one token, which makes it easier to identify and annotate decimal numbers later in the processing chain.

Note that numbers with a factor other than 万 or 亿 after the decimal point are not treated as numbers and are therefore split rather than concatenated. This is a change from the behavior in versions prior to Teneo 6.

Chinese Annotator IP

The Chinese Annotator Input Processor includes a range of analyzers which treat specific morphological phenomena of Chinese. In general, three operations can be performed by the morphological analyzers:

  • Annotation (addition of one or more morphological annotations)
  • Change of the base form property
  • Concatenation of multiple tokens.

The morphological analyzers are applied in a fixed order; the table below shows the current sequence of analyzers, along with the operations that are performed by them. In the following sections more details are provided for each individual analyzer.

Analyzer | Example | Annotation | Base form change | Concatenation
1. VNotV Analyzer | 是-不-是 | Yes | Yes | Yes
2. Verb Analyzer | 吃-完, 跑-上 | Yes | Yes | Yes
3. Reduplication Analyzer | 红-红 | Yes | Yes | Yes
4. Loc Analyzer | 桌子-上 | Yes | Yes | Yes
5. Aspect Analyzer | 吃 了, 坐 着 | Yes | No | No
6. Negation Analyzer | 不 吃 | Yes | No | No
7. SC Analyzer | 洗了一个澡 | Yes | Yes | No
8. Affix Analyzer | 我们, 标准化 | Yes | Yes | No

VNotV Analyzer

The VNotV Analyzer concatenates and analyzes V-Not-V sequences.

In V-Not-V structures, the same verb occurs twice with a negation word (不, 没, 否) between the two occurrences:

  1. 你 去-不-去 买 东西?
    Nǐ qù-bù-qù mǎi dōngxī?
    You go-VNOTV.BU-go buy things
    ‘Do you go shopping?’

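The V-Not-V structure has two uses: it forms direct yes/no questions, as in the example above, and embedded (indirect) questions, as in the following example: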

  2. 我 不 知道 他 去-不-去 买 东西。
    Wǒ bù zhīdào tā qù-bù-qù mǎi dōngxī.
    I NEG know he go-VNOTV.BU-go buy things
    ‘I don’t know whether he goes shopping.’

In the case of bi-syllabic words, the second syllable of the first verb might be deleted:

  3. 你 喜(欢)-不-喜欢 买 东西 ?
    Nǐ xǐ(huān)-bù-xǐhuān mǎi dōngxī?
    You like-VNOTV.BU-like buy things
    ‘Do you like shopping?’

The VNotV Analyzer concatenates the three tokens. It assigns one structural annotation (prefixed with VNOTV) signaling the negation form. The base form of the resulting token is set to the full form of the verb. Thus, example 3 above, with second-syllable deletion, is analyzed as follows:

  • The three tokens are concatenated into one word 喜不喜欢
  • This word gets the base form 喜欢
  • Additionally, it gets the annotation VNOTV.BU.
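The analysis above can be sketched as a small function. This is a simplified illustration, not the analyzer itself: the name `analyze_vnotv` is hypothetical, the annotation suffixes for 没 and 否 are assumptions (only VNOTV.BU appears in this section), and a real analyzer would also consult POS tags.

```python
# Suffixes for 没 and 否 are assumed; only the 不 case is given in the text.
NEGATION_TAGS = {"不": "BU", "没": "MEI", "否": "FOU"}


def analyze_vnotv(first, negator, second):
    """Concatenate a V-Not-V triple, or return None if the tokens do not form one."""
    tag = NEGATION_TAGS.get(negator)
    # The first copy may be the full verb or only its first syllable(s).
    if tag is None or not first or not second.startswith(first):
        return None
    return {
        "word": first + negator + second,
        "base_form": second,  # the full form of the verb
        "annotations": ["VNOTV." + tag],
    }
```

Using the second verb as the base form covers both the full pattern (去不去) and the shortened one (喜不喜欢), since the second occurrence is never truncated.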

Verb Analyzer

The Verb Analyzer performs analysis of resultative and directional compounds. Resultative compounds consist of one main verb and one resultative suffix:

  4. 小王 吃-完 了。
    Xiǎowáng chī-wán le.
    Xiaowang eat-RESULT ASPECT
    ‘Xiaowang finished eating.’

Directional compounds consist of one main verb and one or two directional suffixes:

  5. 阿明 跑-上 楼梯 了。
    Āmíng pǎo-shàng lóutī le.
    Aming run-DIR.NONDEICTIC.SHANG stairs ASPECT
    ‘Aming ran up the stairs.’

  6. 阿明 跑-上-去 了。
    Āmíng pǎo-shàng-qù le.
    Aming run-DIR.NONDEICTIC.SHANG-DIR.DEICTIC.QU ASPECT
    ‘Aming ran up.’

The combination of the main verb with the resultative/directional complements is concatenated. The base form of the resulting token is changed to the base form of the main verb. The token is assigned the annotations associated with the resultative/directional suffixes.

Annotations for resultative suffixes carry the prefix RESULT. Annotations for directional suffixes carry the prefix DIR. Additionally, we distinguish between deictic (DIR_DEICTIC…) and non-deictic (DIR_NONDEICTIC…) directional complements. Cases with two directional suffixes are limited to a non-deictic complement followed by a deictic complement.

Reduplication Analyzer

The Reduplication Analyzer analyzes reduplications of verbs, adjectives and adverbs:

  7. a. 红-红
    hóng-hóng
    red-red
    ‘very red’

    b. 讨论-讨论
    tǎolùn-tǎolùn
    discuss-discuss
    ‘to discuss a little’

Reduplication of adjectives manifests some variability in the distribution of the syllables. Specifically, some adjectives exhibit the following asymmetric reduplication patterns:

  8. a. AABB:
    干净 干-干-净-净
    gānjìng gān-gān-jìng-jìng
    clean clean-clean
    ‘clean very clean’

    b. ABB:
    雪白 雪白白
    xuěbái xuě-bái-bái
    white white-white
    ‘white very white’

    c. AAB:
    逛街 逛逛街
    guàngjiē guàng-guàng-jiē
    walk street walk-walk-street
    ‘walk street go window shopping’

In verbal reduplication, the particles 一 and 了 can occur between the two copies:

  9. 看-一-看
    kàn-yī-kàn
    look-one-look
    ‘to take a look’

If the two words are segmented in the original tokenization, they are concatenated by the Reduplication Analyzer. The reduplicated word gets the annotation REDUP.
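The reduplication patterns shown in the examples can be recognized purely from the character sequence, as the following sketch illustrates. It is a simplification under assumptions: the function name `analyze_reduplication` is hypothetical, the base form is assumed to be the non-reduplicated word, and a real analyzer would additionally need POS or lexicon checks, since a purely string-based test would also accept ordinary words such as 妈妈.

```python
def analyze_reduplication(word):
    """Return (base_form, 'REDUP') if the word matches a reduplication pattern, else None."""
    base = None
    if len(word) == 2 and word[0] == word[1]:
        base = word[0]                      # AA: 红红
    elif len(word) == 3:
        if word[0] == word[2] and word[1] in "一了":
            base = word[0]                  # A一A / A了A: 看一看
        elif word[1] == word[2]:
            base = word[:2]                 # ABB: 雪白白
        elif word[0] == word[1]:
            base = word[1:]                 # AAB: 逛逛街
    elif len(word) == 4:
        if word[0] == word[1] and word[2] == word[3]:
            base = word[0] + word[2]        # AABB: 干干净净
        elif word[:2] == word[2:]:
            base = word[:2]                 # ABAB: 讨论讨论
    return (base, "REDUP") if base is not None else None
```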

Loc Analyzer

The Loc Analyzer performs concatenation and analysis of noun + localizer combinations; localizers follow nouns and ‘transform’ them into locative nouns:

  10. 桌子-上
    table-LOC.ON.SHANG
    ‘on the table’

Localizers form a closed set; the table below shows the mapping from localizers to their annotations.

Form of localizer | Annotation
上 | LOC_ON_SHANG
下 | LOC_UNDER_XIA
里 | LOC_INSIDE_LI
内 | LOC_INSIDE_NEI
外 | LOC_OUTSIDE_WAI
前 | LOC_BEFORE_QIAN
后 | LOC_BEHIND_HOU
旁 | LOC_NEXTTO_PANG
中 | LOC_IN_ZHONG

The Loc Analyzer concatenates the noun + localizer combination into one word and assigns it an annotation with the label of the localizer. The base form of the resulting token is set to the base form of the noun. Thus, example 10 is analyzed as follows:

  • The two tokens are concatenated into one word 桌子上.
  • This word gets the base form 桌子.

Additionally, it is assigned the annotation LOC_ON_SHANG.
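Because localizers form a closed set, the analysis amounts to a table lookup followed by concatenation. The sketch below illustrates this; the function name `analyze_loc` is hypothetical, and the characters paired with each annotation are inferred from the pinyin in the annotation names, so they should be read as an assumption.

```python
# Localizer characters inferred from the pinyin part of each annotation name.
LOCALIZER_ANNOTATIONS = {
    "上": "LOC_ON_SHANG", "下": "LOC_UNDER_XIA", "里": "LOC_INSIDE_LI",
    "内": "LOC_INSIDE_NEI", "外": "LOC_OUTSIDE_WAI", "前": "LOC_BEFORE_QIAN",
    "后": "LOC_BEHIND_HOU", "旁": "LOC_NEXTTO_PANG", "中": "LOC_IN_ZHONG",
}


def analyze_loc(noun, localizer):
    """Concatenate noun + localizer; the base form stays the noun."""
    annotation = LOCALIZER_ANNOTATIONS.get(localizer)
    if annotation is None:
        return None
    return {"word": noun + localizer, "base_form": noun, "annotations": [annotation]}
```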

Aspect Analyzer

The Aspect Analyzer analyzes aspect markers. Chinese has both pre-verbal and post-verbal aspect markers:

  11. a. 她 正在 吃。
    Tā zhèngzài chī.
    she ASPECT eat.
    ‘She is eating.’

    b. 她 吃 了。
    Tā chī le.
    she eat ASPECT
    ‘She ate.’

Marker | Aspect | Annotation | Position
了 | Perfective | ASPECT_PERFECTIVE_LE | Postverbal
着 | Progressive | ASPECT_PROGRESSIVE_ZHE | Postverbal
过 | Experiential | ASPECT_EXPERIENTIAL_GUO | Postverbal
在 | Progressive | ASPECT_PREVERBAL_PROGRESSIVE_ZAI | Preverbal
正在 | Progressive | ASPECT_PREVERBAL_PROGRESSIVE_ZHENGZAI | Preverbal

The table above displays the set of aspect markers analyzed by the Aspect Analyzer.

The Aspect Analyzer attaches the respective annotation of the aspect marker to the main verb.

Negation Analyzer

The Negation Analyzer analyzes negations of adverbs, verbs and adjectives. In these cases, the negation particle can immediately precede the negated word:

  12. a. 我 没 去。
    I NEG.MEI go
    ‘I didn’t go’

    b. 不 容易
    NEG.BU easy
    ‘not easy’

    c. 不 太
    NEG.BU too
    ‘not too’

The negation particle can also be separated from the verb by additional material:

  13. 别 这么 做。
    Bié zhème zuò.
    NEG.BIE this do
    ‘Don’t do this.’

The set of currently analyzed negation words is shown in the table below.

Form of negator | Annotation
不 | NEG_BU
否 | NEG_FOU
没 | NEG_MEI, ASPECT_PERFECTIVE
没有 | NEG_MEIYOU, ASPECT_PERFECTIVE
别 | NEG_BIE, MODE_IMPERATIVE
不太 | NEG_BUTAI
并不 | NEG_BINGBU
不怎么 | NEG_BUZENME

Three of the negation particles (没, 没有, 别) have two annotations; the second annotation contains aspectual or mode information that is implied by the particle. The Negation Analyzer attaches an annotation to the negated word, which contains the corresponding annotation of the negation particle as well as the particle's index in the sentence. An additional annotation is attached to the negated word if the negation particle carries aspect or mode information.

For example, in example 12a above, the verb 去 receives two annotations: {‘NEG_MEI’, 1} and ASPECT_PERFECTIVE.
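The annotation scheme can be sketched for the simplest case, where the negation particle immediately precedes the negated word. The function name `annotate_negation` and the returned data layout are illustrative assumptions; the case where additional material separates particle and verb is not handled.

```python
# First entry: negation annotation; remaining entries: implied aspect or mode.
NEGATORS = {
    "不": ["NEG_BU"],
    "否": ["NEG_FOU"],
    "没": ["NEG_MEI", "ASPECT_PERFECTIVE"],
    "没有": ["NEG_MEIYOU", "ASPECT_PERFECTIVE"],
    "别": ["NEG_BIE", "MODE_IMPERATIVE"],
}


def annotate_negation(words):
    """Map the index of each negated word to the annotations attached to it."""
    annotations = {}
    for index, word in enumerate(words[:-1]):
        tags = NEGATORS.get(word)
        if tags:
            target = annotations.setdefault(index + 1, [])
            target.append((tags[0], index))  # negation annotation plus particle index
            target.extend(tags[1:])          # extra aspect or mode annotation, if any
    return annotations
```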

SC Analyzer

The SC Analyzer analyzes splittable compounds. Splittable verb-object compounds (SCs) are verb-object combinations with an idiomatic meaning, e.g. 担-心 (worry+heart = ‘to worry’), 生-气 (create+air = ‘to get angry’), 见-面 (see+face = ‘to meet someone’). They allow various kinds of syntactic material to be inserted between verb and object, e.g. aspect markers, additional objects, demonstratives, etc.:

  14. a. Aspect marker:
    我们 见- 了 -面
    we see- ASPECT -face
    ‘We met.’

    b. Additional object:
    帮- 她 一个 -忙
    help- she one -affair
    ‘to help her’

    c. Nominal modifier:
    见 他 的 面
    see- he DEG -face

The set of SCs is large and diverse. Although it is difficult to exhaustively enumerate all SCs, the most common instances are captured in a list that currently contains 163 compounds. Once the SC Analyzer identifies a verb in a splittable compound, it goes forward in the sentence and looks for a valid SC object for this verb. While looking, it checks with each subsequent word whether the sequence following the verb is still a valid splitting sequence. If it arrives at a suitable object before the sequence becomes invalid, it attaches an annotation to the verb. This annotation carries two pieces of information: the tag of the splittable compound (SPLIT_pinyin of compound) as well as the index of the dependent object. Further, the base form of the verb is set to the base form of the splittable compound.

Thus, in example 14a above, the verb 见 receives the annotation {SPLIT_JIANMIAN, 3}, and its base form is set to 见面.
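The forward search described above can be sketched as follows. Everything specific in this sketch is an assumption made for illustration: the two-entry compound list, the tag SPLIT_BANGMANG (formed by analogy with SPLIT_JIANMIAN), the set of words allowed between verb and object, and the function name `analyze_sc`. The real analyzer validates the splitting sequence in a way that is not detailed here.

```python
# Hypothetical excerpt of the compound list: (verb, object) -> annotation tag.
SPLIT_COMPOUNDS = {("见", "面"): "SPLIT_JIANMIAN", ("帮", "忙"): "SPLIT_BANGMANG"}
# Hypothetical stand-in for the check that the sequence is still a valid split.
ALLOWED_BETWEEN = {"了", "过", "着", "一个", "他", "她", "的"}


def analyze_sc(words):
    """Find the first verb of a splittable compound whose object follows validly."""
    for i, verb in enumerate(words):
        objects = {obj: tag for (v, obj), tag in SPLIT_COMPOUNDS.items() if v == verb}
        if not objects:
            continue
        for j in range(i + 1, len(words)):
            if words[j] in objects:
                return {
                    "verb_index": i,
                    "annotation": (objects[words[j]], j),  # tag plus index of the object
                    "base_form": verb + words[j],
                }
            if words[j] not in ALLOWED_BETWEEN:
                break  # the sequence is no longer a valid splitting sequence
    return None
```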

Affix Analyzer

The Affix Analyzer analyzes inflectional and derivational suffixes. Chinese only has one inflectional suffix, that is the plural suffix -们, which can be attached to human nouns/pronouns:

  15. a. 老师-们
    teacher-PLURAL
    ‘the teachers’

    b. 我-们
    me-PLURAL
    ‘we’

Additionally, Chinese has a set of derivational suffixes which change the part of speech of the word to which they are attached. For example, the suffix -者 is attached to verbs, and the resulting combination is a noun and denotes the actor of the base form verb:

  16. 使用-者
    shǐ-yòng(-)zhě
    use-ACTOR.ZHE
    ‘the user’

A suffixed word gets the corresponding annotation of its suffix, and the base form of the word is changed to the base form without the suffix. Thus, 使用者 in example 16 is analyzed as follows:

  • 使用者 gets the annotation ACTOR_ZHE
  • 使用者 gets the base form 使用.

The table below displays the set of tags used by the Affix Analyzer.

Form of affix | Annotation | Example
-于 | COMPARATIVE_YU | 高于 (两米)
-度 | PROPERTY_DU | 精确度
-性 | PROPERTY_XING | 流线性
-化 | TRANSFORM_HUA | 现代化
-者 | ACTOR_ZHE | 使用者
-师 | ACTOR_SHI | 设计师
-员 | ACTOR_YUAN | 操作员
可- | ABILITY_KE | 可上升
-们 | PLURAL_MEN | 老师们
-城 | CITY_CHENG | 北京城
-市 | CITY_SHI | 上海市
-省 | PROVINCE_SHENG | 河北省
-儿 | RCOLORING_ERHUA | 好玩儿
-于 (word contains the suffix and has a base form of at least 2 characters) | PREP_YU | 致力于

Chinese Numbers IP

The Chinese Numbers Recognizer Input Processor simplifies conditioning against numbers and numeric expressions in Teneo Studio solutions and provides the following functionalities:

  • Normalization of tokens containing numeric values into Hindu-Arabic numerals
  • Creation of a NUMBER annotation with a numericValue variable, which has type BigDecimal and contains a representation of the normalized number
  • Creation of an annotation with the name of the normalized number value
  • Annotation of inexact numbers (i.e. numbers containing the characters 几, 数, 余 or 多) with the annotation INEXACT.

The Chinese Numbers Recognizer Input Processor leaves the tokenization unmodified: it neither concatenates neighboring numeric expressions nor splits the numeric parts of a token from its non-numeric parts. It does, however, identify and annotate tokens that contain numeric subparts; for example, for the token “三点” the normalized numeric value is 3. Furthermore, it works with decimal factored numbers like 5.5万 or 1.2亿 and supports fractions and formal Hanzi numerals.

Numeric normalization

Numeric string normalization is applied to substrings of the input string. The normalized values are used in the creation of annotations; the input string itself remains unmodified. The following normalization steps are applied by the Chinese Numbers IP:

  • Hindu-Arabic numerals remain unchanged
  • Hanzi numerals are normalized to their Hindu-Arabic numeric value
  • Mixed Hanzi/Hindu-Arabic numerals are normalized to Hindu-Arabic numerals.
Input token | Normalized Numeric Value
10 | 10
3.14 | 3.14
一 | 1
一点 | 1
两百 | 200
三百万五千 | 3005000
3百万5千 | 3005000
三百五 | 350
一万零一 | 10001
一万〇一 | 10001

The table above shows examples of normalization; the last three rows show that even more colloquial numeric expressions, such as “三百五”, are handled correctly.
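To make the normalization rules concrete, the following sketch converts purely numeric tokens into a decimal value. It is an illustration under assumptions, not the algorithm used by the Input Processor: the function name `normalize_number` is hypothetical, and tokens with non-numeric parts (such as 一点), fractions, formal numerals and stacked large units (such as 万亿) are out of scope.

```python
from decimal import Decimal

DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10**4, "亿": 10**8}


def normalize_number(token: str) -> Decimal:
    total = Decimal(0)    # sections already closed by 万 or 亿
    section = Decimal(0)  # value accumulated since the last 万 or 亿
    number = None         # digits read but not yet attached to a unit
    scale = 1             # weight of a trailing digit: 三百五 is 3*100 + 5*10
    i = 0
    while i < len(token):
        ch = token[i]
        if ch.isascii() and ch.isdigit():
            # Read a run of Hindu-Arabic digits, including a decimal point.
            j = i
            while j < len(token) and token[j].isascii() and (token[j].isdigit() or token[j] == "."):
                j += 1
            number = Decimal(token[i:j])
            i = j
            continue
        if ch in DIGITS:
            if number is None and DIGITS[ch] == 0:
                scale = 1  # a zero after a unit cancels the colloquial shortcut
            else:
                number = (number or 0) * 10 + DIGITS[ch]  # positional reading
        elif ch in SMALL_UNITS:
            unit = SMALL_UNITS[ch]
            section += (1 if number is None else number) * unit
            number, scale = None, unit // 10
        elif ch in BIG_UNITS:
            unit = BIG_UNITS[ch]
            section += number or 0
            total += (section or 1) * unit
            section, number, scale = Decimal(0), None, unit // 10
        else:
            raise ValueError(f"not a numeric character: {ch!r}")
        i += 1
    if number is not None:
        section += number * scale
    return total + section
```

The `scale` variable is what makes colloquial forms work: after 百 a trailing digit counts tens, so 三百五 yields 350, whereas the zero in 一万零一 resets the scale so that the final 一 counts as a unit.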

The NUMBER annotation

The NUMBER annotation allows for conditioning on the presence of numbers in user inputs without the need to specify any number explicitly. The Teneo Studio user only needs to use the NUMBER annotation in the condition. For example:

tlml

%I_WANT.PHR + %$NUMBER + %PRODUCT.LIST

The numeric value can also be retrieved using a listener and used later in the flow. The listings below show how numeric value retrieval is done.

tlml

%$NUMBER + %PRODUCT.LIST

groovy

// index of the first word used by the match
int numberAnnotIndex = (_.usedWordIndices as List)[0]

// find the NUMBER annotation that points to that word
def numberAnnot = _.inputAnnotations.getByName('NUMBER').find {
    numberAnnotIndex in it.getWordIndices()
}

// store the value in the flow variable numProducts
numProducts = numberAnnot.getVariables()['numericValue'] as int

The numeric value can also be retrieved using an NLU variable:

tlml

%I_WANT.PHR + %$NUMBER^{someVariable=lob.numericValue} + %PRODUCTS.LIST

The normalized number annotation

The normalized number annotation is just the numeric value of the NUMBER annotation as an annotation itself. This allows the Teneo Studio user to condition against specific numbers, without the need to specify all the different surface variants. Thanks to the traditional-to-simplified Chinese character conversion done in the Chinese Tokenizer IP, even traditional numeric Hanzi characters match.

The table below shows examples of normalized number annotations.

Condition | Matching inputs
%$2 | '2', '两', '二', '2', ...
%$10000 | '10000', '万', '萬', '一万', '一〇〇〇〇', '10000', ...
%$3.14 | '3.14', '三.一四', ...
%$350 | '350', '三百五', '三五〇', ...
%$1234 | '1234', '1234', '一二三四', '一千两百三十四', ...

Date and Time annotations

The TIME.DATETIME and DATE.DATETIME annotations are created in the Teneo Platform for numbers which could be either time or date expressions. For example, 五点零零 creates the annotation TIME.DATETIME with the values hour: 5 and minute: 0, and 1/2 creates the annotation DATE.DATETIME with the values month: 1 and day: 2.

To read more about the native understanding and interpretation of date and time expressions in the Teneo Platform, please see here.

System Annotation IP

The System Annotation Input Processor, shared among the different languages of the Teneo Platform, performs simple analysis of the sentence text to set some annotations. The decision algorithms are configurable by various properties. Further customization is possible by sub-classing this Input Processor and overriding one or more of the methods decideBinary, decideBrackets, decideEmpty, decideExclamation, decideNonsense, decideQuestion, decideQuote.

This IP works on the sentences passed in, but does not modify them.

Other considerations

Extra request parameters read by this input processor: (none)

Processing options read by this input processor: (none)

Annotations this input processor may generate:

  • _EMPTY: the sentence text is empty
  • _EXCLAMATION: the sentence text contains at least one of the characters specified with property exclamationMarkCharacters
  • _EM3: the sentence text contains three or more characters in a row of the characters specified with property exclamationMarkCharacters
  • _QUESTION: the sentence text contains at least one of the characters specified with property questionMarkCharacters
  • _QT3: the sentence text contains three or more characters in a row of the characters specified with property questionMarkCharacters
  • _QUOTE: the sentence text contains at least one of the characters specified with property quoteCharacters
  • _DBLQUOTE: the sentence text contains at least one of the characters specified with property doubleQuoteCharacters
  • _BRACKETPAIR: the sentence text contains at least one matching pair of the bracket characters specified with property bracketPairCharacters
  • _NONSENSE: the sentence probably contains nonsense text as configured with properties consonants, nonsenseThreshold.absolute and nonsenseThreshold.relative
  • _BINARY: the sentence text only contains characters specified by properties binaryCharacters (at least one of them) and binaryIgnoredCharacters (zero or more of them).

Special System annotations

Two special annotations related not to individual inputs, but to whole dialogues, are added by the Teneo Engine itself:

  • _INIT: indicates session start, i.e. the first input in a dialogue
  • _TIMEOUT: indicates the continuation of a previously timed-out session/dialogue.

Several configuration properties are available for the System Annotation Input Processor; please see the details here.

Language Detector IP

The Language Detector Input Processor uses a machine learning model that predicts the language of a given input and adds an annotation of the format %${language label}.LANG to the input as well as a confidence score of the prediction.

Language Detector annotation

The Language Detector IP can predict the following 45 languages (language label in brackets):

Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).

Serbian, Bosnian and Croatian are treated as one language, under the label SR_HR and Indonesian and Malay are treated as one language, under the label ID_MS.

A number of regexes are also used by the Input Processor to keep the model from predicting a language for fully numerical inputs, URLs or other types of nonsense input.

The Language Detector provides an annotation when the prediction confidence is above the threshold of 0.2. For Arabic (AR), Bengali (BN), Greek (EL), Hebrew (HE), Hindi (HI), Japanese (JA), Korean (KO), Tamil (TA), Telugu (TE), Thai (TH), Chinese (ZH), Vietnamese (VI), Persian (FA) and Urdu (UR), however, language annotations are always created, even for predictions below 0.2, since the Language Detector is mostly accurate when predicting them.
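The threshold rule can be summarized in a short sketch. The function name `language_annotation` is hypothetical, and the sketch only covers the decision of whether an annotation is created, not the model or the regular expressions.

```python
# Languages that are always annotated, regardless of the confidence score.
ALWAYS_ANNOTATED = {"AR", "BN", "EL", "HE", "HI", "JA", "KO",
                    "TA", "TE", "TH", "ZH", "VI", "FA", "UR"}
CONFIDENCE_THRESHOLD = 0.2


def language_annotation(label, confidence):
    """Return the annotation name for a prediction, or None if it is suppressed."""
    if label in ALWAYS_ANNOTATED or confidence > CONFIDENCE_THRESHOLD:
        return f"{label}.LANG"
    return None
```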

Predict IP

The Predict Input Processor makes use of a machine learning model generated in the Teneo Learn component when machine learning classes are available in a Teneo Studio solution. The Predict IP uses the model to annotate each user input with the machine learning classes defined.

Whenever the Predict IP receives a user input, the Input Processor calculates a confidence score for each of the classes based on the model, creating annotations for the most confident class and for each other class that matches the following criteria:

  • the confidence is above the minimum confidence (defaults to 0.01)
  • the confidence is higher than 0.5 times the confidence value of the top class.

The Predict Input Processor will create a maximum of 5 annotations, regardless of how many classes match the criteria. The numerical thresholds can be configured in the properties file of the Input Processor.
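The selection criteria can be sketched as follows. The function name `select_classes` and its parameter names are illustrative, with defaults taken from the values stated above; the sketch assumes the limit of 5 applies to the number of selected classes.

```python
def select_classes(scores, min_confidence=0.01, similarity=0.5, max_annotations=5):
    """Pick the top class plus every other class that passes both thresholds."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_score = ranked[0][1]
    selected = [ranked[0]]  # the most confident class is always kept
    for name, score in ranked[1:]:
        if score > min_confidence and score > similarity * top_score:
            selected.append((name, score))
    return selected[:max_annotations]
```

With a top confidence of 0.7, for instance, a class scoring 0.4 is kept because it exceeds 0.5 x 0.7 = 0.35, while a class scoring 0.3 is dropped.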

Predict annotations

For each selected class, an annotation with the name <CLASS_NAME>.INTENT will be created, with the value of the model confidence in the class. A special annotation <CLASS_NAME>.TOP_INTENT is also created for the class with the highest score.

Configuration properties

Name | Type | Required | Default
minConfidenceSimilarityDistance | float | no | 0.5

Fraction of the top class's confidence that a class must reach in order to be considered; e.g. if the top class has a confidence of 0.7, classes with a confidence lower than 0.5 x 0.7 = 0.35 are discarded.

Name | Type | Required | Default
maxNumberOfAnnotations | int | no | 5

Maximum number of class annotations to create for each user input.

Name | Type | Required | Default
minConfidenceThreshold | float | no | 0.01

Minimum confidence the model must have for a class in order to add it as one of the candidate annotations.

Name | Type | Required | Default
intent.model.file.name | string (filename) | no | inexistent

Name of the file containing the machine learning model. It is usually set automatically by Teneo Studio, so no configuration is required.

Custom Input Processor configuration