Korean Input Processors Chain

Introduction

An Input Processor (IP) pre-processes inputs so that the Teneo Engine can perform different processes on them, such as normalization, tokenization or spelling correction. Each language supported by the Teneo Platform has a chain of Input Processors which know how to process that particular language. This page details the Input Processors chain for the Korean language.

IP Chain Setup

The following graph displays the setup of the Korean Input Processors chain; each Input Processor is described further in the following sections.

graph TD
  subgraph ips [ ]
    splitting[Standard Splitting] --> morphological[Korean Morphological Analyzer]
    morphological[Korean Morphological Analyzer] --> annotation[System Annotation]
    annotation[System Annotation] --> number[Basic Number Recognizer]
    number[Basic Number Recognizer] --> languagedetect[Language Detector]
    languagedetect[Language Detector] --> predict[Predict]
  end
  input([User Input]) --User gives input--> splitting
  predict[Predict] --Parsed input--> parsed([To Dialog Processing])

Korean Simplifier

The Korean Simplifier is a special kind of processor that is used to normalize the user input by:

converting full width Latin letters and Arabic digits into their half width version, and
lowercasing the uppercase Latin letters.

This Simplifier is special because it is not run as part of the Input Processors chain, but rather by the Tokenizer when it puts the tokens generated by Komoran into a Teneo data structure. The Simplifier is also run by the condition parser inside the Teneo Engine, which normalizes the Language Object syntax words before adding them to the internal Engine dictionary.
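The two normalization steps can be sketched as follows. This is an illustrative approximation, not the Engine's implementation: the function name `simplify` is hypothetical, and the sketch assumes that "full width" refers to the Unicode fullwidth forms (U+FF10 to U+FF19 for digits, U+FF21 to U+FF3A and U+FF41 to U+FF5A for Latin letters), which sit at a fixed offset of 0xFEE0 from their ASCII counterparts.

```python
FULLWIDTH_OFFSET = 0xFEE0  # distance between a fullwidth form and its ASCII counterpart


def simplify(text: str) -> str:
    """Hypothetical sketch of the Korean Simplifier's two steps."""
    result = []
    for ch in text:
        code = ord(ch)
        is_fullwidth_digit = 0xFF10 <= code <= 0xFF19
        is_fullwidth_upper = 0xFF21 <= code <= 0xFF3A
        is_fullwidth_lower = 0xFF41 <= code <= 0xFF5A
        # Step 1: full width Latin letters and Arabic digits -> half width
        if is_fullwidth_digit or is_fullwidth_upper or is_fullwidth_lower:
            ch = chr(code - FULLWIDTH_OFFSET)
        # Step 2: lowercase uppercase Latin letters only (Hangul has no case)
        if "A" <= ch <= "Z":
            ch = ch.lower()
        result.append(ch)
    return "".join(result)


print(simplify("ＴＥＮＥＯ １２３ 안녕하세요 ABC"))
```

Hangul characters pass through unchanged because they fall outside the ranges handled by either step.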

Input Processors

Standard Splitting

The Standard Splitting Input Processor splits the user input text into sentences and words; this Input Processor generates one or more sentences with zero or more words. The generated WordData objects contain the original and the simplified form of the word; the final word form is initialized with the simplified word form.
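The relationship between the three word forms can be pictured with a small data class. Only the name WordData comes from the description above; the field names `original`, `simplified` and `final` are assumptions made for illustration, and the actual Engine class is likely to differ.

```python
from dataclasses import dataclass, field


@dataclass
class WordData:
    """Hypothetical sketch of a word produced by Standard Splitting."""
    original: str    # the word exactly as typed by the user
    simplified: str  # the word after simplification
    final: str = field(init=False)  # may be changed by later processing

    def __post_init__(self):
        # The final word form starts out as the simplified form.
        self.final = self.simplified


word = WordData(original="ＴＥＮＥＯ", simplified="teneo")
print(word.final)
```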

Korean Morphological Analyzer

The Korean Morphological Analyzer runs Komoran on every sentence from the user input as provided by the Standard Splitting Input Processor. Komoran returns the root of every word in the sentence as well as a tag that contains both Part-of-Speech (POS) and morphological information. The Korean Morphological Analyzer then converts the root, the Part-of-Speech and the morphological information of every word into Teneo annotations.

The table below lists how the tags from Komoran are mapped to annotations in Teneo.

Komoran tag	Description	Map to Teneo annotation(s)
VA	Adjective	ADJ.POS
JKG	Adnominal case marker	JKG.MST
MAG	Adverb	ADV.POS
JKB	Adverbial case marker	JKB.MST
VCP	Affirmation/positive	VCP.MST
NA	Analytical Category	NA.MST
JX	Auxiliary postpositional particle	JX.MST
VX	Auxiliary predicate element	VX.MST
NNB	Bound noun	NN.POS, NNB.MST
SN	Cardinal number	CARDINAL.POS
JKC	Complement case marker	JKC.MST
JC	Conjunctive postpositional particle	JC.MST
MAJ	Connective adverb	ADV.POS, CONNECTIVE.MST
EC	Connective ending	EC.MST
VCN	Denial/negative	VCN.MST
MM	Determiner	DET.POS
ET	Ending of a word	ET.MST
ETM	Ending of a word	ETM.MST
SL	Foreign word	FOREIGN.POS
SH	Foreign word	FOREIGN.POS
IC	Interjection	INTERJ.POS
NNG	Noun	NN.POS
NF	Noun estimation category	NF.MST
NR	Numeral	NR.MST
JKO	Object case marker	JKO.MST
JKQ	Postposition, postpositional particle	JKQ.MST
EP	Pre-final ending	EP.MST
XPN	Prefix	XPN.MST
NP	Pronoun	PRON.POS
NNP	Proper Noun	NN.POS, PROPER.POS
SF	Punctuation	PUNCT.POS
SP	Punctuation	PUNCT.POS
SS	Punctuation	PUNCT.POS
SE	Punctuation	PUNCT.POS
SO	Punctuation	PUNCT.POS
XR	Root	XR.MST
EF	Sentence-closing ending	EF.MST
JKS	Subject case marker	JKS.MST
XSN	Suffix	XSN.MST
XSV	Suffix	XSV.MST
XSA	Suffix	XSA.MST
SW	Symbol	SYM.POS
VV	Verb	VB.POS
NV	Verb estimation category	NV.MST
JKV	Vocative case marker	JKV.MST
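The mapping in the table amounts to a lookup from a Komoran tag to one or more Teneo annotation names. The sketch below shows this with a subset of the rows. The function name `annotate_morphemes`, the `(root, tag)` input format and the choice to return an empty list for unlisted tags are all assumptions for illustration; the sketch does not call Komoran itself.

```python
# Subset of the Komoran-to-Teneo mapping table; the remaining rows follow the same pattern.
KOMORAN_TO_TENEO = {
    "VA": ["ADJ.POS"],
    "MAG": ["ADV.POS"],
    "MAJ": ["ADV.POS", "CONNECTIVE.MST"],
    "NNG": ["NN.POS"],
    "NNB": ["NN.POS", "NNB.MST"],
    "NNP": ["NN.POS", "PROPER.POS"],
    "NP": ["PRON.POS"],
    "VV": ["VB.POS"],
    "JKS": ["JKS.MST"],
    "JKO": ["JKO.MST"],
    "JKB": ["JKB.MST"],
    "SN": ["CARDINAL.POS"],
    "SF": ["PUNCT.POS"],
}


def annotate_morphemes(tagged):
    """Map hypothetical (root, tag) pairs to (root, annotation names)."""
    return [(root, KOMORAN_TO_TENEO.get(tag, [])) for root, tag in tagged]


# Hand-written example input; a real analysis would come from Komoran.
for root, names in annotate_morphemes([("서울", "NNP"), ("에", "JKB")]):
    print(root, names)
```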

System Annotation

Teneo bundles two default collections of annotations in all language configurations: standard annotations, added by the System Annotation Input Processor, and special system annotations, added by the Engine. The System Annotation Input Processor performs a simple analysis of the sentence texts and may generate the standard annotations listed below.

Annotation	Description
_BINARY	The input consists of only 0s and 1s
_BRACKETPAIR	At least one matching pair of brackets appears in the input; possible bracket types: ( ), [ ], { }
_EXCLAMATION	At least one exclamation mark (!) appears in the input
_EM3	Three (or more) exclamation marks (!!!) appear in a row in the input
_EMPTY	The input contains no text / the sentence text is empty
_NONSENSE	The input contains nonsense text, e.g., 'asdf', 'wgwwgwg', 'xxxxxx'
_QUESTION	At least one question mark (?) appears in the input
_QT3	Three (or more) question marks (???) appear in a row in the input
_QUOTE	At least one single quotation mark (') appears in the input
_DBLQUOTE	At least one quotation mark (") appears in the input
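Most of the standard annotations in the table are simple surface checks and can be approximated in a few lines. The sketch below is an illustration only: the function names are hypothetical, leading and trailing whitespace is assumed to be ignored, a "matching pair" is assumed to mean an opening bracket followed somewhere later by its closing counterpart, and _NONSENSE is left out because the table does not describe how nonsense text is detected.

```python
import re

BRACKET_PAIRS = [("(", ")"), ("[", "]"), ("{", "}")]


def has_bracket_pair(text):
    """True if an opening bracket is later followed by its closing bracket."""
    for opening, closing in BRACKET_PAIRS:
        start = text.find(opening)
        if start != -1 and text.find(closing, start + 1) != -1:
            return True
    return False


def system_annotations(sentence):
    """Return the set of standard annotation names for one sentence text."""
    text = sentence.strip()
    if not text:
        return {"_EMPTY"}
    checks = {
        "_BINARY": re.fullmatch(r"[01]+", text) is not None,
        "_BRACKETPAIR": has_bracket_pair(text),
        "_EXCLAMATION": "!" in text,
        "_EM3": "!!!" in text,
        "_QUESTION": "?" in text,
        "_QT3": "???" in text,
        "_QUOTE": "'" in text,
        "_DBLQUOTE": '"' in text,
    }
    return {name for name, matched in checks.items() if matched}


print(sorted(system_annotations("정말요???")))
```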

Special System Annotations

The following two special annotations are set by the Teneo Engine. They relate to whole dialogues rather than to individual inputs and depend on the session state.

Annotation	Description
_INIT	Indicates session start, i.e., the first input in a dialogue
_TIMEOUT	Indicates the continuation of a previously timed-out session/dialogue

Basic Number Recognizer

The Basic Number Recognizer identifies all Arabic numbers of the type 123 and 3.14 in the user input and annotates each of them with an annotation associated with a variable which holds the actual numeric value of the number found.
The Basic Number Recognizer is language-dependent: each language has its own configuration defining the decimal point characters and the thousands separator character to be ignored.

Annotation	Variable	Description
NUMBER	numericValue	Annotation created for identified Arabic numbers in user inputs

For the annotation and its numeric value variable to be added, a number in the user input must meet the following syntax:

It must match the regular expression:

[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+

It must be parseable by Java's BigDecimal to ensure it is a number.

The above syntax provides the following guarantees:

The sign is not included in the annotated token.
The numericValue variable contains a BigDecimal representation of the number.

In the above example regex, the dot is used as the decimal marker and the comma as the thousands separator; as described earlier, this configuration is language-dependent and therefore varies with the selected solution language.
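The two conditions can be tried out with the sketch below, which uses the example regular expression with the dot as the decimal marker and the comma as the ignored thousands separator. Python's `Decimal` stands in for Java's BigDecimal here, and the function name `recognize_number` and the dictionary it returns are hypothetical; only the annotation name NUMBER and the variable name numericValue come from the table above.

```python
import re
from decimal import Decimal, InvalidOperation

# The example regular expression from this section.
NUMBER_PATTERN = re.compile(r"[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+")


def recognize_number(token):
    """Return a NUMBER annotation for the token, or None if it is not a number."""
    # Condition 1: the whole token must match the regular expression.
    if NUMBER_PATTERN.fullmatch(token) is None:
        return None
    # Condition 2: it must parse as a decimal once the thousands separators are ignored.
    try:
        value = Decimal(token.replace(",", ""))
    except InvalidOperation:
        return None
    return {"annotation": "NUMBER", "numericValue": value}


for token in ["123", "3.14", "1,234.5", "-5", "12a"]:
    print(token, recognize_number(token))
```

The token "-5" is rejected as a whole because the pattern has no place for a sign, which is consistent with the guarantee that the sign is not part of the annotated token.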

Language Detector

The Language Detector uses a machine learning model to predict the language of a given user input and adds an annotation to the input, as shown in the table below, together with the confidence score of the prediction.

Annotation	Variable	Description
<language label>.LANG, e.g., %$DA.LANG	Confidence	Annotation created for the predicted language

The Language Detector can predict the following 45 languages; the language label used to create the annotation name is in brackets:

Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).

Serbian, Bosnian and Croatian are treated as one language under the label SR_HR, and Indonesian and Malay are treated as one language under the label ID_MS.

The Input Processor also uses a number of regexes that prevent the model from predicting a language for fully numerical inputs, URLs or other types of nonsense input.

The Language Detector provides an annotation when the confidence of the prediction is above the threshold of 0.2. For the following languages, however, language annotations are always created (even for predictions below 0.2), since the Language Detector is mostly accurate when predicting them: Arabic, Bengali, Greek, Hebrew, Hindi, Japanese, Korean, Tamil, Telugu, Thai, Chinese, Vietnamese, Persian and Urdu.
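The threshold rule can be summarized in a short sketch. It assumes the model has already produced a language label and a confidence score; the function name `language_annotation` and its return shape are hypothetical, while the 0.2 threshold, the list of always-annotated languages and the `<language label>.LANG` naming come from this section.

```python
CONFIDENCE_THRESHOLD = 0.2

# Labels of the languages that are annotated regardless of confidence.
ALWAYS_ANNOTATED = {
    "AR", "BN", "EL", "HE", "HI", "JA", "KO",
    "TA", "TE", "TH", "ZH", "VI", "FA", "UR",
}


def language_annotation(label, confidence):
    """Return (annotation name, confidence), or None if no annotation is created."""
    if label in ALWAYS_ANNOTATED or confidence > CONFIDENCE_THRESHOLD:
        return (f"{label}.LANG", confidence)
    return None


print(language_annotation("KO", 0.1))  # Korean is always annotated
print(language_annotation("DA", 0.1))  # Danish below the threshold
print(language_annotation("DA", 0.5))  # Danish above the threshold
```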

Predict

The Predict Input Processor makes use of an intent model generated when classes are available in a Teneo Studio solution to annotate user inputs with the defined classes; intent models can be generated either with Teneo Learn or CLU. Note that as of Teneo 7.3, deferred intent classification is applied and annotations are only created by Predict if references to class annotations are found during the input matching process.

When Predict receives a user input, confidence scores are calculated for each class based on the model. Annotations are then created for the most confident class and for every other class that meets both of the following criteria:

the confidence is above the minimum confidence (defaults to 0.01)
the confidence is higher than 0.5 times the confidence value of the top class.

For each selected class, an annotation with the scheme <CLASS_NAME>.INTENT is created. It carries the model's confidence in the class, a variable specifying the classifier used (i.e., Learn, CLU or LearnFallback) and an Order variable giving the rank of the class among the selected classes (i.e., 0 for the class with the highest confidence score, up to 4 for the selected class with the lowest confidence score).
A special annotation <CLASS_NAME>.TOP_INTENT is created for the class with the highest confidence score.

Annotation	Variable	Variable	Variable	Description
<CLASS_NAME>.TOP_INTENT	classifier	confidence		Annotation created for the class with the highest confidence score
<CLASS_NAME>.INTENT	classifier	confidence	Order	Annotation given to each selected class with a maximum of five top classes

The Predict Input Processor creates a maximum of 5 annotations, regardless of how many classes match the criteria.
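The selection rules above can be sketched as follows, given a mapping from class names to confidence scores. This is an approximation under stated assumptions: the function name `select_intents` and the dictionary layout are hypothetical, the limit of five is assumed to apply to the <CLASS_NAME>.INTENT annotations (with <CLASS_NAME>.TOP_INTENT created in addition), and deferred intent classification is not modeled.

```python
def select_intents(scores, classifier="Learn", min_confidence=0.01,
                   ratio=0.5, max_classes=5):
    """Return the annotations Predict would create for the given class scores."""
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_name, top_confidence = ranked[0]
    # The top class is always selected; the others must pass both criteria.
    selected = [ranked[0]]
    for name, confidence in ranked[1:]:
        if confidence > min_confidence and confidence > ratio * top_confidence:
            selected.append((name, confidence))
    selected = selected[:max_classes]

    annotations = [{"name": f"{top_name}.TOP_INTENT",
                    "classifier": classifier,
                    "confidence": top_confidence}]
    for order, (name, confidence) in enumerate(selected):
        annotations.append({"name": f"{name}.INTENT",
                            "classifier": classifier,
                            "confidence": confidence,
                            "Order": order})
    return annotations


example = {"BOOK_FLIGHT": 0.6, "CANCEL": 0.35, "GREETING": 0.2, "OTHER": 0.005}
for annotation in select_intents(example):
    print(annotation)
```

In this example, CANCEL is selected because 0.35 exceeds half of the top confidence (0.3), whereas GREETING at 0.2 does not.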