Turkish Input Processors Chain

Introduction

An Input Processor (IP) pre-processes user inputs for the Teneo Engine, for example by normalizing and tokenizing them. Each language supported by the Teneo Platform has its own chain of input processors that know how to process that particular language.

IP Chain Setup

The following graph displays the setup of the Turkish Input Processors chain; each Input Processor is described further in the following sections.

graph TD
  subgraph ips [ ]
    subgraph turkishanalyzer [Turkish Analyzer]
      normalization[Normalization]
      splitting[Sentence splitting]
      tokenization[Tokenization]
      POS[Part-of-Speech and Morphological annotation]
    end
    annotation[SystemAnnotation] --> number[BasicNumberRecognizer]
    number[BasicNumberRecognizer] --> languagedetect[LanguageDetector]
    languagedetect[LanguageDetector] --> predict[Predict]
  end
  input([User Input]) --User Gives Input--> normalization --> splitting --> tokenization --> POS --> annotation
  predict[Predict] --Parsed Input--> parsed([To Dialog Processing])
  classDef contained stroke-dasharray:5,2;
  class normalization,splitting,tokenization,POS contained;
  classDef analyzer stroke:#2f286e,fill:#ffffff;
  class turkishanalyzer analyzer;

Standard Simplifier

The Standard Simplifier is a separate processing unit which is not an input processor and which provides a method to normalize some text, usually - but not necessarily - a word. Here, "normalization" means removal of text properties that are semantically insignificant, like conversion to lower case (considering the configured language locale), removal of some accents and normalization of Unicode combining characters. By default, the Input Processors call the Simplifier when they generate a new word item. Furthermore, the Simplifier is called by the language condition parser of the Teneo Engine when it stores a language condition word (i.e., TLML syntax word) in the solution dictionary.

Simplification
The Simplifier decomposes and normalizes characters, for example by lowercasing them and normalizing Unicode combining characters.
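
The kind of normalization described above can be illustrated with a minimal Python sketch. It is not the Teneo implementation: the function name `simplify` and its `turkish` parameter are hypothetical, and the sketch only covers locale-aware lowercasing and composition of Unicode combining characters. Accent removal is left out because the source does not specify which accents are dropped.

```python
import unicodedata


def simplify(text: str, turkish: bool = True) -> str:
    """Hypothetical stand-in for the Simplifier; not the Teneo implementation."""
    # Compose combining sequences first, e.g. "e" + U+0301 becomes "é".
    text = unicodedata.normalize("NFC", text)
    if turkish:
        # Turkish casing rules: "I" lowercases to dotless "ı", "İ" to "i".
        text = text.replace("I", "ı").replace("İ", "i")
    return text.lower()
```

With the Turkish rules enabled, `simplify("ISPARTA")` yields `ısparta`, whereas a locale-unaware lowercasing would yield `isparta`.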

Input Processors

Turkish Analyzer

The Turkish Analyzer is based on Zemberek and performs the following tasks:

User input normalization,
Sentence splitting,
Tokenization, and
Part-of-Speech (POS) and morphological annotations.

Normalization, Sentence Splitting and Tokenization

The Turkish Analyzer performs normalization on user inputs and, furthermore, will segment the input into sentences, tokenize and analyze the morphological structure of each token in the context of the sentence.

This means that each sentence will be normalized by the Turkish Analyzer, i.e., the sentence will be lowercased and, in some cases, typos will be fixed. Unlike other Teneo input processors, the API method getOriginal() on a word object will return the normalized form (which might be different from the simplified form) as the normalization happens before the tokenization.
This directly affects the exact option: for other languages it works on the ORIGINAL form, but for Turkish, users need to be aware that it operates on the normalized strings.
The original user input is not modified and can be retrieved with getUserInputText().

A sentence in Turkish is an instance of the TurkishSentence class, which implements the SentenceI interface from the engine-input-processor-api. The method getText() of the class TurkishSentence returns the normalized sentence text. A direct caller of the input processor chain can retrieve the original sentence text with the method getRawSentence() by casting the sentence object to a TurkishSentence. The original sentence text cannot be accessed via the engine scripting API.

The sentence indices point to the characters in the original user input string. The word indices point to the characters in the sentence, i.e. the normalized sentence string.
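
The two index frames can be illustrated with a small Python sketch. The input string, the index values and the normalized text are made up for illustration; they are not output of the Turkish Analyzer.

```python
user_input = "Merhaba! NASILSIN?"

# Sentence indices refer to positions in the original user input.
sentence_begin, sentence_end = 9, 18
raw_sentence = user_input[sentence_begin:sentence_end]

# Hypothetical normalized text of that sentence (lowercased for Turkish).
normalized_sentence = "nasılsın?"

# Word indices refer to positions in the normalized sentence,
# not to positions in the original user input.
word_begin, word_end = 0, 8
word = normalized_sentence[word_begin:word_end]

print(raw_sentence)
print(word)
```

Here the word starts at index 0 of its sentence although the sentence itself starts at index 9 of the user input.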

POS and Morphological Annotations

The Turkish Analyzer also annotates user inputs with POS and morphological information. Each word will be annotated with its lemma, if available. A lemma annotation contains the POS tag as an annotation variable pos:<string>.

The morphological information will be returned as annotations for the three different types that Zemberek returns with the following suffixes:

.POS: primary part-of-speech tag of the entire token
.POS/.NER: secondary part-of-speech tag (mix of entities/POS tags) based on the stem of the token
.MST: morphosyntactic information based on the morphemes of the token

The MST annotations all have the annotation variable surface=<string> that contains the substring of the surface form of that morpheme in the word, if available.

The table below lists how the tags from Zemberek are mapped to annotations in Teneo; the available ANNOT Language Objects are documented with the Turkish Lexical Resource.

Zemberek Type	Zemberek Tag	Map to annotations
POS	Noun	NN.POS
POS	Adj	ADJ.POS
POS	Adv	ADV.POS
POS	Conj	CC.POS
POS	Interj	INTERJ.POS
POS	Verb	VB.POS
POS	Pron	PRON.POS
POS	Num	NUMERAL.POS
POS	Det	DET.POS
POS	Postp	POST_POSITIVE.POS
POS	Ques	INTERROG.POS
POS	Dup	DUPLICATOR.POS
POS	Punc	PUNCT.POS
POS2	Demons	DEMOS.POS
POS2	Time	TIME.NER
POS2	Quant	QUANTITATIVE.POS
POS2	Ques	INTERROG.POS
POS2	Prop	PROPER.POS
POS2	Pers	PERS.POS
POS2	Reflex	REFLEXIVE.POS
POS2	Ord	ORDINAL.POS
POS2	Card	CARDINAL.POS
POS2	Percent	PERCENT.NER
POS2	Ratio	RATIO.NER
POS2	Range	RANGE.NER
POS2	Dist	DIST.NER
POS2	Clock	CLOCK.NER
POS2	Date	DATE.NER
POS2	Email	EMAIL.NER
POS2	Url	URL.NER
POS2	Mention	MENTION.NER
POS2	HashTag	HASHTAG.NER
POS2	Emoticon	EMOTICON.NER
POS2	RegAbbrv	ABBREVIATION.NER
POS2	Abbrv	ABBREVIATION.NER
MST	Noun	NN.MST
MST	Adj	ADJ.MST
MST	Adv	ADV.MST
MST	Conj	CC.MST
MST	Interj	INTERJ.MST
MST	Verb	VB.MST
MST	Pron	PRON.MST
MST	Num	NUMERAL.MST
MST	Det	DET.MST
MST	Postp	POST_POSITIVE.MST
MST	Ques	INTERROG.MST
MST	Dup	DUPLICATOR.MST
MST	Punc	PUNCT.MST
MST	A1sg	1STPERSON.MST, SG.MST
MST	A2sg	2NDPERSON.MST, SG.MST
MST	A3sg	3RDPERSON.MST, SG.MST
MST	A1pl	1STPERSON.MST, PL.MST
MST	A2pl	2NDPERSON.MST, PL.MST
MST	A3pl	3RDPERSON.MST, PL.MST
MST	Pnon	NO_POSESSION.MST
MST	P1sg	POSS_1STPERSON.MST, POSS_SG.MST
MST	P2sg	POSS_2NDPERSON.MST, POSS_SG.MST
MST	P3sg	POSS_3RDPERSON.MST, POSS_SG.MST
MST	P1pl	POSS_1STPERSON.MST, POSS_PL.MST
MST	P2pl	POSS_2NDPERSON.MST, POSS_PL.MST
MST	P3pl	POSS_3RDPERSON.MST, POSS_PL.MST
MST	Nom	NOMINATIVE.MST
MST	Dat	DATIVE.MST
MST	Acc	ACCUSATIVE.MST
MST	Abl	ABLATIVE.MST
MST	Loc	LOCATIVE.MST
MST	Ins	INSTRUMENTAL.MST
MST	Gen	GENITIVE.MST
MST	Equ	EQUATIVE.MST
MST	Dim	DIMINUTIVE.MST
MST	Ness	NESS.MST
MST	With	WITH.MST
MST	Without	WITHOUT.MST
MST	Related	RELATED.MST
MST	JustLike	JUST_LIKE.MST
MST	Rel	RELATION.MST
MST	Agt	AGENTIVE.MST
MST	Become	BECOME.MST
MST	Acquire	ACQUIRE.MST
MST	Ly	LY.MST
MST	Caus	CAUSATIVE.MST
MST	Recip	RECIPROCAL.MST
MST	Reflex	REFLEXIVE.MST
MST	Able	ABILITY.MST
MST	Pass	PASSIVE.MST
MST	Inf1	INFINITIVE1.MST
MST	Inf2	INFINITIVE2.MST
MST	Inf3	INFINITIVE3.MST
MST	ActOf	ACT_OF.MST
MST	PastPart	PART_PAST.MST
MST	NarrPart	PART_NARRATIVE.MST
MST	FutPart	PART_FUTURE.MST
MST	PresPart	PART_PRESENT.MST
MST	AorPart	PART_AORIST.MST
MST	NotState	NOT_STATE.MST
MST	FeelLike	FEEL_LIKE.MST
MST	EverSince	EVER_SINCE.MST
MST	Repeat	REPEAT.MST
MST	Almost	ALMOST.MST
MST	Hastily	HASTILY.MST
MST	Stay	STAY.MST
MST	Start	START.MST
MST	AsIf	AS_IF.MST
MST	While	WHILE.MST
MST	When	WHEN.MST
MST	SinceDoingSo	SINCE_DOING_SO.MST
MST	AsLongAs	AS_LONG_AS.MST
MST	ByDoingSo	BY_DOING_SO.MST
MST	Adamantly	ADAMANTLY.MST
MST	AfterDoingSo	AFTER_DOING_SO.MST
MST	WithoutHavingDoneSo	WITHOUT_HAVING_DONE_SO.MST
MST	WithoutBeingAbleToHaveDoneSo	WITHOUT_BEING_ABLE_TO_DO_SO.MST
MST	Zero	ZERO.MST
MST	Cop	COP.MST
MST	Neg	NEGATIVE.MST
MST	Unable	UNABLE.MST
MST	Pres	PRESENT.MST
MST	Past	PAST.MST
MST	Narr	NARRATIVE.MST
MST	Cond	CONDITION.MST
MST	Prog1	PROGRESSIVE1.MST
MST	Prog2	PROGRESSIVE2.MST
MST	Aor	AORIST.MST
MST	Fut	FUTURE.MST
MST	Imp	IMPERATIVE.MST
MST	Opt	OPTATIVE.MST
MST	Desr	DESIRE.MST
MST	Neces	NECESSITY.MST
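
As an illustration of how the mapping table can be applied, the following Python sketch maps a list of Zemberek MST tags to Teneo annotation names. Only a few rows of the table are included; the names `MST_MAP` and `map_mst_tags` are hypothetical, and skipping unknown tags is an assumption rather than documented behavior.

```python
# A small excerpt of the table above: Zemberek MST tag -> Teneo annotations.
MST_MAP = {
    "Noun": ["NN.MST"],
    "A1sg": ["1STPERSON.MST", "SG.MST"],
    "P3pl": ["POSS_3RDPERSON.MST", "POSS_PL.MST"],
    "Dat": ["DATIVE.MST"],
    "Past": ["PAST.MST"],
}


def map_mst_tags(tags):
    """Return the annotation names for a sequence of Zemberek MST tags."""
    annotations = []
    for tag in tags:
        # Assumption: tags missing from the table produce no annotation.
        annotations.extend(MST_MAP.get(tag, []))
    return annotations


print(map_mst_tags(["Noun", "A1sg", "Dat"]))
```

Note that a single Zemberek tag such as A1sg produces two annotations, one for person and one for number.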

System Annotation

Teneo bundles two default collections of annotations in all language configurations: standard annotations added by the System Annotation Input Processor and special system annotations added by the Engine. The System Annotation Input Processor performs a simple analysis of the sentence texts and may generate the standard annotations listed below.

Annotation	Description
_BINARY	The input consists of only 0s and 1s
_BRACKETPAIR	At least one matching pair of brackets appears in the input; possible bracket types: ( ), [ ], { }
_EXCLAMATION	At least one exclamation mark (!) appears in the input
_EM3	Three (or more) exclamation marks (!!!) appear in a row in the input
_EMPTY	The input contains no text / the sentence text is empty
_NONSENSE	The input contains nonsense text, e.g., 'asdf', 'wgwwgwg', 'xxxxxx'
_QUESTION	At least one question mark (?) appears in the input
_QT3	Three (or more) question marks (???) appear in a row in the input
_QUOTE	At least one single quotation mark (') appears in the input
_DBLQUOTE	At least one quotation mark (") appears in the input
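
Most of the checks in the table are simple enough to sketch in a few lines of Python. The sketch below is an approximation based only on the descriptions above, not the actual input processor: the function name `system_annotations` is hypothetical, _NONSENSE is omitted because its detection rules are not described, and treating _BINARY as "only the characters 0 and 1 after trimming whitespace" is an assumption.

```python
import re

# Matches one pair of round, square or curly brackets.
BRACKET_PAIR = re.compile(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}")


def system_annotations(sentence: str) -> set:
    """Approximate the standard annotations for one sentence text."""
    found = set()
    if sentence.strip() == "":
        found.add("_EMPTY")
        return found
    if re.fullmatch(r"[01]+", sentence.strip()):
        found.add("_BINARY")
    if BRACKET_PAIR.search(sentence):
        found.add("_BRACKETPAIR")
    if "!" in sentence:
        found.add("_EXCLAMATION")
    if "!!!" in sentence:
        found.add("_EM3")
    if "?" in sentence:
        found.add("_QUESTION")
    if "???" in sentence:
        found.add("_QT3")
    if "'" in sentence:
        found.add("_QUOTE")
    if '"' in sentence:
        found.add("_DBLQUOTE")
    return found
```

For instance, `system_annotations("Really???")` returns both `_QUESTION` and `_QT3`, since an input with three question marks in a row also contains at least one question mark.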

Special System Annotations

The following two special annotations are set by the Teneo Engine. These special system annotations are not related to individual inputs but rather to whole dialogues and are dependent on the session state.

Annotation	Description
_INIT	Indicates session start, i.e., the first input in a dialogue
_TIMEOUT	Indicates the continuation of a previously timed-out session/dialogue

Basic Number Recognizer

The Basic Number Recognizer identifies all Arabic numbers of the type 123 and 3.14 in the user input and annotates each of them with an annotation associated with a variable which holds the actual numeric value of the number found.
The Basic Number Recognizer is language dependent and each language has its own configuration defining the decimal point characters and the thousands separator character to be ignored.

Annotation	Variable	Description
NUMBER	numericValue	Annotation created for identified Arabic numbers in user inputs

For the annotation and its numeric value variable to be added, a number in the user input must meet the following syntax:

It must match the regular expression:

[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+

It must be parseable by Java's BigDecimal to ensure it is a number

The above syntax provides the following guarantees:

The sign is not included in the annotated token
The numericValue variable contains a BigDecimal representation of the number.

In the above example regex, the dot is used as the decimal marker and the comma as the thousands separator; as described earlier, this configuration is language dependent and therefore varies depending on the selected solution language.
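
The two requirements can be sketched in Python as follows. This is an illustration, not the Teneo implementation: Python's `Decimal` stands in for Java's BigDecimal, the function name `recognize_numbers` is hypothetical, and removing the comma before parsing is an assumption based on the statement that the thousands separator is ignored.

```python
import re
from decimal import Decimal, InvalidOperation

# The example regex from above: dot as decimal marker, comma as separator.
NUMBER_RE = re.compile(r"[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+")


def recognize_numbers(text):
    """Return (token, numeric value) pairs for numbers found in the text."""
    results = []
    for match in NUMBER_RE.finditer(text):
        token = match.group(0)
        try:
            # Assumption: the thousands separator is dropped before parsing.
            value = Decimal(token.replace(",", ""))
        except InvalidOperation:
            continue  # not parseable as a number, so no annotation
        results.append((token, value))
    return results


print(recognize_numbers("Pay 1,234.50 or -3.14 now"))
```

In the input above, the second token is `3.14` rather than `-3.14`, which reflects the guarantee that the sign is not part of the annotated token.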

Language Detector

The Language Detector uses a machine learning model to predict the language of a given user input and adds an annotation, as shown in the table below, to the input together with the confidence score of the prediction.

Annotation	Variable	Description
<language label>.LANG, e.g., %$DA.LANG	Confidence	Annotation created for the predicted language

The Language Detector can predict the following 45 languages; the language label used to create the annotation name is in brackets:

Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).

Serbian, Bosnian and Croatian are treated as one language under the label SR_HR, and Indonesian and Malay are treated as one language under the label ID_MS.

The Input Processor also uses a number of regexes that prevent the model from predicting a language for fully numerical inputs, URLs or other types of nonsense input.

The Language Detector creates an annotation when the confidence of the prediction is above the threshold of 0.2. For the following languages, however, language annotations are always created (even for predictions below 0.2) since the Language Detector is mostly accurate when predicting them: Arabic, Bengali, Greek, Hebrew, Hindi, Japanese, Korean, Tamil, Telugu, Thai, Chinese, Vietnamese, Persian and Urdu.
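
The threshold rule can be summarized in a short Python sketch. The function name `language_annotation` and the constant names are hypothetical; only the threshold value and the list of always-annotated languages come from the description above.

```python
# Labels of the languages that are always annotated, per the list above.
ALWAYS_ANNOTATED = {
    "AR", "BN", "EL", "HE", "HI", "JA", "KO",
    "TA", "TE", "TH", "ZH", "VI", "FA", "UR",
}
THRESHOLD = 0.2


def language_annotation(label, confidence):
    """Return (annotation name, confidence), or None if no annotation is created."""
    if label in ALWAYS_ANNOTATED or confidence > THRESHOLD:
        return (f"{label}.LANG", confidence)
    return None


print(language_annotation("TR", 0.9))
print(language_annotation("TR", 0.1))
print(language_annotation("JA", 0.05))
```

A Turkish prediction with confidence 0.1 produces no annotation, while a Japanese prediction with confidence 0.05 still does.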

Predict

The Predict Input Processor makes use of an intent model generated when classes are available in a Teneo Studio solution to annotate user inputs with the defined classes; intent models can be generated either with Teneo Learn or CLU. Note that as of Teneo 7.3, deferred intent classification is applied and annotations are only created by Predict if references to class annotations are found during the input matching process.

When Predict receives a user input, confidence scores are calculated for each class based on the model, and annotations are created for the most confident class and for each other class that matches the following criteria:

the confidence is above the minimum confidence (defaults to 0.01)
the confidence is higher than 0.5 times the confidence value of the top class.

For each selected class, an annotation with the scheme <CLASS_NAME>.INTENT is created, with the value of the model's confidence in the class as well as an annotation variable specifying the used classifier (i.e., Learn, CLU or LearnFallback) and an Order variable defining the rank of the class among the selected classes (i.e., 0 for the class with the highest confidence score, up to 4 for the lowest-ranked class when five classes are selected).
A special annotation <CLASS_NAME>.TOP_INTENT is created for the class with the highest confidence score.

Annotation	Variable	Variable	Variable	Description
<CLASS_NAME>.TOP_INTENT	classifier	confidence		Annotation created for the class with the highest confidence score
<CLASS_NAME>.INTENT	classifier	confidence	Order	Annotation given to each selected class with a maximum of five top classes

The Predict Input Processor creates a maximum of 5 annotations, regardless of how many classes match the criteria.
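
The selection rules can be sketched in Python as follows. This is a simplified illustration, not the Predict implementation: the function name `select_classes` and its parameter names are hypothetical, the classifier variable is omitted, and the sketch assumes that the limit of five applies to the .INTENT annotations.

```python
def select_classes(scores, min_confidence=0.01, ratio=0.5, max_annotations=5):
    """Build annotations from a mapping of class name to confidence score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_name, top_conf = ranked[0]
    # The top class is always selected; the others must pass both criteria.
    selected = [ranked[0]] + [
        (name, conf)
        for name, conf in ranked[1:]
        if conf > min_confidence and conf > ratio * top_conf
    ]
    # Assumption: at most five .INTENT annotations are created.
    selected = selected[:max_annotations]
    annotations = [(f"{top_name}.TOP_INTENT", {"confidence": top_conf})]
    for order, (name, conf) in enumerate(selected):
        annotations.append((f"{name}.INTENT", {"confidence": conf, "Order": order}))
    return annotations


scores = {"BOOK": 0.8, "CANCEL": 0.5, "GREET": 0.3, "BYE": 0.005}
for annotation in select_classes(scores):
    print(annotation)
```

With these example scores, the cut-off for the other classes is 0.5 times 0.8, i.e. 0.4, so CANCEL is selected while GREET and BYE are not.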