One of the first steps in processing text is to break it apart into tokens and to group those tokens into sentences. We use the word "tokens" and not "words" because tokens can also be things like:
- punctuation (exclamation points affect sentiment, for instance)
- links (http://...)
- possessive markers
For most European languages, tokenization is fairly straightforward: look for white space, punctuation, and similar boundaries. Each language has its own quirks, though. German, for instance, makes extensive use of compound words, and for some purposes, such as sentiment analysis, it can be worth tokenizing down to the sub-word level.
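As a rough illustration of the whitespace-and-punctuation approach, here is a minimal Python sketch that splits English text into tokens and then groups them into sentences. The names `TOKEN_PATTERN`, `tokenize`, and `group_sentences` are hypothetical and chosen for this example; this is not the Lexalytics tokenizer, and it ignores many real-world cases (abbreviations, contractions, URLs followed directly by punctuation).

```python
import re

# Alternatives are tried in order, so links are matched before plain words.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+      # links (runs up to the next white space)
    | [!?]{2,}        # runs of exclamation or question marks
    | 's\b            # possessive marker
    | \w+             # words and numbers
    | [^\w\s]         # any other single punctuation character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into word, link, possessive, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def group_sentences(tokens):
    """Close a sentence after each token that starts with '.', '!' or '?'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token[0] in ".!?":
            sentences.append(current)
            current = []
    if current:  # trailing tokens without terminating punctuation
        sentences.append(current)
    return sentences


tokens = tokenize("Great!! See http://example.com now.")
print(tokens)
print(group_sentences(tokens))
```

Keeping punctuation and possessive markers as separate tokens, rather than discarding them, is what lets later stages use them, for example when exclamation points feed into sentiment.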
Some languages, such as Chinese, have no spaces between words, and tokenizing them requires more sophisticated statistical models. Lexalytics has developed tokenization models for all of our supported languages.
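To see why unspaced text is harder, consider the sketch below. It uses forward maximum matching, a simple dictionary-based technique, and not the statistical models mentioned above. The function name `max_match` and the tiny vocabulary are assumptions made up for this example. At each position it takes the longest dictionary entry it can find and falls back to a single character otherwise.

```python
def max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching over an unspaced string."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary; a real system needs a far larger lexicon.
VOCAB = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}

print(max_match("我喜欢自然语言处理", VOCAB))
```

A greedy approach like this has no way to recover when the longest match is the wrong one, and it cannot handle words missing from the dictionary, which is part of why statistical models are used in practice.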
Most NLP and text mining tools make use not just of a bucket of tokens but also of their parts of speech (POS). Knowing a token's part of speech makes it more useful. Proper nouns (e.g. Lexalytics) are more likely to be a mention of a person, place, or company, whereas adjectives (e.g. terrible) are more likely to be sentiment phrases, and so on. In most languages, a single word can belong to several parts of speech depending on context: "Love makes the world go round" has "love" as a noun, while "I love NLP" has "love" as a verb. Determining the part of speech for a token requires evaluating the context the word appears in.
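The "love" example can be sketched with a single hand-written context rule. This is only a toy: production taggers use statistical models over much wider context, and the names `PRONOUNS`, `MODALS`, and `guess_pos` are hypothetical. The rule assumes that a word ambiguous between noun and verb is a verb when it directly follows a personal pronoun or a modal, and a noun otherwise.

```python
# Small closed-class word lists used as context clues (illustrative only).
PRONOUNS = {"i", "you", "we", "they", "he", "she", "it"}
MODALS = {"can", "could", "may", "must"}


def guess_pos(tokens, index):
    """Guess 'noun' or 'verb' for tokens[index] from the preceding token."""
    if index > 0 and tokens[index - 1].lower() in PRONOUNS | MODALS:
        return "verb"
    return "noun"


first = "Love makes the world go round".split()
second = "I love NLP".split()

print(guess_pos(first, 0))   # "Love" has no preceding pronoun or modal
print(guess_pos(second, 1))  # "love" follows the pronoun "I"
```

The same surface form gets a different label in each sentence purely because of its neighbors, which is the point of the paragraph above.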
Lexalytics has developed POS tagging models for most of our supported languages, and can return POS tags along with the text output if desired. Our set of POS tags is an extension of the Penn Treebank set of POS tags.
Below you will find a complete list of the supported tags and their meanings.
Plural common noun
Plural proper noun
Possessive pronoun (his, her, my)
Personal pronoun (I, he, she, him, her)
Wh- pronoun (what, which, who, whom)
Wh- possessive pronoun (whose)
Modal verb (can, could, may, must)
Base verb (take)
Future tense, conditional
Past tense (took)
Gerund, present participle (taking)
Past participle (taken)
Present tense (take)
Present 3rd person singular (takes)
Adjective / Adverb
Wh- Adverb (how, where, why)
Determiner (a, the, an...)
Predeterminer (all, both...)
Wh- determiner (which)
Possessive Marker (', 's)
Hyphen / dash
Terminating punctuation (!, ., ?)
Preposition / subordinating conjunction
Words that should be possessives, but are missing an apostrophe
Words that are actually several words at once
Multiple Exclamation Marks or Question Marks
Chinese-specific POS tags
Aspect marker (了)
Ba- construction (ex. 把)
的 in a relative clause
得 in V-得 constructions
等 meaning "etc."
Predicative adjective (ex. 很快 fast)
被 in passive construction
Localizer (ex. 上 on)
Measure word (ex. 个)
Other particles (ex. 所)
Sentence final particle (ex. 吗)