Tokenization and POS Tagging

One of the first steps in processing text is to break the text into tokens and to group those tokens into sentences. We say "tokens" rather than "words" because a token can also be something like:

  • punctuation (exclamation points affect sentiment, for instance)
  • links (http://...)
  • possessive markers
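As a rough illustration of these token types (a minimal regular-expression sketch, not the Lexalytics tokenizer), the pattern below keeps links whole, splits possessive markers off their nouns, and emits each punctuation mark as its own token. The names `TOKEN_PATTERN` and `tokenize` are hypothetical.

```python
import re

# Order matters: earlier alternatives win, so links are tried before plain words.
# Simplification: a link swallows any punctuation directly attached to its end.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+          # links
  | 's\b | '(?=\s|$)      # possessive markers: 's, or an apostrophe ending a word
  | \w+                   # runs of letters and digits
  | [^\w\s]               # any other single punctuation character
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Lexalytics's API rocks! See http://example.com now"))
# ['Lexalytics', "'s", 'API', 'rocks', '!', 'See', 'http://example.com', 'now']
```

Grouping tokens into sentences is a separate step and is not attempted here.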

Language Approach

For most European languages, tokenization is fairly straightforward: split on white space, split on punctuation, and the like. Each language has its own features, though. German, for instance, makes extensive use of compound words, and for some purposes, such as sentiment analysis, it can be worth tokenizing down to the sub-word level.

Some languages, such as Chinese, do not put spaces between words, so tokenizing them requires more sophisticated statistical models. Lexalytics has developed tokenization models for all of our supported languages.
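To see why white space is not enough for Chinese, consider the sketch below. It uses greedy forward maximum matching against a word list, which is a simple dictionary baseline and not the statistical approach described above. The function name `max_match` and the tiny vocabulary are assumptions made for illustration.

```python
def max_match(text, vocabulary, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical toy vocabulary; a real system would need a far larger one.
vocab = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}
sentence = "我喜欢自然语言处理"

print(sentence.split())            # ['我喜欢自然语言处理'] (no spaces to split on)
print(max_match(sentence, vocab))  # ['我', '喜欢', '自然语言', '处理']
```

Greedy matching picks the wrong boundary whenever a longer dictionary word overlaps the correct split, which is one reason statistical models do better on real text.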

Parts of Speech (POS)

Most NLP and text mining tools make use not just of a bucket of tokens but also of their parts of speech. Knowing what part of speech a token is makes it more useful. Proper nouns (e.g. Lexalytics) are more likely to be a mention of a person, place, or company, whereas adjectives (e.g. terrible) are more likely to be part of a sentiment phrase, and so on. In most languages, a single word can be more than one part of speech depending on context: "Love makes the world go round" has "love" as a noun, while "I love NLP" has "love" as a verb. Determining the part of speech for a token therefore requires evaluating the context the word appears in.
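The "love" example can be made concrete with a toy rule that looks at the preceding token. This is only a sketch of why context matters; it handles a single word with a single hand-written rule, whereas a real tagger uses a trained model. The function `tag_love` and the pronoun list are hypothetical.

```python
# Subject pronouns that, in this toy rule, signal that "love" is used as a verb.
PRONOUNS = {"i", "you", "we", "they"}

def tag_love(tokens):
    """Return (token, tag) pairs; only the word 'love' is tagged, others get None."""
    tagged = []
    for index, token in enumerate(tokens):
        tag = None
        if token.lower() == "love":
            previous = tokens[index - 1].lower() if index > 0 else None
            # After a subject pronoun: present-tense verb (VBP); otherwise: common noun (NN).
            tag = "VBP" if previous in PRONOUNS else "NN"
        tagged.append((token, tag))
    return tagged

print(tag_love("Love makes the world go round".split())[0])  # ('Love', 'NN')
print(tag_love("I love NLP".split())[1])                     # ('love', 'VBP')
```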

Lexalytics has developed POS tagging models for most of our supported languages, and can return POS tags along with the text output if desired. Our set of POS tags is an extension of the Penn Treebank set of POS tags.
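Once tokens carry tags, downstream steps can filter on them. The sketch below assumes the tagged output is available as (token, tag) pairs, which is a hypothetical structure and may not match the actual output format. It picks out proper nouns as entity candidates and adjectives as sentiment candidates.

```python
# Hypothetical tagger output: (token, tag) pairs. The real output format may differ.
tagged = [("Lexalytics", "NNP"), ("makes", "VBZ"), ("terrible", "JJ"), ("coffee", "NN")]

ENTITY_TAGS = {"NNP", "NNPS"}          # proper nouns, singular and plural
SENTIMENT_TAGS = {"JJ", "JJR", "JJS"}  # adjectives, comparative, superlative

def select(tagged_tokens, wanted_tags):
    """Keep the tokens whose tag is in wanted_tags, preserving order."""
    return [token for token, tag in tagged_tokens if tag in wanted_tags]

entity_candidates = select(tagged, ENTITY_TAGS)
sentiment_candidates = select(tagged, SENTIMENT_TAGS)

print(entity_candidates)     # ['Lexalytics']
print(sentiment_candidates)  # ['terrible']
```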

Below you will find a complete list of the supported tags and their meanings.

Noun

NN      Common noun
NNS     Plural common noun
NNP     Proper noun
NNPS    Plural proper noun
PRP$    Possessive pronoun (his, her, my)
PRP     Personal pronoun (I, he, she, him, her)
WP      Wh- pronoun (what, which, who, whom)
WP$     Wh- possessive pronoun (whose)

Verb

MD      Modal verb (can, could, may, must)
VB      Base verb (take)
VBC     Future tense, conditional
VBD     Past tense (took)
VBF     Future tense
VBG     Gerund, present participle (taking)
VBN     Past participle (taken)
VBP     Present tense (take)
VBZ     Present 3rd person singular (takes)

Adjective / Adverb

JJ      Adjective
JJR     Comparative adjective
JJS     Superlative adjective
RB      Adverb
RBR     Comparative adverb
RBS     Superlative adverb
WRB     Wh- adverb (how, where, why)

Determiner

DT      Determiner (a, the, an...)
PDT     Predeterminer (all, both...)
WDT     Wh- determiner (which)

Symbol

SYM     Symbol
POS     Possessive marker (', 's)
LRB     Open parenthesis
RRB     Close parenthesis
,       Comma

Hyphen / dash

:       Colon
;       Semicolon
.       Terminating punctuation (!, ., ?)
``      Open quote
"       Close quote
$       Currency symbol

Misc

CD      Cardinal
DAT     Date/Time
CC      Coordinating conjunction
EX      Existential there
FW      Foreign word
IN      Preposition / subordinating conjunction
RP      Particle
TO      To
UH      Interjection

Twitter

URL     URL
USER    Username
EMAIL   Email address
NNPOS   Words that should be possessives but are missing an apostrophe
UR      Words that are actually several words at once
:)      Smileys
UH      Interjection
#TAG    Hashtag

Untagged Token

!!!     Multiple exclamation marks or question marks
...     Ellipsis

Chinese-specific POS tags

AS      Aspect marker (了)
BA      Ba- construction (e.g. 把)
DEC     的 in a relative clause
DEG     Associative/possessive 的
DER     得 in V-得 constructions
DEV     Adverbial 地
ETC     等, meaning "etc."
JJV     Predicative adjective (e.g. 很快, "fast")
LB      被 in passive construction
LC      Localizer (e.g. 上, "on")
M       Measure word (e.g. 个)
MSP     Other particles (e.g. 所)
SP      Sentence-final particle (e.g. 吗)