Extracting the important words and phrases from a piece of text is known in NLP variously as theme, keyword, or key phrase extraction. At Lexalytics, we call it theme extraction. Themes are found by looking for phrases that match part-of-speech (POS) patterns, mostly descriptive noun phrases such as "delicious seafood" or "seared scallops." We do not extract single words, because single words are often very noisy.
Theme extraction does not return entities, since entities are proper nouns. If you want to find companies, products, people and so on, see our entity recognition feature.
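To make the idea of POS-pattern matching concrete, here is a minimal sketch in Python. It is not Lexalytics' implementation: the function name `extract_candidate_themes`, the choice of Penn Treebank tags, and the single pattern (zero or more adjectives or participles followed by one or more common nouns, at least two words in total) are all assumptions made for illustration. The input is assumed to be already tagged by some POS tagger.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (token, POS tag); Penn Treebank tags are assumed

# Hypothetical tag sets for illustration only.
MODIFIER_TAGS = {"JJ", "JJR", "JJS", "VBN", "VBG"}  # adjectives and participles
NOUN_TAGS = {"NN", "NNS"}  # common nouns only; proper nouns (NNP, NNPS) are skipped


def extract_candidate_themes(tagged: List[Tagged]) -> List[str]:
    """Return multi-word phrases matching: modifier* noun+ (two or more words)."""
    themes = []
    i, n = 0, len(tagged)
    while i < n:
        # Consume any leading modifiers.
        j = i
        while j < n and tagged[j][1] in MODIFIER_TAGS:
            j += 1
        # Consume the nouns that follow.
        k = j
        while k < n and tagged[k][1] in NOUN_TAGS:
            k += 1
        if k > j and k - i >= 2:
            # At least one noun and at least two words: keep the phrase.
            themes.append(" ".join(token for token, _ in tagged[i:k]))
            i = k
        else:
            # No usable phrase here (for example, a single word); move on.
            i = max(k, i + 1)
    return themes


# Hand-tagged sample sentence, so the sketch needs no external tagger.
sample_tagged = [
    ("The", "DT"), ("restaurant", "NN"), ("served", "VBD"),
    ("delicious", "JJ"), ("seafood", "NN"), ("and", "CC"),
    ("seared", "VBN"), ("scallops", "NNS"), (".", "."),
]
themes = extract_candidate_themes(sample_tagged)
print(themes)
```

In this sketch the single noun "restaurant" is skipped because it is only one word, and proper nouns never match because their tags are not in the noun set.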
When we find candidate themes in the text, we also look at their relevance to the entire document. A candidate that doesn't seem relevant to the rest of the document is dropped; one that is retained receives a relevance score. This score is only useful for ranking themes against each other within a single document, not for comparing themes across documents. In addition to the theme itself and its score, we report a stemmed, lower-cased version of the theme, its sentiment, and the theme summary.
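As a rough illustration of how the reported fields might be consumed, the sketch below models one theme as a record and ranks the themes of a single document by relevance. The class name `Theme`, the field names, the score scales, and the sample values are all hypothetical and made up for the example; they are not the actual response format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Theme:
    # All field names are hypothetical; they mirror the fields described above.
    text: str         # the theme as it appears in the document
    normalized: str   # stemmed, lower-cased version (illustrative values only)
    relevance: float  # comparable only within one document
    sentiment: float
    summary: str


def rank_themes(themes: List[Theme]) -> List[Theme]:
    """Order the themes of ONE document from most to least relevant."""
    return sorted(themes, key=lambda t: t.relevance, reverse=True)


# Made-up values for a single document, purely for illustration.
doc_themes = [
    Theme("Seared scallops", "seared scallop", 0.4, 0.6, "The seared scallops were..."),
    Theme("Delicious seafood", "delicious seafood", 0.9, 0.8, "Delicious seafood and..."),
]
ranked = rank_themes(doc_themes)
print([t.text for t in ranked])
```

Because the score is only meaningful within a document, a helper like `rank_themes` should be called once per document rather than on themes pooled from several documents.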
Unlike most of our other features, themes are not easily tunable, but you can use the blacklist feature in Semantria to suppress themes containing words you do not find useful or relevant. Additional tuning is possible via our professional services.
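The effect of a blacklist can be pictured with a small client-side sketch: any theme containing a blacklisted word is dropped. This only illustrates the behavior described above; it is not the Semantria blacklist API, and the function name `suppress_blacklisted` and the whitespace-based, case-insensitive word matching are assumptions.

```python
from typing import Iterable, List


def suppress_blacklisted(themes: Iterable[str], blacklist: Iterable[str]) -> List[str]:
    """Drop every theme that contains at least one blacklisted word.

    Matching is case-insensitive and on whole words split by whitespace
    (an assumption made for this sketch).
    """
    blocked = {word.lower() for word in blacklist}
    return [
        theme
        for theme in themes
        if not any(word in blocked for word in theme.lower().split())
    ]


kept = suppress_blacklisted(
    ["delicious seafood", "friendly staff", "long wait"],
    ["staff"],
)
print(kept)
```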