Lexalytics was the first vendor to ship a commercial sentiment engine, back in 2003. Lexalytics returns a sentiment score that runs from negative to positive. The larger the magnitude of the score, the more strongly negative or positive the text is. It is up to you to interpret that score - do you want to cut it into a three-point scale (negative, neutral, positive), a five-point scale (1 to 5), or, as some PR agencies do, "negative" and "everything else"?
Our Semantria product returns both the score and a three-point scale of negative, neutral and positive based on a default neutral range of -0.05 to +0.22. You can use our default sentiment range, or you can try your own.
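As a rough illustration of applying a neutral range yourself, here is a minimal Python sketch. The function name `to_three_point` and its parameters are made up for this example and are not part of the Semantria API. Treating the range boundaries as neutral is also an assumption.

```python
def to_three_point(score: float,
                   neutral_low: float = -0.05,
                   neutral_high: float = 0.22) -> str:
    """Map a raw sentiment score onto a three-point scale.

    The defaults mirror the default neutral range quoted above.
    Assumption: scores exactly on a boundary count as neutral.
    """
    if score < neutral_low:
        return "negative"
    if score > neutral_high:
        return "positive"
    return "neutral"


# Default range, then a narrower custom neutral range.
print(to_three_point(0.10))
print(to_three_point(0.10, neutral_low=-0.02, neutral_high=0.05))
```

With the default range a score of 0.10 falls inside the neutral band, while the narrower custom range pushes the same score into positive.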
Sentiment applies to just about everything. A sentiment score is returned for:
- a whole document
- each entity
- each query and category
- each theme
This allows you to drill deeper into the discussion - is the theme of "customer service" positive or negative? What is the distribution of sentiment for the query "check in" OR "front desk" in my data set?
In addition to the overall score, we return the individual phrases that were found, along with their scores, and a lot more information on each one.
Lexalytics offers two methods of calculating sentiment. One works out of the box and is called phrase-based sentiment. The other is called model-based sentiment and requires you to build a machine learning model from your own data to generate sentiment. Phrase-based sentiment works well for most content sets and is easily configurable by a non-expert.
The short answer:
We find sentiment words in a clever way and add up their scores.
For the long answer, see the sections below.
Phrase-based sentiment builds on the POS (Part of Speech) tagging Lexalytics performs. When calculating sentiment for a document, the POS tags of words and phrases are evaluated against POS patterns, and those that match the allowable patterns are then looked up in a large dictionary of pre-scored words and phrases. Why do we need POS patterns? Because a word might be sentiment-bearing as one POS and not sentiment-bearing as another. For instance, "love" is positive as a verb ("I love NLP"), but not positive when it's part of a name (Courtney Love).
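The idea of gating the dictionary lookup on part of speech can be sketched in a few lines of Python. This is a simplification, not the Lexalytics implementation: the tokens are tagged by hand, the dictionary entries and scores are invented, and a single (word, POS) key stands in for the richer POS patterns described above.

```python
# Hypothetical pre-scored dictionary, keyed by (lowercased word, POS tag).
# The entries and scores are invented for illustration.
SENTIMENT_DICTIONARY = {
    ("love", "VERB"): 0.6,
    ("bland", "ADJ"): -0.4,
}


def lookup_sentiment(tagged_tokens):
    """Return (word, score) pairs for tokens whose word and POS both match."""
    hits = []
    for word, pos in tagged_tokens:
        score = SENTIMENT_DICTIONARY.get((word.lower(), pos))
        if score is not None:
            hits.append((word, score))
    return hits


# Hand-tagged examples; a real system would get these from a POS tagger.
verb_use = [("I", "PRON"), ("love", "VERB"), ("NLP", "PROPN")]
name_use = [("Courtney", "PROPN"), ("Love", "PROPN")]

print(lookup_sentiment(verb_use))  # the verb "love" is found
print(lookup_sentiment(name_use))  # the proper noun "Love" is not
```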
The pre-scored dictionary of words and phrases has scores ranging from -1 (always strongly negative) to +1 (always strongly positive) but most words and phrases are not at the ends, because many words can be more or less strong in different contexts.
Once we have identified the sentiment words and have received the scores from the dictionary, we look for modifiers to those phrases. The most common modifiers are negators (e.g. not, never) and intensifiers (e.g. very, somewhat). These multiply the score of the phrase. A negator generally flips the sentiment score to the reverse of the dictionary value, while an intensifier might increase the total score (very) or decrease it (somewhat). Additional non-obvious intensifiers include things like comparative clauses - we weight the conclusion of a clause more heavily than the beginning, and we also try to eliminate boilerplate, or cliched expressions. In an expression like "The food was bland, but we had a good time anyways", we de-weight "bland" while up-weighting "good time". "Good morning" is not really sentiment-bearing on its own, it's just a standard greeting, while "I had a really good morning today" is sentiment-bearing. Phrases we think are boilerplate have their sentiment scores heavily de-weighted.
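To make the "modifiers multiply the score" idea concrete, here is a small Python sketch. The multiplier values and the `apply_modifiers` helper are assumptions chosen for illustration. The actual weights Lexalytics uses are not given in this document, and clause weighting and boilerplate detection are left out.

```python
# Hypothetical multipliers: negators flip the sign, intensifiers scale it.
MODIFIERS = {
    "not": -1.0,
    "never": -1.0,
    "very": 1.5,
    "somewhat": 0.5,
}


def apply_modifiers(phrase_score, preceding_words):
    """Multiply a dictionary score by each modifier found before the phrase."""
    score = phrase_score
    for word in preceding_words:
        # Words that are not modifiers leave the score unchanged.
        score *= MODIFIERS.get(word.lower(), 1.0)
    return score


print(apply_modifiers(0.5, ["very"]))         # intensified
print(apply_modifiers(0.5, ["somewhat"]))     # dampened
print(apply_modifiers(0.5, ["not", "very"]))  # negated and intensified
```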
At this point we have a list of words with modified scores, and we then add up the scores, weighting the total score by the number of words.
All of this applies to whatever the text is - a whole document, an entity mention, a theme, and so on. For anything smaller than the whole document, we create a type of summary called a lexical chain, and calculate the sentiment based on that.
The basic steps in calculating sentiment are as follows:
- Sentiment phrases are located in the content
- Sentiment phrases in questions are thrown out
- The score is decreased if the sentence is not polar
- The score is increased if there is an exclamation
- The score is increased if it is a superlative (such as fast/fastest)
- The score is increased if it is repeated
- Hashtags are de-compounded for sentiment scoring
- The scores are added up and averaged
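A few of the steps above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the production algorithm: the boost factors are invented, questions and exclamations are detected only by the final punctuation mark, and the polarity, superlative and hashtag steps are omitted.

```python
from collections import Counter

EXCLAMATION_BOOST = 1.25  # hypothetical factor
REPEAT_BOOST = 1.25       # hypothetical factor


def document_score(sentences):
    """Average phrase scores over a document.

    `sentences` is a list of (sentence_text, [(phrase, score), ...]) pairs,
    where the phrase scores are assumed to come from an earlier lookup step.
    """
    kept = []
    for text, phrases in sentences:
        stripped = text.strip()
        if stripped.endswith("?"):
            continue  # sentiment phrases in questions are thrown out
        boost = EXCLAMATION_BOOST if stripped.endswith("!") else 1.0
        kept.extend((phrase.lower(), score * boost) for phrase, score in phrases)

    # Phrases that occur more than once get an extra boost.
    counts = Counter(phrase for phrase, _ in kept)
    adjusted = [
        score * (REPEAT_BOOST if counts[phrase] > 1 else 1.0)
        for phrase, score in kept
    ]
    # Add up the scores and average them; an empty document scores 0.0.
    return sum(adjusted) / len(adjusted) if adjusted else 0.0


review = [
    ("Is the food good?", [("good", 0.5)]),
    ("The room was great!", [("great", 0.8)]),
    ("Service was slow.", [("slow", -0.4)]),
]
print(document_score(review))
```

In this example the question is ignored, "great" is boosted by the exclamation, and the two remaining phrase scores are averaged.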
Model-based sentiment is a technique of training a machine learning model on a set of texts that have been scored as negative, neutral or positive by humans. If you have your own set of data already classified by humans into those buckets, then building a model might be a good way for you to get sentiment tuned to your needs.
Model sentiment applies only to the document level, and reports probabilities for the document to fall into negative, neutral or positive. If you need sentiment for entities or queries, you will need phrase-based sentiment instead of model sentiment.