Enterprise NLP in the cloud

The Semantria cloud-based sentiment analysis API is powered by our Salience engine, the industry leader, which builds on more than 10 years of R&D and sentiment analysis expertise.

Here, we give a brief summary of the terms and processes Semantria uses to give you quick and accurate text and sentiment analysis.

Sentiment Analysis

Sentiment analysis is the process of detecting positive, negative, or neutral feelings in a piece of writing. Humans have the innate ability to determine sentiment; in a business context, however, this process is time-consuming, inconsistent, and costly. It’s unrealistic for a human to read tens of thousands of customer reviews one at a time for sentiment. That's where Semantria comes in.

Semantria’s cloud-based sentiment analysis software is based on Natural Language Processing and delivers results that are more consistent than those of two humans (who statistically agree with each other only 80% of the time). Semantria analyzes each document with algorithms developed to extract sentiment from your content much as a human does, only 60,000 times faster.

Semantria extracts the sentiment of a document through the following steps:

  1. Semantria breaks the document into its basic parts of speech, called POS tags, which identify the structural elements of a sentence (e.g. nouns, adjectives, verbs, and adverbs).
  2. Algorithms identify sentiment-bearing phrases like "terrible service" or "cool atmosphere."
  3. Each sentiment-bearing phrase earns a score based on a logarithmic scale ranging between -1 and 1.
  4. Semantria combines the scores to determine the overall sentiment of the document or sentence. Document scores generally range between -1 and 1, but very positive or very negative documents may fall outside that range.
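
The last three steps can be sketched roughly as follows. This is a minimal illustration, not Semantria's implementation: the phrase scores are invented, POS tagging is skipped in favor of plain phrase matching, and summing the scores is an assumption (the steps above do not say how scores are combined).

```python
# Hypothetical phrase scores, each between -1 and 1 (values invented for illustration).
PHRASE_SCORES = {
    "terrible service": -0.8,
    "cool atmosphere": 0.6,
    "great food": 0.7,
}


def document_sentiment(text: str) -> float:
    """Sum the score of every sentiment-bearing phrase found in the text.

    Summation is an assumed combining rule; it is used here only because it
    shows how a document score can drift outside the -1 to 1 range.
    """
    lowered = text.lower()
    return sum(
        score * lowered.count(phrase) for phrase, score in PHRASE_SCORES.items()
    )


review = "Cool atmosphere and great food, but terrible service."
print(round(document_sentiment(review), 2))
```

With a summing rule like this one, a document that repeats strongly positive phrases would score above 1, which matches the behavior described in step 4.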

For example, to calculate the sentiment of a phrase such as “terrible service,” Semantria uses search engine queries similar to the following:

“(Terrible service) near (good, wonderful, spectacular)”

“(Terrible service) near (bad, horrible, awful)”

Each result is added to a hit count; these are then combined using a mathematical operation called “log odds ratio” to determine the final score of a given phrase.
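
As a rough sketch of the log odds ratio step, the snippet below scores a phrase from two hit counts. The counts, the function name, and the add-one smoothing are all assumptions made for illustration; how Semantria maps the result onto its -1 to 1 scale is not described here, so the raw value is left as is.

```python
import math


def log_odds_score(positive_hits: int, negative_hits: int, smoothing: float = 1.0) -> float:
    """Log odds ratio of positive to negative co-occurrence counts.

    The smoothing term (an assumption) avoids dividing by zero or taking
    the log of zero when one of the counts is 0.
    """
    return math.log((positive_hits + smoothing) / (negative_hits + smoothing))


# Hypothetical hit counts for "terrible service" near positive vs. negative words.
score = log_odds_score(positive_hits=120, negative_hits=4800)
print(round(score, 3))
```

A phrase that appears mostly near negative words gets a negative score, one that appears mostly near positive words gets a positive score, and equal counts give 0.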

Concept Matrix and Deep Learning

Wikipedia is extraordinary. It's an encyclopedia, a dictionary, a thesaurus, a collection of semantic word nets-- in short, it's the largest source of knowledge on the internet. Every word links to other words, giving context and exposing fascinating relationships. We thought we'd take advantage of that huge knowledge base, so Semantria created the Concept Matrix. It uses Wikipedia's knowledge to understand and measure the semantic distances between words. The Concept Matrix is a key component in Semantria's deep learning algorithms which enable it to understand content in a human-like manner.

By using deep learning and the Concept Matrix while mining text, Semantria understands that "cat" is closely related to "lion" but not to "anaconda." Semantria recognizes that the concept of president can be bound to people like Obama, JFK, and Lincoln, or to positions like CEO or Commander-in-Chief, depending on the context.

With such smart software, you can build a set of accurate categories quickly, efficiently and reliably.

Categorization

Training your engine for automated categorization using traditional methods is long and tedious. Standard categorization engines have you sit down and tell your engine what is and isn't part of each category by working through a colossal set of documents. Building 50 categories could easily take a team of trained linguists and statisticians weeks or months to complete. If you factor in the necessary maintenance for rogue and miscategorized content, automated classification becomes a very expensive ordeal.

Semantria’s API categorization feature uses the Concept Matrix to remove the headache that comes with building categories. Because the Concept Matrix draws on all of Wikipedia’s knowledge, words and concepts are already linked to one another. The engine needs only a few sample words and Semantria will build the entire category for you.

Consider “Beverages” as a category. To build this category, just give Semantria a few sample words such as “beverage,” “alcohol,” and “soda.” When given the sentence, “Coca Cola returned to their original formula due to unfavourable consumer reviews,” Semantria will categorize it into "Beverages" in its analysis. Although “Coca Cola” was not used as a sample word, it was identified through the Concept Matrix.

The data from Wikipedia informs the Concept Matrix that Coca Cola is a popular soda with an internationally-known brand; thus the Concept Matrix determines that there’s a strong semantic link between “Coca Cola,” “beverage,” and “soda”.
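
To make the idea of semantic distance concrete, here is a toy sketch that scores a term against a category's sample words using cosine similarity. The three-dimensional vectors are invented and cosine similarity is a stand-in technique; the actual Concept Matrix representation and distance measure are not described in this summary.

```python
import math

# Toy concept vectors; the numbers are invented purely for illustration.
CONCEPT_VECTORS = {
    "beverage": [0.9, 0.1, 0.0],
    "soda": [0.8, 0.2, 0.1],
    "alcohol": [0.7, 0.0, 0.3],
    "coca cola": [0.85, 0.25, 0.05],
    "anaconda": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def category_score(term, sample_words):
    """Average similarity between a term and the category's sample words."""
    term_vector = CONCEPT_VECTORS[term]
    total = sum(cosine(term_vector, CONCEPT_VECTORS[word]) for word in sample_words)
    return total / len(sample_words)


beverages = ["beverage", "soda", "alcohol"]
print(category_score("coca cola", beverages) > category_score("anaconda", beverages))
```

In this toy setup, "coca cola" lands close to the sample words even though it was never listed as one, while "anaconda" does not.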

Categories Guide

Named Entity Extraction

Semantria’s Named Entity Extraction feature (also known as Named Entity Recognition, or NER) automatically pulls proper nouns from text and determines the sentiment attached to each one in the document. Entities like people, places, companies, brands, or job titles are classified, tagged, and assigned a sentiment score. Named entity extraction gives you insight into what people are saying about your company and, perhaps more importantly, your competitors.

Semantria comes with a list of pre-installed entities so that you can get started immediately. You can also configure your own list of custom entities, specific to you.

Semantria extracts entities in the process below:

  1. Semantria breaks a document into POS tags, which are the basic parts of speech.
  2. A series of algorithms extracts Named Entities from the document.
  3. Each Named Entity has a set of associated parameters:
    • Entity -- The exact entity being extracted. Different names for the same reference are simplified to one name (e.g. Ol’ Blue Eyes, The Chairman of the Board, The Voice, Francis Albert and Boney Baritone are condensed to "Frank Sinatra").
    • Sentiment -- Positive, negative or neutral tone of all mentions of an entity.
    • Evidence -- The number of sentiment-bearing phrases associated with a given entity. Evidence has a score of 1-7, with 1 being fewest phrases and 7 being the most.
    • Confidence -- A numerical value from 0 to 1 that indicates how well an entity matches your Boolean query. For example, "Evergreen" is both a consulting firm and a type of tree. Semantria tells the two apart with a confidence query configured as "Evergreen AND consulting." Most articles about evergreen trees will have a Confidence value of 0 and will therefore be filtered out. If the query were "Evergreen AND consulting OR trees," the Confidence value would be 1 because most articles would satisfy the Boolean request.
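
The parameters above could be modeled along these lines. This is a hypothetical sketch: the class name, field types, and alias table are assumptions for illustration and do not reflect Semantria's actual output format.

```python
from dataclasses import dataclass

# Hypothetical alias table; a real system would hold far more entries.
ALIASES = {
    "ol' blue eyes": "Frank Sinatra",
    "the chairman of the board": "Frank Sinatra",
    "the voice": "Frank Sinatra",
    "francis albert": "Frank Sinatra",
}


@dataclass
class Entity:
    entity: str        # canonical name after alias resolution
    sentiment: float   # tone across all mentions (negative, neutral, or positive)
    evidence: int      # 1 (fewest sentiment-bearing phrases) to 7 (most)
    confidence: float  # 0 to 1, how well the entity matches the Boolean query

    def __post_init__(self):
        if not 1 <= self.evidence <= 7:
            raise ValueError("evidence must be between 1 and 7")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def canonical_name(mention: str) -> str:
    """Collapse known aliases to one name; unknown mentions pass through."""
    cleaned = mention.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


record = Entity(canonical_name("The Voice"), sentiment=0.4, evidence=3, confidence=1.0)
print(record.entity)
```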

Queries

Semantria’s Query feature uses boolean search functions so you can categorize your content and extract data with precision and simplicity.

Sentiment scores are applied to the query results, letting you measure the sentiment associated with keywords that trigger the query itself. Query sentiment is the most precise measure offered by Semantria.

Query sentiment is different from document sentiment. Query sentiment measures the sentiment around the keywords that trigger the query, while document sentiment measures the sentiment of the entire document.

Queries behave like a very literal search engine. Unlike Categories, which rely on the Concept Matrix, Queries are built using boolean operators and logic. Any terms not explicitly in your query are ignored. Our advice: use Queries when you're looking for something particular and use Categorization when looking at broader subjects.

For example, when analyzing reviews of a hotel, the Categories feature will broadly sort your content into categories such as service, room, and staff. When using our Queries feature, however, you are able to further subdivide existing categories into labels such as "poor service," "dirty room," and "rude staff."
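
The literal, boolean nature of Queries can be illustrated with a small evaluator. The nested-tuple query format below is an assumption chosen to keep the sketch short; it is not Semantria's query syntax.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def matches(query, tokens):
    """Evaluate a nested (operator, operand, ...) query against a token set."""
    if isinstance(query, str):
        return query in tokens
    operator, *operands = query
    if operator == "AND":
        return all(matches(operand, tokens) for operand in operands)
    if operator == "OR":
        return any(matches(operand, tokens) for operand in operands)
    if operator == "NOT":
        return not matches(operands[0], tokens)
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical "rude staff" label for hotel reviews.
rude_staff = ("AND", ("OR", "staff", "waiter"), ("OR", "rude", "unfriendly"))
print(matches(rude_staff, tokenize("The waiter was rude to us.")))
```

Because matching is literal, a review that says "the receptionist was impolite" would not trigger this query unless those words were added to it.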

Queries Guide

Theme extraction

Semantria extracts themes within your content so that you can determine trends that appear over time. Themes are noun phrases extracted from text that can be used to identify the main ideas within your content. Semantria assigns a sentiment score to each extracted theme, so you'll understand the tone behind the themes.

After Semantria receives the text, the engine identifies the POS tags. Two simultaneous steps occur:

  1. Potential themes are extracted from POS tags and kept for scoring.
  2. A process called Lexical Chaining occurs. In Lexical Chaining, Semantria will link sentences through synonyms or related nouns to establish a conceptual chain in the content.

Once Lexical Chaining and Potential Theme Extraction are complete, each theme is given a score. Potential themes that belong to the highest-ranked Lexical Chain have the highest score. The algorithm takes context and noun-phrase placement into account when scoring themes. If there are fewer than four chains in the given text, the algorithm scores purely based on count. In longer content, it is common for low-scoring noun phrases to get dropped from the theme output because Semantria has determined that they are not significant points in the text.

For example:

We stayed at the Bellagio to celebrate my 30th birthday and it was great. The staff was always helpful, and very nice. The front desk was attentive and very pleasant. They came to our room twice a day once for cleaning, and a second for turn down service.

In this piece, "front desk" is a theme. As the content lengthens with no further mention of the front desk, however, the theme gets dropped because it doesn't repeat anywhere else:

We stayed at the Bellagio to celebrate my 30th birthday and it was great. The staff was always helpful, and very nice. The front desk was attentive and very pleasant. They came to our room twice a day once for cleaning, and a second for turn down service. They upgraded our room for a discounted price and we had a beautiful view of the strip and lake, where we could see all of the fountain shows from our own window. The placement of the hotel on the strip is also very convenient as it is very close to the forum stores at Caesar's, and it is right across the street from the Paris hotel where there is some great eateries. The Bellagio also has some wonderful restaurants such as Noodles which is fairly priced for Vegas. Overall my stay at the Bellagio was far and above the expectations I had, and would go back in a heartbeat.
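
The effect described in this example can be imitated with a very simplified scoring sketch. The chain vocabulary is hand-written, matching is done by plain substring search, and a chain's length is taken to be the number of sentences it touches; all of these are assumptions, and Semantria's real Lexical Chaining is more sophisticated.

```python
# Hypothetical chains: each one groups nouns treated as related.
CHAINS = {
    "hotel": {"bellagio", "room", "stay"},
    "staff": {"staff", "front desk"},
}

SENTENCES = [
    "We stayed at the Bellagio to celebrate my 30th birthday and it was great.",
    "The staff was always helpful, and very nice.",
    "The front desk was attentive and very pleasant.",
    "They upgraded our room for a discounted price.",
    "Overall my stay at the Bellagio was far and above the expectations I had.",
]


def chain_lengths(sentences, chains):
    """Count how many sentences mention at least one term of each chain."""
    lowered = [sentence.lower() for sentence in sentences]
    return {
        name: sum(any(term in sentence for term in terms) for sentence in lowered)
        for name, terms in chains.items()
    }


def score_themes(candidates, sentences, chains):
    """Score each candidate theme by the length of the chain it belongs to."""
    lengths = chain_lengths(sentences, chains)
    return {
        theme: max(
            (lengths[name] for name, terms in chains.items() if theme in terms),
            default=0,
        )
        for theme in candidates
    }


scores = score_themes(["room", "front desk"], SENTENCES, CHAINS)
print(scores)
```

As more sentences join the hotel chain while the staff chain stays the same length, "front desk" falls further behind and becomes a candidate for being dropped.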

Summarization

Semantria's text summarization extracts the most relevant sentences from your document so you can quickly understand the main ideas without all the grunt work. Semantria provides an abridged version of your content-- 3 sentences by default-- which supplements the analytic data from other parts of the engine.

Semantria summarizes on a sentence-level basis, which means that it will select sentences most pertinent to your content. Together these sentences give you a concise synopsis of the original source text.

As with theme extraction, Semantria uses Lexical Chaining to select the most relevant information. The engine establishes a conceptual chain through related nouns, even when the concepts are from different parts of the document. Semantria sifts out nonessential or unrelated sentences to give you the big idea succinctly.

The first sentence of your summary comes from the longest Lexical Chain Semantria finds, which should best represent the main idea of your document. The remaining sentences follow in decreasing order of chain length.
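
A minimal sketch of this ordering, assuming hand-written chains and plain substring matching (both hypothetical simplifications), ranks each sentence by the longest chain it touches and keeps the top three:

```python
# Hypothetical chains: each one groups nouns treated as related.
CHAINS = {
    "hotel": {"bellagio", "room", "stay"},
    "staff": {"staff", "front desk"},
}

SENTENCES = [
    "We stayed at the Bellagio to celebrate my 30th birthday and it was great.",
    "The staff was always helpful, and very nice.",
    "The front desk was attentive and very pleasant.",
    "They upgraded our room for a discounted price.",
    "Overall my stay at the Bellagio was far and above the expectations I had.",
]


def summarize(sentences, chains, max_sentences=3):
    """Return sentences ordered by the length of the longest chain they touch."""
    lowered = [sentence.lower() for sentence in sentences]
    # Chain length = number of sentences that mention any term in the chain.
    lengths = {
        name: sum(any(term in sentence for term in terms) for sentence in lowered)
        for name, terms in chains.items()
    }

    def sentence_score(sentence):
        touched = (
            lengths[name]
            for name, terms in chains.items()
            if any(term in sentence for term in terms)
        )
        return max(touched, default=0)

    # sorted() is stable, so ties keep their original document order.
    ranked = sorted(
        range(len(sentences)),
        key=lambda index: sentence_score(lowered[index]),
        reverse=True,
    )
    return [sentences[index] for index in ranked[:max_sentences]]


summary = summarize(SENTENCES, CHAINS)
print("\n".join(summary))
```

The default of three sentences mirrors the default summary length mentioned above; the `max_sentences` parameter name is illustrative only.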

To change the default number of sentences for a summary, go to your profile in the Excel Add-In; for API users, go to configuration.

Facets and attributes

Facets extract key points from your document and list the most important ideas with their accompanying attributes. Facets are similar to Themes, but Themes rely solely on noun phrases for analysis. Facets instead rely on Subject Verb Object (SVO) parsing, so they find trends even when there are weak or no noun phrases in your text. Facets give you the information to construct more specific and relevant queries to dig deep into your content.

Consider the following sentence:

My waiter was rude.

There are no noun phrases in the sentence, so Semantria will not recognize it as a Theme. It does contain a subject, verb, and object, however-- Semantria applies SVO and extracts the information as a facet. In this case, "waiter" is the facet and "rude" is the attribute.

Semantria will output all facets that appear at least twice in your document.
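
The facet and attribute pairing can be approximated with a simple pattern match. The regular expression below is a crude stand-in for SVO parsing and only handles sentences shaped like "my/the/our X was Y"; the function name and the `min_mentions` parameter are assumptions for illustration.

```python
import re
from collections import defaultdict

# Matches "<determiner> <subject> <linking verb> [very] <attribute>".
PATTERN = re.compile(
    r"\b(?:my|the|our)\s+(\w+)\s+(?:was|were|is|are)\s+(?:very\s+)?(\w+)",
    re.IGNORECASE,
)


def extract_facets(text, min_mentions=2):
    """Group attributes by facet and keep facets mentioned often enough."""
    facets = defaultdict(list)
    for subject, attribute in PATTERN.findall(text):
        facets[subject.lower()].append(attribute.lower())
    return {
        facet: attributes
        for facet, attributes in facets.items()
        if len(attributes) >= min_mentions
    }


reviews = "My waiter was rude. The food was great. The waiter was slow."
print(extract_facets(reviews))
```

Here "waiter" is reported with the attributes "rude" and "slow," while "food" is left out because it is mentioned only once.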

Language coverage

Semantria offers a full suite of NLP features for many languages: English, French, Spanish, Portuguese, Italian, German, Mandarin, Dutch, Korean, Japanese, Malay, Indonesian, and Singlish. In addition, Semantria offers a more limited set of NLP features for Arabic, Russian, Swedish, Danish and Norwegian.

Crawling and content analysis

Semantria’s partner, Diffbot, provides you with a quick and easy way to retrieve the raw data you need for analysis. Diffbot’s APIs allow you to, among other things, automatically crawl entire sites, identify pages that contain articles, and send the text to Semantria for full analysis.

Diffbot + Semantria

With Diffbot's Computer Vision technology and Semantria’s NLP solution, you can understand the sentiments and ideas surrounding any product, service, or brand without having to read or copy and paste anything from the entire website.

Diffbot's Article API takes text posts and categorizes the text into elements like title, author, publishing date, and body text using Computer Vision technology. It strips away all extraneous information and sends the relevant text to Semantria for analysis. Diffbot's Computer Vision emulates how a human looks at a page, so it takes more than a page's HTML into account to determine useful text. Computer Vision ensures that Semantria will process clean text and you will get actionable data.
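
The first half of that flow might look like the sketch below, which builds an Article API request and pulls the clean fields out of a decoded response. The endpoint and the `objects`, `title`, and `text` field names are assumptions based on Diffbot's v3 Article API and should be checked against Diffbot's documentation; the token and sample response are placeholders, and the hand-off to Semantria is left out because its API is not described here.

```python
from urllib.parse import urlencode

# Assumed Diffbot v3 Article endpoint; verify against Diffbot's documentation.
DIFFBOT_ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_article_request(token, page_url):
    """Build the GET URL for extracting an article from a page."""
    query = urlencode({"token": token, "url": page_url})
    return f"{DIFFBOT_ARTICLE_ENDPOINT}?{query}"


def extract_clean_text(response):
    """Pull title and body text from a decoded Article API response."""
    return [
        {"title": item.get("title", ""), "text": item.get("text", "")}
        for item in response.get("objects", [])
    ]


# Placeholder response shaped like the assumed API output (not real data).
sample_response = {
    "objects": [
        {"type": "article", "title": "Hotel review", "text": "My waiter was rude."}
    ]
}

request_url = build_article_request("YOUR_TOKEN", "https://example.com/review")
documents = extract_clean_text(sample_response)
print(request_url)
print(documents[0]["text"])
```

Each extracted `text` value is what would then be submitted for sentiment, entity, and theme analysis.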
