Auto Topic Discovery
As a company, Decooda operates in the text analytics space. Our clients rely on us to provide them with deep and meaningful insights into how their customers feel, how they talk, and what they are saying about a brand or service. One of the key steps involved in producing these insights is identifying the relevant topics being discussed.
Understanding speech is about more than just finding a keyword. In the past, linguists had to read through lengthy documents to manually create a list of topics and phrases from which classifiers could be created. This process is tedious, time-consuming, subjective, and unstandardized. Decooda’s Auto Topic Discovery (ATD) technology is designed to bridge the gap between machine learning and natural language processing (NLP), a branch of artificial intelligence that helps computers understand, interpret, and emulate human language.
This hybridization between NLP and machine learning is at the core of what ATD does.
From Unquantifiable to Invaluable
Unstructured text is the largest human-generated data source, far outstripping any other. On average, we produce more than 500 million tweets, 5.5 billion SMS text messages, and 280 billion emails every day, astronomical numbers that will only continue to grow. Between customer call logs, support call logs, emails, surveys, product feedback, questionnaires, documents, reports, social media posts, and text messages, as well as messaging systems such as Teams and Slack, organizations collect a massive amount of unstructured data every day, along with the countless valuable insights buried within it. By bringing these insights to light, business leaders are better equipped to make decisions, inform product strategy, improve customer experience, coach and train employees, and discover new ways of selling products to customers.
Decooda’s ATD technology allows our customers to rapidly cull through mountains of unstructured data in an automated fashion. Customers often avoid new NLP projects because they feel overwhelmed, or because data scientists are reluctant to take on the tedious work of annotating all of the data. By nature, unstructured data is complicated and messy. Unlike structured data, it is complex and nuanced, and it refuses to fit neatly into columns and rows.
To make things even more complicated, conversations and written language contain everything from objective statements to subjective perspectives and opinions. In spoken language, sentences typically convey emotion and sentiment, and the exact same sentence can mean different things to different people. The nuance of how a sentence is spoken, or even the inflection of a specific voice, can convey different meanings, with sentiments that can be classified as positive, negative, or neutral. Decooda automatically assigns emotion and sentiment to Topics using our patent-pending technology.
ATD consists of two components: machine learning and natural language processing techniques that rapidly curate vast amounts of data, and a classifier workbench designed to make data wrangling and human curation rapid, repeatable, and easy. Designed to reduce the human effort required to identify topics within mounds of unstructured data, ATD can automatically capture and categorize the sentiment and emotion buried within.
Functioning as a semi-supervised hybrid between machine learning and natural language processing, ATD culls through mountains of unstructured data looking for specific classifiers. It first preprocesses the input text to obtain clean text, then extracts n-grams and phrases using frequency and part-of-speech tags. Taking semantic and lexical similarities into account, it identifies topics among the extracted phrases and, through sentiment analysis, maps phrases to their respective topics.
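To make the cleaning and phrase-extraction steps concrete, here is a minimal sketch in Python. It is not Decooda's implementation: the function names, the tiny stopword list, and the sample documents are all hypothetical, and a crude stopword filter stands in for the part-of-speech tagging that ATD uses.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list. It is a crude stand-in for
# part-of-speech filtering, which a real pipeline would use instead.
STOPWORDS = {"the", "a", "an", "is", "was", "and", "to", "of",
             "my", "i", "it", "for", "on", "in", "about"}

def clean(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def extract_phrases(docs, n=2, min_count=2):
    """Count n-grams that neither start nor end with a stopword,
    keeping only those seen at least min_count times."""
    counts = Counter()
    for doc in docs:
        tokens = clean(doc)
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}

# Made-up sample documents for illustration only.
docs = [
    "The battery life is terrible and the screen cracked.",
    "Battery life was great, but customer service was slow.",
    "I called customer service about my battery life.",
]
phrases = extract_phrases(docs)
print(phrases)
```

On this sample, only "battery life" and "customer service" recur often enough to survive the frequency cutoff, which is the kind of candidate phrase a curator would then review.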
By curating the machine learning-generated topics with some subject matter expertise, the user is able to define new training parameters, which can then be resubmitted to Decooda 2’s auto discovery capability for output that is even more refined and nuanced. Each pass increases granularity, recall, and precision.
Deconstructing Decooda 2
“Auto topic discovery” is the name we use for our hybrid NLP and machine learning pipeline. Currently, by using our workbench, users can select the pipeline of their choice from two primary options: Decooda 1 and, as explained here in greater detail, Decooda 2. Customers can leverage their own specific pipelines with Decooda technology.
The objective behind Decooda 2 is to reduce the amount of human effort that would typically be required to identify topics, which we call Classifiers, within a document set. Through a combination of machine learning and natural language processing, it becomes possible to support and integrate multiple different approaches within a single model. That model represents the pipeline.
In the Decooda 2 pipeline, there are about 19 different steps, each of which empowers the user with unique capabilities for understanding the material submitted. Among many others, these high-level techniques include:
- Part Of Speech Tagging (to Determine Linguistically Salient Phrases)
- Keyword Lemmatization (to Concisely and Descriptively Name Topics)
- Contextual Similarity
- Semantic Similarity
- Context-Sensitive Word Vectors
- Phrase Frequency Analysis
Some of the models and algorithms we use within the pipeline are GloVe word vectors, the Valence Aware Dictionary and sEntiment Reasoner (VADER) for sentiment analysis, neural models for part-of-speech tagging, rule-based lemmatizers, and cosine distance as the measure of semantic similarity for mapping phrases to topics.
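As an illustration of how cosine similarity can map a phrase to its nearest topic, the sketch below averages word vectors for a phrase and picks the closest topic. The three-dimensional vectors are invented toy values standing in for real GloVe embeddings, and the helper names are hypothetical rather than part of the Decooda pipeline.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def phrase_vector(phrase, vectors):
    """Average the vectors of the words we have embeddings for."""
    rows = [vectors[w] for w in phrase.split() if w in vectors]
    if not rows:
        return None
    return [sum(col) / len(rows) for col in zip(*rows)]

def assign_topic(phrase, topics, vectors):
    """Return the topic whose vector is most similar to the phrase."""
    pv = phrase_vector(phrase, vectors)
    if pv is None:
        return None  # no known words, so nothing to compare
    return max(topics,
               key=lambda t: cosine_similarity(pv, phrase_vector(t, vectors)))

# Toy 3-dimensional vectors (assumed values, not real GloVe output).
vectors = {
    "battery": [0.9, 0.1, 0.0],
    "charger": [0.8, 0.2, 0.1],
    "power":   [0.85, 0.15, 0.05],
    "support": [0.1, 0.9, 0.1],
    "agent":   [0.2, 0.8, 0.0],
    "rude":    [0.0, 0.7, 0.3],
}
topics = ["power", "support"]

for phrase in ["battery charger", "rude agent"]:
    print(phrase, "->", assign_topic(phrase, topics, vectors))
```

Cosine distance is simply one minus this similarity, so choosing the most similar topic is the same as choosing the one at the smallest cosine distance.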
Seamless Human-Machine Integration
One of the more powerful elements of ATD is its ability to provide a semi-supervised or unsupervised machine learning approach that can be run by a layperson with limited training. This means that a human can review the results of running a set of documents through Decooda 2 and then curate those results. The semi-supervised process consists of submitting documents to the pipeline with a selected model and applying the seed file, using the workbench to curate the results, and then resubmitting those curated results against new documents (or the same ones). Each iteration improves the machine learning and NLP output until the results are satisfactory to the user. At that point, the regular expressions (regex) are automatically generated and the results are submitted to our core product. Core will then tag and score all submitted documents against the auto-generated regex classifiers.
We at Decooda understand that even the most sophisticated machine-driven process can never fully replace human discretion and knowledge. Employing NLP algorithms to detect topics and associated phrases within those topics, the ATD tool makes it easy for users to modify and correct the results of machine learning. In other words, the algorithm is trainable. Customers can load in subject-matter databases to guide the NLP machine learning process, and human domain knowledge can be applied at any stage of the process to curate the results.
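The final step, turning curated phrases into regex classifiers and tagging documents against them, can be sketched roughly as follows. This is an assumption about the general shape of such a step, not the actual generator or the Core API: the function names, the curated topics, and the sample sentences are hypothetical, and scoring is left out.

```python
import re

def build_classifier(phrases):
    """Compile one case-insensitive regex that matches any curated phrase
    as whole words, tolerating variable whitespace between words."""
    parts = []
    # Longest phrases first, so longer alternatives are tried before shorter ones.
    for phrase in sorted(phrases, key=len, reverse=True):
        words = [re.escape(w) for w in phrase.split()]
        parts.append(r"\s+".join(words))
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)

# Hypothetical output of a curation session: topic -> approved phrases.
curated = {
    "Battery": ["battery life", "battery drain"],
    "Customer Service": ["customer service", "support agent"],
}
classifiers = {topic: build_classifier(p) for topic, p in curated.items()}

def tag(document):
    """Return the sorted list of topics whose classifier matches the document."""
    return sorted(t for t, rx in classifiers.items() if rx.search(document))

print(tag("The Battery  life is poor and the support agent was rude."))
print(tag("Nice screen."))
```

Escaping each word with `re.escape` keeps punctuation in a curated phrase from being read as regex syntax, and the word boundaries prevent a match inside a longer word.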
By substituting in the pipeline that best suits their needs, data scientists are able to define the parameters within each of the machine learning and NLP techniques used inside of ATD. This unprecedented amount of control bridges the gap between Decooda’s linguists and data scientists and those of our customers.
In the future, we’ll discuss how our workbench technology, our topic palette and our emotion palette make it possible for anyone to become a linguist.