Natural Language Processing & Text Mining

Text mining is everywhere, even if you’ve never heard of it. If you have ever asked Google a question, you’ve mined text. It is inescapable in our daily digital lives, from spam filtering in our mailboxes to conversations with chatbots. The fields of text mining and natural language processing are rapidly expanding and becoming more integrated into website visitor journeys. Models have been developed that can summarize long texts, retrieve the correct answers to questions from texts, and even write their own creative blogs.

The Basics of Text Mining

On a professional level, text mining is becoming more and more useful. In the world of finance, for example, text mining is creeping its way into stock market prediction, fraud detection, risk management software, and customer relationship management. Machines and software are already analyzing and deploying language autonomously, and this will only increase the need for text mining in daily and professional life.

We simply cannot ignore text; written language is everywhere and has been with us for millennia. With the rapid rise and spread of the internet and social media, text has become even more present as it has gone digital. Digitizing language allows us to engineer and program it to match the needs of our present generation.

Working with raw data is challenging enough; working with text data presents particular challenges. How do you describe a collection of documents whose words come in countless variations of syntax and lexicon? How do you take the mean or median of a sentence or phrase? How do you handle words with double or triple meanings? The answer: text mining. Text mining focuses on converting text data into useful information and insights. Anyone aspiring to work in the world of data needs at least a basic understanding of it.

Challenges of Working With Text

Before engineering text data, it’s important to understand how it differs from other forms of data, especially in its variability. Most other types of data are limited or clearly defined; text data involves nearly endless variation. Ask ten different people to describe a Christmas tree right in front of them, and they will all come up with slightly different descriptions of the same tree. And what if these people speak different languages? There are around 6,500 languages across the world, each with its own grammar and vocabulary. With the global reach of the internet, text data can arrive in many different scripts and alphabets.

Even if we assume the ten descriptions are in the same language, the variations humans recognize and process easily are confounding for computer programs. Our brains are hardwired from an early age to learn the rules of language and writing through years of socialization and education. Computers need very clear parameters and instructions, all at once, before they can even approach the basics of language.

These are the general challenges:

  • Ambiguity arises when the meaning of a word or sentence is unclear, or when it has multiple possible meanings. The phrase “Call me a taxi, please.” can express two different commands without any context or social knowledge. Humans resolve the ambiguity through context and past experience: we assume the speaker needs transportation, rather than wanting to be addressed as if their name were ‘a taxi’. Computer programs have none of that context; they have no previous knowledge about the world to help them make an instantaneous interpretation.
  • Synonyms are easier for humans to understand than for computer programs. Everyone knows that “start, commence, begin, and initiate” mean (nearly) the same thing, and we catch the subtle nuances between them. A computer program does not have that inherent knowledge; it simply sees four different combinations of letters.
  • Morphology describes how words are formed, especially in relation to other words and sentence clauses. Any English speaker can see that “wait, waiting, waited, waits” are variations of the same word and carry the same core meaning. Even when we don’t recognize a word, we can often infer its meaning from similar words and the surrounding phrases or sentences. Once again, computer programs have a much harder time using context and nuance to understand text.

The common theme: the rules of language, with all their variation, nuance, and reliance on context, are the Achilles heel of software that processes text. Language is endless in its possible variation and forces us to be creative in how we teach our machines to handle it. By using the key principles of text mining, it’s possible to greatly reduce, or at least catalog, this vast variability.

Reducing the Variability of Text

Text mining involves simplifying texts and documents into a format computers can process, and then drawing insights from that simplified form. But first, we need to reduce the variation and define our assumptions and parameters. We’ll assume that the distinct words in each text are what carry the insight. The aim is to preserve as much potential for insight as possible while reducing the variation in the texts. The many techniques at your disposal are often surprisingly simple. These are the most common techniques for reducing variation in texts (short code sketches follow the list):

  1. Tokenize texts into individual words and/or sentences, essentially segmenting them into smaller parts. A computer program ‘sees’ an entire text as one object, but you want to work with individual words or sentences, so you need to convert that one object into multiple objects.
  2. Convert uppercase to lowercase. ‘a’ is different from ‘A’ for a computer program. And while capitalization does carry some information, it also greatly increases the variability of a piece of text. One of the first steps in text mining is therefore to convert texts from mixed case to entirely lowercase, which roughly halves the number of distinct letter symbols a program has to handle.
  3. Remove stop words. Some words occur very frequently but don’t add much meaning to a sentence, and removing them eliminates a lot of variation. Words like ‘the’, ‘a’, ‘and’, ‘for’ or ‘he’ don’t carry much weight and can therefore be omitted. Standard lists of stop words are available, and you can add your own custom words if other terms occur too frequently without providing extra meaning. Removing stop words reduces the amount of data and ensures that the remaining words actually contribute to the meaning of the text.
  4. At the other extreme, remove very rare words. Some words appear so infrequently that it is difficult to pin down their meaning. Whether to keep or remove them depends on your use case, so carefully weigh the benefit of removing or keeping rare words.
  5. Remove punctuation. Besides letters, a written document contains many extra symbols. These, too, only add to the variability of the texts and the combinations of words. They may make a text readable to a human, but they won’t help a computer program extract meaning.
  6. Generalize types of words. Texts might contain words or numbers that signal a category, such as email addresses or mobile phone numbers. Every phone number is different, but it’s always a set combination of digits. For your use case it might not be useful to recognize individual telephone numbers, but it could be helpful to label them as telephone numbers. In practice, you can replace all telephone numbers with the same standard placeholder: you reduce the variability, but keep the most relevant information.
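To make steps 1, 2, 3 and 5 concrete, here is a minimal sketch in plain Python. The tiny stop word list and the example sentence are illustrative assumptions; in practice you would use a fuller standard list (for example the ones shipped with NLTK or spaCy) plus your own custom additions.

```python
import re

# Illustrative stop word list; a real project would use a standard list plus custom words.
STOP_WORDS = {"the", "a", "an", "and", "for", "he", "she", "it", "is", "to", "of"}

def clean_text(text):
    """Steps 1, 2, 3 and 5: tokenize, lowercase, drop stop words and punctuation."""
    text = text.lower()                      # step 2: 'A' and 'a' become the same symbol
    tokens = re.findall(r"[a-z]+", text)     # steps 1 and 5: keep only alphabetic tokens
    return [tok for tok in tokens if tok not in STOP_WORDS]  # step 3

print(clean_text("The taxi is waiting for the customer, and he starts to complain."))
# ['taxi', 'waiting', 'customer', 'starts', 'complain']
```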
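Steps 4 and 6 can be sketched just as simply. The phone number pattern and the minimum count threshold below are assumed, simplified choices for illustration; real phone number formats and sensible thresholds vary per project.

```python
import re
from collections import Counter

def generalize_phone_numbers(text, placeholder="PHONE_NUMBER"):
    """Step 6: replace anything that looks like a phone number with one standard token."""
    return re.sub(r"\+?\d[\d\s\-]{7,}\d", placeholder, text)  # simplified pattern

def remove_rare_words(tokenized_docs, min_count=2):
    """Step 4: drop words occurring fewer than min_count times across all documents."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in tokenized_docs]

print(generalize_phone_numbers("Call us at +31 6 1234 5678 for a quote."))
# Call us at PHONE_NUMBER for a quote.

docs = [["taxi", "waiting", "customer"], ["taxi", "fare", "customer"]]
print(remove_rare_words(docs))
# [['taxi', 'customer'], ['taxi', 'customer']]
```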

Retrieving Insights from Cleaned Text

All of these techniques are the essential first steps in working with text data. Executing them results in a clean, organized dataset based on the original text. The next stage is retrieving insights from that cleaned text.

Data scientists typically start with two techniques when working with cleaned text data (both are sketched in code after the list):

  • Bag-of-words model (BOW). This model represents a text as the collection of its words, keeping word frequencies but discarding grammar and word order. It’s one of the most basic, yet very effective, ways of representing a text, and it’s frequently used in document classification.
  • Term Frequency – Inverse Document Frequency (TF-IDF). TF-IDF is a step up from the bag-of-words model. While BOW only takes the frequency of a word into account, TF-IDF also accounts for a word’s relative importance: the value grows with how often a word appears in a document, and is offset by how many documents in the collection contain that word. It’s a simple yet brilliant technique, often used to weigh terms in document search.
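A bag-of-words representation can be as simple as counting the tokens in each document. Here is a minimal sketch using only the standard library, with two made-up example sentences:

```python
from collections import Counter

# Each document becomes a bag of words: word -> frequency, with grammar and word order discarded.
docs = [
    "the taxi driver calls the customer",
    "the customer calls a taxi",
]
bags = [Counter(doc.lower().split()) for doc in docs]
print(bags[0])
# Counter({'the': 2, 'taxi': 1, 'driver': 1, 'calls': 1, 'customer': 1})
```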
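TF-IDF is usually computed with a library rather than by hand. A minimal sketch, assuming scikit-learn is installed and using the same kind of toy documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the taxi driver calls the customer",
    "the customer calls a taxi",
    "the customer writes a review",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)        # sparse matrix: one row per document, one column per term
print(vectorizer.get_feature_names_out())     # the learned vocabulary
print(tfidf.toarray().round(2))               # each term's TF-IDF weight per document
```

A frequent word like ‘the’ ends up with a relatively low weight because it appears in every document, while a word like ‘review’ gets a higher weight in the one document that contains it.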

NLP: Evolving and Expanding

Though BOW and TF-IDF may be basic, they are both still widely used and can help you gain immediate insights from your texts. These two techniques will be the first steps in your text mining adventure. Once they are mastered, the really exciting stuff begins: their output is the raw material used to train machine learning models. These are just the basics of text mining; the wider field of Natural Language Processing (NLP) is expanding and evolving at the speed of technology. In this blog you can read more about advanced natural language processing techniques.