Natural language processing (NLP) has developed rapidly in recent years and is improving our lives in many ways. So, what exactly is NLP and how does it work? Here’s a simple introduction.
- What is natural language processing?
- What can natural language processing do?
- How does natural language processing work?
What is natural language processing?
Natural language processing, or NLP, is a type of artificial intelligence (AI) that specializes in analyzing human language.
It does this by:
- Reading natural language, which has evolved through natural human usage and that we use to communicate with each other every day
- Interpreting natural language, typically through probability-based algorithms
- Analyzing natural language and providing an output
Have you ever used Apple’s Siri and wondered how it understands (most of) what you’re saying? This is an example of NLP in practice.
NLP is becoming an essential part of our lives, and together with machine learning and deep learning, produces results that are far superior to what could be achieved just a few years ago.
In this article, we’ll take a closer look at NLP, see how it’s applied, and learn how it works.
What can natural language processing do?
NLP is used in a variety of ways today, including machine translation, speech recognition, chatbots, sentiment analysis, and email classification.

Machine translation

When was the last time you visited a foreign country and used your smartphone for language translation? Perhaps you used Google Translate? This is an example of NLP machine translation.
Machine translation uses NLP to translate text from one language into another. Historically, this was done with simple rules-based methods, but today's NLP techniques are a big improvement on them.

To perform well at machine translation, NLP employs deep learning. This form of machine translation is often called neural machine translation (NMT), since it makes use of neural networks. NMT interprets language using a statistical, trial-and-error approach, which lets it deal with context and other subtleties of language.
In addition to applications like Google Translate, NMT is also used in a range of business applications, such as:
- Translating plain text, web pages or files such as Excel, PowerPoint or Word. Systran is an example of a translation services company that does this.
- Translating social feeds in real-time, as offered by SDL Government, a company specializing in public sector language services.
- Translating languages in medical situations, such as when an English-speaking doctor is treating a Spanish-speaking patient, as offered by Canopy Speak.
- Translating financial documents such as annual reports, investment commentaries and information documents, as offered by Lingua Custodia, a company specializing in financial translations.
Speech recognition

Earlier, we mentioned Siri as an example of NLP. One particular feature of NLP used by Siri is speech recognition. Alexa and Google Assistant (“OK Google”) are other well-known examples of NLP speech recognition.
Speech recognition isn’t a new science and has been around for over 50 years. It’s only recently though that its ease-of-use and accuracy have improved significantly, thanks to NLP.
At the heart of speech recognition is the ability to identify spoken words, interpret them and convert them to text. A range of actions can then follow such as answering questions, performing instructions, or writing emails.
The powerful methods of deep learning used in NLP allow today’s speech recognition applications to work better than ever before.
Chatbots

Chatbots are software programs that simulate natural human conversation. Companies use them to help with customer service, consumer queries, and sales enquiries.
You may have interacted with a chatbot the last time you logged on to a company website and used their online help system.
While simple chatbots use rules-based methods, today’s more capable chatbots use NLP to understand what customers are saying and how to respond.
Well-known examples of chatbots include:
- The World Health Organization (WHO) chatbot, built on the WhatsApp platform, which shares information and answers queries about the spread of the COVID-19 virus
- National Geographic’s Genius chatbot, that speaks like Albert Einstein and engages with users to promote the National Geographic show of the same name
- Kian, Korean car manufacturer Kia’s chatbot on Facebook Messenger, which answers queries about Kia cars and helps with sales enquiries
- Whole Foods’ chatbot, which helps with recipe information, cooking inspiration and product recommendations
Sentiment analysis

Sentiment analysis uses NLP to interpret and classify the emotions contained in text data. It is used, for instance, to classify online customer feedback about products or services as positive or negative.
In its simplest form, sentiment analysis can be done by categorizing text based on designated words that convey emotion, like “love”, “hate”, “happy”, ”sad” or “angry”. This type of sentiment analysis has been around for a long time but is of limited practical use due to its simplicity.
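This simple keyword approach can be sketched in a few lines of Python. The word lists below are purely illustrative, not a real sentiment lexicon:

```python
# Toy keyword-based sentiment classifier. The word sets are
# illustrative assumptions, not a curated lexicon.
POSITIVE = {"love", "happy", "great", "excellent"}
NEGATIVE = {"hate", "sad", "angry", "terrible"}

def simple_sentiment(text):
    words = text.lower().split()
    # Score = positive hits minus negative hits.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(simple_sentiment("I love this product"))  # positive
print(simple_sentiment("I hate the delays"))    # negative
```

As the article notes, this breaks down quickly in practice: negation ("not happy"), sarcasm, and words outside the lists are all invisible to it.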
Today’s sentiment analysis uses NLP to classify text based on statistical and deep learning methods. The result is sentiment analysis that can handle complex and natural-sounding text.
Businesses worldwide are taking a huge interest in sentiment analysis nowadays. It can provide valuable insights into customer preferences, satisfaction levels, and opinions, which can inform marketing campaigns and product design.
Email classification

Email overload is a common challenge in the modern workplace. NLP can help to analyze and classify incoming emails so that they can be automatically forwarded to the right place.
In the past, simple keyword-matching techniques were used to classify emails. This had mixed success. NLP allows a far better classification approach as it can understand the context of individual sentences, paragraphs, and whole sections of text.
Given the sheer volume of emails that businesses have to deal with today, NLP-based email classification can be a great help in improving workplace productivity. Classification using NLP helps to ensure that emails don’t get forgotten in over-burdened inboxes and are properly filed for further action.
How NLP is Used

| Application | What it does | Examples |
| --- | --- | --- |
| Machine translation | Translate text from one language into another | Translating emails, web pages, social feeds, financial documents, and conversations in travel and medical situations |
| Chatbots | Simulate natural human conversation | Online help, product queries, customer engagement |
| Speech recognition | Identify spoken words and convert them to text | Siri, Alexa, Google Assistant |
| Sentiment analysis | Interpret and classify emotions contained in text data | Classify online customer comments as positive or negative; assess feedback on marketing campaigns |
| Email classification | Classify and sort emails | Automatic forwarding of emails to target folders |
How does natural language processing work?
Now that we’ve seen what NLP can do, let’s try and understand how it works.
In essence, NLP works by transforming a collection of text information into designated outputs.
If the application is machine translation, then the input text information would be documents in the source language (say, English) and the output would be the translated documents in the target language (say, French).
If the application is sentiment analysis, then the output would be a classification of the input text into sentiment categories. And so on.
The NLP workflow
Modern NLP is a mixed discipline that draws on linguistics, computer science, and machine learning. The process, or workflow, that NLP uses has three broad steps:
Step 1 – Text pre-processing
Step 2 – Text representation
Step 3 – Analysis and modeling
Each step may use a range of techniques that are constantly evolving with continued research.
Step 1: Text pre-processing
The first step is to prepare the input text so that it can be analyzed more easily. This part of NLP is well established and draws on a range of traditional linguistic methods.
Some of the key approaches used in this step are:
- Tokenization, which breaks up text into useful units (tokens). This separates words using blank spaces, for instance, or separates sentences using full stops. Tokenization also recognizes words that often go together, such as “New York” or “machine learning”. As an example, the tokenization of the sentence “Customer service couldn’t be better” would result in the following tokens: “customer service”, “could”, “not”, “be” and “better”.
- Normalization transforms words to their base form using techniques like stemming and lemmatization. This is done to help reduce ‘noise’ and simplify the analysis. Stemming identifies the stems of words by removing their suffixes. The stem of the word “studies”, for instance, is “studi”. Lemmatization similarly removes suffixes, but also removes prefixes if required and results in words that are normally used in natural language. The lemma of the word “studies”, for instance, is “study”. In most applications, lemmatization is preferred to stemming as the resulting words have more meaning in natural speech.
- Part-of-speech (POS) tagging draws on morphology, the study of the forms and structure of words. Words (or tokens) are tagged based on their function in sentences. This is done by using established rules derived from text corpora to identify each word's role in speech, i.e. verb, noun, adjective, etc.
- Parsing draws on syntax, or the understanding of how words and sentences fit together. This helps to understand the structure of sentences and is done by breaking down sentences into phrases based on the rules of grammar. A phrase may contain a noun and an article, such as “my rabbit”, or a verb as in “likes to eat carrots”.
- Semantics identifies the intended meaning of words used in sentences. Words can have more than one meaning. For example “pass” can mean (i) to physically hand over something, (ii) a decision to not take part in something, or (iii) a measure of success in an exam. A word’s meaning can be understood better by looking at the words that appear before and after it.
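A minimal sketch of tokenization and stemming in plain Python illustrates these ideas. Real NLP libraries use far more sophisticated rule sets (the Porter stemmer, for example); the contraction handling and suffix list here are toy assumptions, and multi-word tokens like "customer service" would need extra logic:

```python
import re

def tokenize(text):
    # Lowercase and split on non-letter characters. "couldn't" would
    # split into "couldn" + "t", so expand the contraction first.
    text = text.lower().replace("couldn't", "could not")
    return re.findall(r"[a-z]+", text)

def naive_stem(word):
    # Toy stemmer: strip a few common suffixes. Real stemmers apply
    # ordered rule sets with many special cases.
    for suffix in ("ies", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            return stem + "i" if suffix == "ies" else stem
    return word

print(tokenize("Customer service couldn't be better"))
# ['customer', 'service', 'could', 'not', 'be', 'better']
print(naive_stem("studies"))  # 'studi'
```

Note how the stem "studi" matches the article's example; a lemmatizer would instead return the dictionary form "study".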
Step 2: Text representation
In order for text to be analyzed using machine and deep learning methods, it needs to be converted into numbers. This is the purpose of text representation.
Some key methods used in this step are:
Bag of words
Bag of words, or BoW, is an approach that represents text by counting how many times each word in an input document occurs in comparison with a known list of reference words (vocabulary).
The result is a set of vectors that contain numbers depicting how many times each word occurs. These vectors are called ‘bags’ as they don’t include any information about the structure of the input documents.
To illustrate how BoW works, consider the sample sentence “the cat sat on the mat”. This contains the words “the”, “cat”, “sat”, “on” and “mat”. The frequency of occurrence of these words can be represented by a vector of the form [2, 1, 1, 1, 1]. Here, the word “the” occurs twice and the other words occur once.
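The counting above can be sketched directly in Python. This is a toy illustration; real implementations build the vocabulary from an entire corpus rather than a single sentence:

```python
def bag_of_words(text, vocabulary):
    # Count how many times each vocabulary word occurs in the text.
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vocab = ["the", "cat", "sat", "on", "mat"]
print(bag_of_words("the cat sat on the mat", vocab))
# [2, 1, 1, 1, 1]

# Against a larger vocabulary, words absent from the sentence get zeros,
# which is how the vectors become sparse.
vocab_large = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(bag_of_words("the cat sat on the mat", vocab_large))
# [2, 1, 1, 1, 1, 0, 0]
```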
When compared with a large vocabulary, the vector will expand to include several zeros. This is because all of the words in the vocabulary which aren’t contained in the sample sentence will have zero frequencies against them. The resulting vector may contain a large number of zeros and hence is referred to as a ‘sparse vector’.
The BoW approach is fairly straightforward and easy to understand. The resulting sparse vectors, however, can be very large when the vocabulary is large. This leads to computationally challenging vectors that don’t contain much information (i.e. are mostly zeros).
Further, BoW looks at individual words, so any information about words that go together is not captured. This results in a loss of context for later analysis.
Bag of n-grams
One way of reducing the loss of context with BoW is to create vocabularies of grouped words rather than single words. These grouped words are referred to as ‘n-grams’, where ‘n’ is the grouping size. The resulting approach is called ‘bag of n-grams’ (BNG).
The advantage of BNG is that each n-gram captures more context than single words.
In the earlier sample sentence, “sat on” and “the mat” are examples of 2-grams, and “on the mat” is an example of a 3-gram.
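Generating n-grams from a token list is straightforward; this sketch reproduces the 2-grams and 3-grams mentioned above:

```python
def ngrams(words, n):
    # Slide a window of size n across the token list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
print(ngrams(tokens, 3))
# ['the cat sat', 'cat sat on', 'sat on the', 'on the mat']
```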
TF-IDF

One issue with counting how many times words appear in documents is that certain words start to dominate the count: words like “the”, “a” or “it”. These words occur frequently but don’t carry much information.

One way to deal with this is to treat words that appear frequently across documents differently from words that appear rarely. The frequent words tend to be low-value words like “the”. The counts of these words can be penalized to help reduce their dominance.
This approach is called ‘term frequency – inverse document frequency’ or TF-IDF. Term frequency looks at the frequency of a word in a given document while the inverse document frequency looks at how rare the word is across all documents.
The TF-IDF approach acts to downplay frequently occurring words and highlight more unique words that have useful information, such as “cat” or “mat”. This can lead to better results.
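A minimal TF-IDF calculation can be sketched as follows, using the earlier sample sentence alongside two invented documents. Real implementations add smoothing and other refinements:

```python
import math

def tf_idf(term, doc, docs):
    # Term frequency: the share of this document's words that are `term`.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: the rarer the term is across all
    # documents, the larger its weight.
    n_containing = sum(term in d for d in docs)
    idf = math.log(len(docs) / n_containing)
    return tf * idf

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

# "the" appears in every document, so its idf (and hence tf-idf) is 0.
print(tf_idf("the", docs[0], docs))  # 0.0
# "mat" appears in only one document, so it gets a positive weight.
print(tf_idf("mat", docs[0], docs))
```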
Word embedding

A more sophisticated approach to text representation involves word embedding. This maps each word to individual vectors, where the vectors tend to be ‘dense’ rather than ‘sparse’ (i.e. smaller and with fewer zeros). Each word and the words surrounding it are considered in the mapping process. The resulting dense vectors allow for better analysis and comparison between words and their context.
Word embedding approaches use powerful machine learning and deep learning to perform the mapping. It is an evolving area that has produced some excellent results. Key algorithms in use today include Word2Vec, GloVe, and FastText.
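Dense vectors are typically compared using cosine similarity. The sketch below uses tiny, hand-made 3-dimensional vectors purely for illustration; real embeddings produced by Word2Vec, GloVe, or FastText have hundreds of dimensions and are learned from large corpora:

```python
import math

def cosine_similarity(u, v):
    # Angle-based similarity between two dense vectors: values near 1.0
    # mean the vectors point in nearly the same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings, invented for this example.
embeddings = {
    "cat":    [0.90, 0.10, 0.20],
    "kitten": [0.85, 0.15, 0.25],
    "car":    [0.10, 0.90, 0.60],
}

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))  # close to 1
print(cosine_similarity(embeddings["cat"], embeddings["car"]))     # much lower
```

In a trained embedding space, related words like "cat" and "kitten" really do end up with high cosine similarity, which is what enables the comparisons between words and their context described above.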
| Method | Approach | Advantages | Limitations |
| --- | --- | --- | --- |
| Bag of words | Count occurrences of input words against a vocabulary | Simple and relatively easy | Produces sparse vectors; does not capture context |
| Bag of n-grams | Count occurrences of n-grams (e.g. word pairs or triplets) against a vocabulary | Relatively easy; allows for some context capture | Also produces sparse vectors; may require large vocabularies of n-grams |
| TF-IDF | Penalize frequent words and highlight unique words | Emphasizes more meaningful words | Similar limitations to bag of words |
| Word embedding | Map words to vectors, allowing context and better comparisons | More sophisticated capture of context; computationally more efficient (smaller vectors) | One-to-one mapping of words to vectors struggles when words have more than one meaning |
Step 3: Analysis and modeling
The final step in the NLP process is to perform calculations on the vectors generated through steps 1 and 2, to produce the desired outcomes. Here, machine learning and deep learning methods are used. Many of the same machine learning techniques from non-NLP domains, such as image recognition or fraud detection, may be used in this analysis.
Consider sentiment analysis. This can be done using either supervised or unsupervised machine learning. Supervised machine learning requires pre-labeled data while unsupervised machine learning uses pre-prepared databases of curated words (lexicons) to help with classifying sentiment.
Using machine learning, input text vectors are classified using a probabilistic approach. This is done through either a trained model (supervised machine learning) or by comparison with a suitable lexicon (unsupervised machine learning).
The outcomes are sentiment classifications based on the probabilities generated through the machine learning process.
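As a toy illustration of the supervised route, the sketch below trains a minimal Naive Bayes-style classifier on a handful of invented, pre-labeled sentences, then classifies new text by comparing log-probabilities. Real systems use far larger training sets, proper text representation as described in Step 2, and richer models:

```python
import math
from collections import Counter

# Invented pre-labeled training data (the "supervised" ingredient).
train = [
    ("great service and friendly staff", "positive"),
    ("i love this product", "positive"),
    ("terrible experience very slow", "negative"),
    ("i hate the new design", "negative"),
]

# Count word occurrences per class.
counts = {"positive": Counter(), "negative": Counter()}
for text, label in train:
    counts[label].update(text.split())

def classify(text):
    scores = {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values())
        vocab = len(word_counts)
        # Sum of log-likelihoods with add-one smoothing for unseen words.
        # Class priors are equal here, so they are omitted.
        scores[label] = sum(
            math.log((word_counts[w] + 1) / (total + vocab))
            for w in text.split()
        )
    # The class with the highest probability wins.
    return max(scores, key=scores.get)

print(classify("i love the friendly staff"))  # positive
print(classify("very slow and terrible"))     # negative
```

This mirrors the process described above: the trained model assigns each input a probability under each sentiment class, and the classification is the most probable class.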
NLP is developing rapidly and is having an increasing impact on society. From language translation to speech recognition, and from chatbots to identifying sentiment, NLP is providing valuable insights and making our lives more productive.
Modern NLP works by using linguistics, computer science, and machine learning. Over recent years, NLP has produced results that far surpass what we’ve seen in the past.
The basic workflow of NLP involves text pre-processing, text representation, and analysis. A variety of techniques are in use today and more are being developed with ongoing research.
NLP promises to revolutionize many areas of industry and consumer practice. It’s already become a familiar part of our daily lives.
With NLP, we have a powerful way of engaging with a digital future through a medium we are inherently comfortable with – our ability to communicate through natural language.