Crash course on NLP: pre-processing and vectorization

Get your text data ready for modeling

Luis Garcia Fuentes
6 min read · Apr 18, 2023

Intro to NLP

Natural Language Processing (NLP) is a powerful tool for analyzing and understanding text data. As an analyst, you may be new to NLP and unsure of how to get started.

Overall data science pipeline for an NLP project

In this NLP crash course, we will cover the basics of NLP and its various techniques. In this first part, we will focus on pre-processing and vectorization, which are essential steps in preparing your data for NLP models.

  1. Text preprocessing: This involves cleaning, lowercasing, tokenizing, removing stop words, and stemming/lemmatizing text data. This is done to remove unnecessary noise from our data, by keeping only the essential characteristics of our text.
  2. Vectorization: This involves converting the preprocessed text data into a numerical representation that can be used by machine learning models. It is important to remember that computation takes place on numbers, and hence, our text data has to be mapped to a numeric representation.

This section is foundational: regardless of which NLP task or model you eventually employ, you will need a clear understanding of the contents of this article.

Text preprocessing techniques

Pre-processing involves cleaning and transforming raw text data into a format that can be easily analyzed by NLP models. The following are some common pre-processing techniques:

Tokenization:

Tokenization is the process of breaking down a piece of text into individual words, phrases, or other meaningful elements, which are called tokens. In NLP, words are typically separated by spaces, but there are cases where words are combined (e.g. hyphenated words) or punctuation marks are used (e.g. full stops, commas, question marks) in the creation of tokens.

For example, consider the sentence: “The quick brown fox jumps over the lazy dog.” Tokenizing this sentence on word boundaries would produce the tokens [“The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”]. Note that a naive split on spaces alone would leave the full stop attached to “dog”, which is why tokenizers usually handle punctuation explicitly.
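
As a minimal sketch, the example above can be tokenized in Python, first with a naive whitespace split and then with NLTK’s tokenizer (assuming the nltk package and its tokenizer data are installed):

```python
# A minimal tokenization sketch (assumes nltk is installed and its "punkt"
# tokenizer data has been downloaded via nltk.download).
import nltk

sentence = "The quick brown fox jumps over the lazy dog."

# Naive approach: split on whitespace. Punctuation stays attached ("dog.").
print(sentence.split())

# NLTK's word tokenizer also separates punctuation into its own token.
print(nltk.word_tokenize(sentence))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```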

Stemming & lemmatization:

Stemming and lemmatization reduce words to their base or root forms. These actions can help reduce the number of unique words in a text, which can make it easier for a model to find patterns in the data; a short code sketch of both follows the list below.

  • Stemming involves removing the suffixes from words to obtain the stem or root form. For example, the stem of the word “running” is “run”, and the stem of the word “jumps” is “jump”.
  • Lemmatization, on the other hand, involves reducing words to their base or dictionary form, called the lemma. This process takes into account the part of speech of the word and its context in the sentence. For example, the lemma of the word “was” is “be”, and the lemma of the word “jumping” is “jump”.
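
As a minimal sketch of both operations with NLTK (assuming the package and its WordNet data are available):

```python
# A minimal stemming/lemmatization sketch using NLTK (assumes nltk is
# installed and the "wordnet" corpus has been downloaded via nltk.download).
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "jumps", "was", "jumping"]

# Stemming chops off suffixes with heuristic rules; the result is not
# always a real word ("was" becomes "wa").
print([stemmer.stem(w) for w in words])                   # ['run', 'jump', 'wa', 'jump']

# Lemmatization maps to dictionary forms; the part of speech matters
# (pos="v" tells the lemmatizer to treat each word as a verb).
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # ['run', 'jump', 'be', 'jump']
```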

Stop word removal:

Stopwords are common words in a language that are generally considered to be of low value in terms of meaning or relevance to a text. Examples of stopwords in English include “the”, “a”, “an”, “and”, “in”, “to”, “of”, “that”, “is”, “for”, “it”, “with”, and “as”.

Example of a word cloud analysis on data where stop words are not removed

Stopwords are often removed from a text during preprocessing because they can make it more difficult to analyze and understand the important content of the text. Removing stopwords can also help reduce the size of a text corpus, which can be important for computational efficiency and storage.
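
A minimal sketch of stop word filtering with NLTK’s English stop word list (assuming the corresponding corpus has been downloaded):

```python
# A minimal stop word removal sketch (assumes nltk is installed and the
# "stopwords" corpus has been downloaded via nltk.download).
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

# Keep only the tokens that are not in the stop word list.
print([t for t in tokens if t not in stop_words])
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```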

TF-IDF:

In essence, TF-IDF (term frequency–inverse document frequency) assigns each word in a document a score that reflects how distinctive that word is for that document. Words that appear frequently across many documents receive lower scores, since they are assumed to carry little information specific to any single document. TF-IDF can also help identify the topic of a document, as content specific to that document is flagged through higher TF-IDF scores.
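
As a sketch, scikit-learn’s TfidfVectorizer computes these scores directly from raw documents (the toy documents below are invented for illustration):

```python
# A minimal TF-IDF sketch using scikit-learn (assumes scikit-learn is installed).
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "cat", "dog", "barks" are document-specific; "pet" and "day"
# appear in every document.
docs = [
    "my pet cat sleeps all day",
    "my pet dog barks all day",
    "every pet needs food each day",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Words shared across documents get lower scores than words unique to one.
for word, score in zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0]):
    if score > 0:
        print(f"{word}: {score:.2f}")
```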

Vectorizing text; how computers interpret text data

Vectorization involves converting text data into a numerical format that can be used by NLP models. The following are some common vectorization techniques:

Bag of words / Count vectorizer:

In the BoW approach, a text document is first tokenized, which involves breaking the text into individual words. Then, a vocabulary of unique words is created by selecting all the distinct words that appear in the document.

Finally, a vector representation of the text is generated by counting the frequency of each word in the vocabulary within the text. The resulting vector is called the “bag of words” because it represents the text as an unordered collection of words.

For example, consider the sentence: “The cat in the hat chased the rat.” The BoW representation of this sentence would be a vector containing the counts of each word in the sentence: {the: 3, cat: 1, in: 1, hat: 1, chased: 1, rat: 1}.

In a BoW model, each word in the vocabulary is assigned a unique index. A single word can then be represented as a one-hot encoded vector, all zeros except for a one at that index, and the document vector is effectively the sum of these one-hot vectors, i.e. the word counts.
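
A minimal sketch with scikit-learn’s CountVectorizer (assuming scikit-learn is installed), reproducing the counts for the sentence above:

```python
# A minimal bag-of-words sketch using scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat in the hat chased the rat."]

vectorizer = CountVectorizer()        # lowercases and tokenizes by default
counts = vectorizer.fit_transform(docs)

# Each vocabulary word gets a fixed index; the vector stores its count.
vocab = vectorizer.get_feature_names_out()
print(dict(zip(vocab, counts.toarray()[0].tolist())))
# {'cat': 1, 'chased': 1, 'hat': 1, 'in': 1, 'rat': 1, 'the': 3}
```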

While BoW is a simple model, it is often unable to capture the complexities of language, such as word order or sarcasm, and it carries no information about the meaning of words or the relationships between them.

Word embeddings:

Word embeddings are a type of numerical representation of words that captures their meaning and the relationships between them, grounded in how words are actually used in language, in a form that machine learning algorithms can work with.

Word embeddings address BoW limitations by representing each word as a dense vector of numbers, typically with a few hundred dimensions. The dimensions are not individually interpretable, but together they encode syntactic and semantic properties of the word.

The values in the vector are learned from a large corpus of text using unsupervised machine learning techniques, where the goal is to predict the context in which the word appears. By training on a large corpus, the embeddings capture the statistical relationships between words and their contexts.
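
As a sketch, gensim’s Word2Vec can learn such vectors from a tokenized corpus; the corpus and parameters below are purely illustrative (real embeddings are trained on millions of sentences or loaded pre-trained):

```python
# A minimal word embedding sketch using gensim's Word2Vec (assumes gensim >= 4
# is installed). The corpus is far too small to produce meaningful vectors.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "chased", "the", "rat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "fox", "jumps", "over", "the", "dog"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["cat"][:5])                   # first 5 dimensions of the vector for "cat"
print(model.wv.most_similar("cat", topn=2))  # nearest words by cosine similarity
```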

Sentence embedding

Instead of embedding specific words, we can embed whole sentences. Sentence embeddings are a way to represent the meaning of a sentence as a vector, where each dimension of the vector represents a different aspect of the sentence’s meaning. Sentence embeddings are typically created using techniques such as averaging or pooling the word embeddings of the individual words in the sentence.
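
As a sketch of the simplest pooling approach, a sentence vector can be computed by averaging the word vectors of its tokens (the code below assumes a trained gensim Word2Vec model such as the toy one from the previous snippet):

```python
# A minimal sentence embedding sketch: mean-pool the word vectors of the
# tokens in a sentence. `model` is assumed to be a trained gensim Word2Vec
# model, e.g. the toy one trained above.
import numpy as np

def sentence_embedding(tokens, model):
    # Keep only tokens the model has a vector for, then average them.
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

embedding = sentence_embedding(["the", "cat", "chased", "the", "rat"], model)
print(embedding.shape)  # (50,) -- one vector for the whole sentence
```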

There are more sophisticated methodologies for creating sentence embeddings beyond word embedding pooling. These newer methodologies rely on more recent neural network architectures, and while it is not imperative to understand how they work in detail, it is important to recognize that they do not simply pool static word embeddings to arrive at sentence embeddings.

These newer methods can take into account the order and context of words in a sentence, as well as their relationships to one another, resulting in more nuanced and informative sentence embeddings.

Conclusion

The preprocessing steps that an analyst takes can have a significant impact on the accuracy and performance of their machine learning models. Therefore, it is important for the analyst to carefully consider which preprocessing steps to take and which ones to omit.

For example, when deciding between word embeddings and count vectorizers, an analyst might choose word embeddings if they are interested in capturing the semantic relationships between words, such as synonyms or antonyms, or if they want to incorporate contextual information into their models. On the other hand, if the analyst is interested in identifying the most frequently occurring words in a corpus, a count vectorizer might be a better choice. This would be the case, for instance, when classifying product names that are not necessarily real words into product categories.

Once the text data has been preprocessed and vectorized, the next step is to select a machine learning model to train on the data. This will be the focus of the next article in this series.
