Text preprocessing using NLP

Sunilkumar Prajapati
2 min read · Mar 25, 2023


Introduction

Natural Language Processing (NLP) is a subfield of artificial intelligence that deals with the interaction between computers and humans using natural language. NLP is a crucial component of many modern applications, such as chatbots, sentiment analysis, and machine translation. However, before applying NLP techniques to any text data, the text needs to be preprocessed to ensure it is in a suitable format for the analysis. This blog post will provide a step-by-step guide to text preprocessing using NLP techniques, with examples.

Step 1: Lowercasing

The first step in text preprocessing is to lowercase all the text. Lowercasing makes the text uniform and reduces the number of unique words in the dataset. This step is essential because words in different cases, such as “FOX” and “fox”, are treated as distinct words by the computer.

Example:

text = "The quick brown FOX jumps over the Lazy dog"
text = text.lower()  # convert every character to lowercase
print(text)

Output:

the quick brown fox jumps over the lazy dog
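To see the effect on vocabulary size, here is a minimal sketch (using a plain whitespace split just for illustration) that counts unique tokens before and after lowercasing:

text = "The quick brown FOX jumps over the Lazy dog"
print(len(set(text.split())))          # 9 -- "The" and "the" are counted separately
print(len(set(text.lower().split())))  # 8 -- case variants collapse after lowercasing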

Step 2: Tokenization

Tokenization is the process of splitting the text into individual words or tokens. This step is necessary because most NLP algorithms operate at the word level, so the text must first be broken into units that can be counted and analyzed.

Example:

from nltk.tokenize import word_tokenize  # may require a one-time nltk.download('punkt')
text = "the quick brown fox jumps over the lazy dog"  # the lowercased text from Step 1
tokens = word_tokenize(text)
print(tokens)

Output:

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
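Note that word_tokenize does more than split on whitespace: it separates punctuation and contractions into their own tokens, as this quick sketch shows:

from nltk.tokenize import word_tokenize

print(word_tokenize("don't stop!"))

Output:

['do', "n't", 'stop', '!']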

Step 3: Removing Stop Words

Stop words are common words that do not contribute much to the overall meaning of the text. Examples of stop words include “the,” “a,” “an,” “is,” “and,” etc. Removing stop words helps to reduce the number of words in the dataset and improve the efficiency of the analysis.

Example:

from nltk.corpus import stopwords  # may require a one-time nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

tokens = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
filtered_tokens = [word for word in tokens if word not in stop_words]  # drop stop words

print(filtered_tokens)

Output:

['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
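One caveat: stop-word removal is not always harmless. Words like “not” reverse the meaning of a sentence, which matters for tasks such as sentiment analysis. A small sketch of keeping a word by editing the stop-word set:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words.discard('not')  # keep "not" so negations survive

tokens = ['this', 'movie', 'is', 'not', 'good']
print([word for word in tokens if word not in stop_words])

Output:

['movie', 'not', 'good']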

Step 4: Stemming and Lemmatization

Stemming and lemmatization are techniques used to reduce words to their base or root form. This step is important because it reduces the number of unique words in the dataset and helps to capture the essence of the text. Stemming strips suffixes using heuristic rules, so the result may not be a real word (for example, “lazy” becomes “lazi”), while lemmatization reduces words to their dictionary base form (lemma) using a vocabulary lookup.

Example:

from nltk.stem import PorterStemmer, WordNetLemmatizer  # lemmatizer may require a one-time nltk.download('wordnet')

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

tokens = ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
stemmed_tokens = [porter.stem(word) for word in tokens]  # rule-based suffix stripping
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]  # dictionary lookup (treats words as nouns by default)

print(stemmed_tokens)
print(lemmatized_tokens)

Output:

['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']
['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']

Notice that the stemmer turns “lazy” into “lazi”, which is not a real word, while the lemmatizer leaves it intact.
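By default, WordNetLemmatizer treats every word as a noun; passing a part-of-speech tag gives better results for verbs and adjectives. A short sketch:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running'))           # 'running' -- treated as a noun
print(lemmatizer.lemmatize('running', pos='v'))  # 'run' -- treated as a verb
print(lemmatizer.lemmatize('better', pos='a'))   # 'good' -- adjective lookup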
