3 NLTK Strategies for Advanced Text Preprocessing and Linguistic Analysis

# Introduction
Natural language processing (NLP) has undergone an apparent paradigm shift in recent years, with large language models (LLMs) and transformers handling complex end-to-end cognitive tasks. However, in any effective NLP workflow, raw text must still be tokenized, normalized, and analyzed before it reaches the model. While modern NLP libraries and frameworks such as spaCy or Hugging Face are excellent for building general-purpose deep learning pipelines or integrating with LLMs, the Natural Language Toolkit (NLTK) remains a viable, transparent choice for structured linguistic processing, custom text normalization, and statistical corpus analysis.
Unfortunately, many developers mistakenly believe that LLMs render traditional text preprocessing obsolete, or they write preprocessing code using naive methods that discard critical linguistic structure. They break domain terms like "machine learning" into separate, less meaningful words; they perform context-blind lemmatization that yields inaccurate base forms; or they rely on simple frequency counts that miss meaningful word associations.
To build robust, statistically sound NLP models, you need to preserve structural and linguistic context during preprocessing. In this article, we'll go over three key NLTK tricks to improve your text preprocessing:
- MWETokenizer to preserve the integrity of domain terms and names
- context-aware lemmatization with part-of-speech (POS) mapping
- extraction of statistical phrases using collocation finders
# 1. Preserving Domain Terms with the Multi-Word Expression Tokenizer
Tokenization is the foundation of any NLP pipeline. However, standard tokenizers split text strictly on whitespace and punctuation. This becomes a problem when dealing with multi-word domain expressions (e.g. "neural network", "decision tree", or "San Francisco") where individual words come together to form a single semantic concept.
If the tokenizer splits "neural network" into "neural" and "network", a downstream vectorizer (such as Bag-of-Words or TF-IDF) will treat them as unrelated features, diluting the signal and introducing noise. Developers often try to fix this by running regular-expression search-and-replace on the raw text before generating tokens.
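To make the dilution concrete, here is a minimal sketch that uses Python's `collections.Counter` as a stand-in for a Bag-of-Words vectorizer. The token lists are hand-written for illustration (an assumption, not the output of a real tokenizer), and `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter

# Hand-written token lists for the same two sentences (illustrative assumption):
# once with the domain term split, once with it merged into a single token.
split_docs = [
    ["a", "neural", "network", "has", "layers"],
    ["the", "network", "cable", "is", "loose"],
]
merged_docs = [
    ["a", "neural_network", "has", "layers"],
    ["the", "network", "cable", "is", "loose"],
]

def bag_of_words(docs):
    """Count how many documents each token appears in."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))
    return counts

split_bow = bag_of_words(split_docs)
merged_bow = bag_of_words(merged_docs)

# With split tokens, "network" looks like one feature shared by both documents.
print(split_bow["network"])          # 2
# With merged tokens, the ML concept and the hardware sense stay separate.
print(merged_bow["network"])         # 1
print(merged_bow["neural_network"])  # 1
```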
Character-level replacement (e.g. text.replace("neural network", "neural_network")) is brittle. It fails to respect word boundaries, mishandles punctuation, and becomes slow when many patterns are run over a large dataset. A better approach is to tokenize the text first and then use NLTK's native MWETokenizer to merge these tokens cleanly.
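As a small illustration of the word-boundary problem, the sketch below applies plain `str.replace` to two made-up sentences. The term "new york" and the helper name `naive_merge` are hypothetical, chosen only to show the failure mode.

```python
def naive_merge(text, term):
    """Replace a multi-word term at the character level, then split on whitespace."""
    merged = term.replace(" ", "_")
    return text.lower().replace(term, merged).split()

# The intended case works, but the period stays glued to the merged token.
print(naive_merge("She moved to New York.", "new york"))
# ['she', 'moved', 'to', 'new_york.']

# The term also matches inside a longer, unrelated word.
print(naive_merge("He reads the New Yorker.", "new york"))
# ['he', 'reads', 'the', 'new_yorker.']
```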
The naive regex replacement approach relies on character-level string manipulation, which doesn't scale well and, without careful word-boundary anchors, can replace substrings inside unrelated words:
import re
# Sample corpus
raw_texts = [
    "We are studying neural networks and deep learning.",
    "The decision tree is a popular model in machine learning.",
    "A neural network can have many layers."
] * 5000
cleaned_texts = []
for text in raw_texts:
    # Manual string replacements for domain terms (\b anchors each match at a word boundary)
    text = re.sub(r"\bneural networks?\b", "neural_network", text, flags=re.IGNORECASE)
    text = re.sub(r"\bdecision trees?\b", "decision_tree", text, flags=re.IGNORECASE)
    text = re.sub(r"\bmachine learnings?\b", "machine_learning", text, flags=re.IGNORECASE)
    # Tokenize the processed string
    tokens = text.lower().split()
    cleaned_texts.append(tokens)
print("Sample tokens:", cleaned_texts[0])
Output:
Sample tokens: ['we', 'are', 'studying', 'neural_network', 'and', 'deep', 'learning.']
Now let's use NLTK's tokenizers instead. We first create tokens with the standard word_tokenize function, then pass the token stream through an initialized MWETokenizer, which merges tokens at proper word boundaries:
import nltk
from nltk.tokenize import word_tokenize, MWETokenizer
# Ensure NLTK tokenizer resources are downloaded
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # resource name used by word_tokenize in newer NLTK releases
raw_texts = [
    "We are studying neural networks and deep learning.",
    "The decision tree is a popular model in machine learning.",
    "A neural network can have many layers."
] * 5000
# Initialize tokenizer and register MWE tuples
mwe_tokenizer = MWETokenizer([
    ('neural', 'network'),
    ('neural', 'networks'),
    ('decision', 'tree'),
    ('decision', 'trees'),
    ('machine', 'learning')
], separator="_")
cleaned_texts_mwe = []
for text in raw_texts:
    # Tokenize words using NLTK's standard tokenizer
    tokens = word_tokenize(text.lower())
    # Merge specified multi-word expressions
    merged_tokens = mwe_tokenizer.tokenize(tokens)
    cleaned_texts_mwe.append(merged_tokens)
print("Sample tokens:", cleaned_texts_mwe[0])
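The domain terms are merged just as before, but in a more elegant, linguistically precise, and scalable way. Note that the trailing period is now its own token and the plural form is preserved:
Sample tokens: ['we', 'are', 'studying', 'neural_networks', 'and', 'deep', 'learning', '.']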
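The tokenizer above receives its expressions up front. If your domain terms live in a plain list of phrases, one possible sketch is to register them one by one with `add_mwe`. The helper name `build_mwe_tokenizer` and the `DOMAIN_TERMS` list are hypothetical, and `wordpunct_tokenize` (a regex-based NLTK tokenizer) is used here only so the example runs without downloading any NLTK data.

```python
from nltk.tokenize import MWETokenizer, wordpunct_tokenize

# Hypothetical phrase list; plural variants are listed explicitly
DOMAIN_TERMS = ["neural network", "neural networks", "machine learning", "decision tree"]

def build_mwe_tokenizer(phrases, separator="_"):
    """Create an MWETokenizer and register each phrase as a tuple of tokens."""
    tokenizer = MWETokenizer(separator=separator)
    for phrase in phrases:
        tokenizer.add_mwe(tuple(phrase.lower().split()))
    return tokenizer

mwe = build_mwe_tokenizer(DOMAIN_TERMS)

# Tokenize first, then merge at the token level
tokens = wordpunct_tokenize("Neural networks beat the decision tree.".lower())
merged = mwe.tokenize(tokens)
print(merged)
# ['neural_networks', 'beat', 'the', 'decision_tree', '.']
```

Because matching happens on whole tokens, the final period never ends up glued to a merged term.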
Using the MWETokenizer shifts the work from low-level string matching to token-level matching.
- We define multi-word expressions as tuples of individual tokens: ('neural', 'network').
- By setting separator="_", the tokenizer merges each matching sequence into a single string token: "neural_network".
- Because it works directly on token lists, it is immune to word-boundary bugs and handles trailing punctuation correctly ("neural networks." is first split into "neural", "networks", ".", then safely merged into "neural_networks", "."). It runs fast and scales cleanly to hundreds of domain terms.
# 2. Context-Aware Lemmatization with POS-Tag Mapping
Lemmatization is the process of reducing a word to its base dictionary form (its lemma): "running" -> "run", "better" -> "good". This is an important normalization step, as it groups the inflected variants of a word together.
However, NLTK's WordNetLemmatizer defaults to treating every word as a noun. If you pass verbs or adjectives without specifying their POS class, the lemmatizer will usually return the word unchanged. For example:
- lemmatizer.lemmatize("running") returns "running" (instead of "run")
- lemmatizer.lemmatize("better") returns "better" (instead of "good")
To solve this, we must dynamically identify the grammatical role of each word in the sentence using NLTK's POS tagger, map these tags to the simplified WordNet categories (noun, verb, adjective, adverb), and pass them to the lemmatizer.
The naive method feeds words directly into the lemmatizer. It misses verb and adjective inflections, leaving the vocabulary needlessly fragmented:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # resource name used by word_tokenize in newer NLTK releases
nltk.download('wordnet', quiet=True)
sentence = "The feet of the running runners are getting better and faster."
tokens = word_tokenize(sentence.lower())
lemmatizer = WordNetLemmatizer()
# Naive lemmatization: every token is treated as a noun
naive_lemmas = [lemmatizer.lemmatize(token) for token in tokens]
print("Tokens: ", tokens)
print("Naive Lemmas:", naive_lemmas)
Output:
Tokens: ['the', 'feet', 'of', 'the', 'running', 'runners', 'are', 'getting', 'better', 'and', 'faster', '.']
Naive Lemmas: ['the', 'foot', 'of', 'the', 'running', 'runner', 'are', 'getting', 'better', 'and', 'faster', '.']
Now let's look at the improved version. We write a small helper function that maps the Penn Treebank tags returned by NLTK's pos_tag to WordNet POS constants, so that each word type is lemmatized correctly:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
# Download tokenizer, WordNet, and POS tagger resources
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # resource name used by word_tokenize in newer NLTK releases
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)  # resource name used by pos_tag in newer NLTK releases
sentence = "The feet of the running runners are getting better and faster."
tokens = word_tokenize(sentence.lower())
# Generate POS tags for each token
pos_tags = nltk.pos_tag(tokens)
# Map Penn Treebank tags to WordNet tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # No WordNet equivalent: let the lemmatizer fall back to its default noun handling
        return None
lemmatizer = WordNetLemmatizer()
# Lemmatize using the mapped POS tags
context_lemmas = []
for token, tag in pos_tags:
    wn_tag = get_wordnet_pos(tag)
    if wn_tag:
        lemma = lemmatizer.lemmatize(token, pos=wn_tag)
    else:
        lemma = lemmatizer.lemmatize(token)
    context_lemmas.append(lemma)
print("POS Tagged: ", pos_tags)
print("Context Lemmas:", context_lemmas)
Output:
POS Tagged: [('the', 'DT'), ('feet', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('running', 'NN'), ('runners', 'NNS'), ('are', 'VBP'), ('getting', 'VBG'), ('better', 'RBR'), ('and', 'CC'), ('faster', 'RBR'), ('.', '.')]
Context Lemmas: ['the', 'foot', 'of', 'the', 'running', 'runner', 'be', 'get', 'well', 'and', 'faster', '.']
- NLTK's pos_tag labels words using the Penn Treebank tagset (e.g. 'VBG' for a gerund verb, 'JJR' for a comparative adjective).
- Our helper function get_wordnet_pos() checks the first character of the tag. Following WordNet's POS categories, a tag starting with 'J' maps to the adjective constant (wordnet.ADJ), a tag starting with 'V' maps to the verb constant (wordnet.VERB), and so on.
- By supplying the correct POS tag to lemmatizer.lemmatize(token, pos=wn_tag), the lemmatizer resolves "are" to "be", "getting" to "get", and "better" to "well" (it is tagged here as a comparative adverb, so it maps to the adverb lemma rather than the adjective "good"). Tokens that the tagger labels as nouns, such as "running" in this sentence, are left unchanged, so the quality of the lemmas depends on the quality of the tags. Overall, this preserves the semantic context of the sentence while reducing the vocabulary size seen by downstream ML models.
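The first-character mapping can also be written as a lookup table. The sketch below is a hypothetical variant (the names `PENN_PREFIX_TO_WORDNET` and `to_wordnet_pos` are not from the code above) that falls back to the noun category, which mirrors the lemmatizer's default, so the result can always be passed straight to `lemmatize`. It uses the single-character values behind the WordNet constants, so it runs without any NLTK data.

```python
# WordNet POS constants are single characters: 'a' (adjective), 'v' (verb),
# 'n' (noun), 'r' (adverb). These are the values of wordnet.ADJ, wordnet.VERB, etc.
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(treebank_tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS character by its first letter."""
    return PENN_PREFIX_TO_WORDNET.get(treebank_tag[:1], default)

# A few (token, tag) pairs taken from the POS-tagged output shown above
tagged = [("feet", "NNS"), ("are", "VBP"), ("getting", "VBG"), ("better", "RBR"), ("the", "DT")]
mapped = [(token, to_wordnet_pos(tag)) for token, tag in tagged]
print(mapped)
# [('feet', 'n'), ('are', 'v'), ('getting', 'v'), ('better', 'r'), ('the', 'n')]
```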
# 3. Extracting Statistical Phrases with Collocation Finders
Extracting key phrases or multi-word concepts from text is important for topic modeling, search indexing, and sentiment analysis. These phrases are known as collocations: sequences of words that occur together more often than would be expected by chance.
A naive way to find collocations is to count all raw bigrams (sequences of two adjacent words) and sort them by frequency. However, this method produces very uninformative pairs. Because function words are so frequent, pairs such as "of the", "in the", and "in a" will almost always dominate the top results. Even after filtering out stop words, raw counting can favor coincidental pairings that just happen to repeat a few times.
A better solution is to use NLTK's BigramCollocationFinder combined with statistical association measures. Instead of ranking by raw frequency, we use measures such as Pointwise Mutual Information (PMI) or the chi-square statistic. These measures check whether two words appear together more often than they would by chance.
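As a quick sketch of the chi-square option, the snippet below ranks bigrams with `BigramAssocMeasures.chi_sq` via the finder's `nbest` method. The simple regex tokenization is an assumption made to keep the example free of NLTK data downloads; it drops punctuation, so the counts differ slightly from a `word_tokenize` run.

```python
import re
from nltk.collocations import BigramCollocationFinder
from nltk.metrics.association import BigramAssocMeasures

text = (
    "Natural language processing is an active field of AI. "
    "Machine learning plays a key role in natural language processing. "
    "Deep learning architectures have revolutionized natural language processing. "
    "We need machine learning models to solve these natural language tasks."
)

# Simple regex tokenization (assumption: letters only, punctuation dropped)
tokens = re.findall(r"[a-z]+", text.lower())

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep only bigrams seen at least twice

# Rank the surviving bigrams by the chi-square statistic
top_chi_sq = finder.nbest(BigramAssocMeasures.chi_sq, 3)
print(top_chi_sq)
```

On this small text only three bigrams survive the frequency filter, so the measure mainly decides their order.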
First, our naive method simply counts the raw bigrams and takes the top matches, picking up noise such as punctuation and function words:
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import bigrams
# Ensure NLTK tokenizer resources are downloaded
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # resource name used by word_tokenize in newer NLTK releases
# Sample corpus
corpus = """
Natural language processing is an active field of AI. Machine learning plays a key role
in natural language processing. Deep learning architectures have revolutionized natural
language processing. We need machine learning models to solve these natural language tasks.
"""
tokens = word_tokenize(corpus.lower())
# Extract and count raw bigrams
raw_bigrams = list(bigrams(tokens))
bigram_counts = Counter(raw_bigrams)
print("Top 5 Raw Bigrams:")
for bigram, freq in bigram_counts.most_common(5):
    print(f"{bigram}: {freq}")
Output:
Top 5 Raw Bigrams:
('natural', 'language'): 4
('language', 'processing'): 3
('machine', 'learning'): 2
('processing', '.'): 2
('processing', 'is'): 1
Here, we initialize NLTK's collocation finder, apply filter constraints, and use the BigramAssocMeasures class to score phrase associations with Pointwise Mutual Information (PMI):
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics.association import BigramAssocMeasures
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # resource name used by word_tokenize in newer NLTK releases
nltk.download('stopwords', quiet=True)
corpus = """
Natural language processing is an active field of AI. Machine learning plays a key role
in natural language processing. Deep learning architectures have revolutionized natural
language processing. We need machine learning models to solve these natural language tasks.
"""
tokens = word_tokenize(corpus.lower())
# Initialize the collocation finder
finder = BigramCollocationFinder.from_words(tokens)
# Filter out punctuation and stop words
stop_words = set(stopwords.words('english'))
def filter_stops(w):
    return w in stop_words or not w.isalnum()
finder.apply_word_filter(filter_stops)
# Filter out bigrams that occur fewer than 2 times
finder.apply_freq_filter(2)
# Score bigrams using pointwise mutual information
pmi_measures = BigramAssocMeasures()
top_collocations = finder.score_ngrams(pmi_measures.pmi)
print("Top Collocations by PMI:")
for bigram, pmi_score in top_collocations[:5]:
    # Build a clean printable representation
    phrase = " ".join(bigram)
    print(f"Phrase: {phrase:<30} | PMI Score: {pmi_score:.4f}")
Output:
Top Collocations by PMI:
Phrase: machine learning | PMI Score: 3.8074
Phrase: language processing | PMI Score: 3.3923
Phrase: natural language | PMI Score: 3.3923
- BigramCollocationFinder.from_words() extracts every pair of adjacent words while preserving their original order.
- We filter candidates using finder.apply_word_filter(), excluding bigrams that contain stop words or punctuation without changing the underlying word counts.
- By setting apply_freq_filter(2), we ignore chance combinations that occur only once, reducing statistical noise.
- Finally, pointwise mutual information is a statistical measure that compares the probability of two words occurring together with the probability of them occurring independently. This highlights tightly bound terms like "machine learning" and "natural language" while ignoring common, looser combinations.
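To see where the scores come from, the PMI arithmetic can be reproduced by hand. This is a sketch under the assumption that probabilities are estimated from raw counts over all 42 tokens of the sample corpus (hand-counted here, punctuation included), which matches the scores in the output above. The function name `pmi` and its parameters are illustrative.

```python
import math

def pmi(pair_count, left_count, right_count, total_tokens):
    """Pointwise mutual information (base 2) computed from raw counts."""
    p_pair = pair_count / total_tokens
    p_left = left_count / total_tokens
    p_right = right_count / total_tokens
    return math.log2(p_pair / (p_left * p_right))

TOTAL_TOKENS = 42  # tokens in the sample corpus, punctuation included (hand-counted)

# ("machine", "learning"): pair seen 2x, "machine" 2x, "learning" 3x
print(f"{pmi(2, 2, 3, TOTAL_TOKENS):.4f}")  # 3.8074
# ("natural", "language"): pair seen 4x, "natural" 4x, "language" 4x
print(f"{pmi(4, 4, 4, TOTAL_TOKENS):.4f}")  # 3.3923
```

"machine learning" scores highest even though it is the rarest of the three pairs, because "machine" never appears without "learning".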
# Wrapping up
Custom text preprocessing is key to extracting clean signals from raw text, and NLTK provides the structural tools needed to customize this functionality.
By combining these three NLTK methods, you can create a robust NLP workflow:
- Preserving domain terms with MWETokenizer merges multi-word expressions at the token level, preventing key concepts from being split apart during vectorization.
- Context-aware lemmatization pairs POS tags with a WordNet mapping to find linguistically accurate base forms, reducing vocabulary size.
- Statistical collocation extraction uses association measures such as PMI to separate genuine phrases from coincidental word pairs in raw corpus data, avoiding the noise of simple frequency counting.
Applying these structural patterns to your feature engineering process helps ensure that downstream classification, search, and clustering algorithms receive high-quality, statistically robust tokens.
Matthew Mayo (@mattmayo13) has a master's degree in computer science and a diploma in data mining. As managing editor of KDnuggets & Statology, and contributing editor to Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.



