From TF-IDF to Transformers: Implementing Four Generations of Semantic Search

“Beauty will save the world”— Fyodor Dostoevsky
A. Introduction
Semantic search did not emerge overnight. Today’s transformer-based systems can feel almost magical, capable of capturing context and even subtle relationships between ideas. But these systems developed gradually. Before embeddings, transformers, and large language models, researchers used keyword matching, TF–IDF vectors, and traditional machine learning methods to analyze text.
Many of those earlier ideas never truly disappeared. In fact, modern systems still build on concepts developed decades ago. The field evolved layer by layer, with each generation solving some problems while exposing new ones.
Understanding that evolution is important. In machine learning, as in science generally, knowing where we came from often helps us understand where we are heading. The history of semantic search is also the story of an important shift in AI itself: from transparent, human-designed systems to increasingly intelligent models whose internal reasoning is much more difficult to interpret. In that way, we move from explicit retrieval rules and manually engineered features to systems capable of learning abstract representations of meaning directly from data.
In this article, we will explore that progression through a concrete example: comparing a student’s art critique with critiques written by experts about the same painting. Instead of jumping immediately into embeddings and transformers, we will build a sequence of increasingly sophisticated retrieval systems, examining both their strengths and their limitations.
We will cover four major stages in the evolution of semantic search:
- Method 1: Handcrafted Retrieval Features + TF–IDF
A transparent ranking system combining TF–IDF cosine similarity with interpretable features such as keyword overlap, critique length normalization, and recency weighting.
- Method 2: Classical Machine Learning for Semantic Ranking
Using TF–IDF feature vectors together with supervised learning models such as Logistic Regression to learn ranking behavior from labeled examples.
- Method 3: Embedding-Based Semantic Search
Replacing sparse lexical representations with dense semantic embeddings generated by Sentence Transformers.
- Method 4: Transformer Fine-Tuning
Fine-tuning pretrained transformer architectures such as BERT to directly model semantic relationships between critiques.
Figure 1 below shows the evolution of semantic search methods.
By the end, we will have built a series of increasingly capable semantic search pipelines. Along the way, we will see how the field itself evolved, from systems driven largely by human-designed features to models that learn meaning directly from data.
B. Data
To keep the focus on semantic search rather than dataset engineering, we will use a small synthetic dataset of art critiques. The dataset was intentionally designed to mimic realistic differences in vocabulary, writing style, interpretation, and analytical depth among critics discussing the same painting.
Each critique contains both metadata and free-form text. Our task throughout the article will be to compare a new student’s critique with expert critiques of the same painting and to determine semantic similarity using progressively more advanced retrieval methods.
The structure of each critique is represented using a simple Python dataclass:
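from dataclasses import dataclass
from datetime import datetime


@dataclass
class Critique:
    critique_id: str        # unique identifier of the critique
    painting_id: str        # painting the critique discusses
    critic_name: str
    title: str
    text: str               # main critique content used for semantic analysis
    published_at: datetime  # publication date, used for recency weighting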
The text field above contains the main critique content used for semantic analysis, while fields such as painting_id, critic_name, and published_at provide metadata that can support filtering, grouping, or ranking experiments.
A typical critique might look like this:
Critique(
    critique_id="c102",
    painting_id="starry_night",
    critic_name="Dr. Elaine Foster",
    title="Emotion Through Motion",
    text="""
    Van Gogh transforms the night sky into a structure that seems alive.
    The swirling brushstrokes generate tension on the soul while the
    exaggerated brightness of the stars creates a dreamlike atmosphere.
    """,
    published_at=datetime(2021, 5, 12)
)
Although synthetic, the dataset is rich enough to demonstrate the central ideas behind semantic retrieval systems — from simple keyword-based similarity to transformer-based representations of meaning.
Please note that the code for all four methods is available on GitHub. The exact directory is shown at the end of the article.
C. Methods
C.1 Method 1-Rule-Based Retrieval and TF–IDF Ranking
We begin with one of the most classical and interpretable approaches to semantic search: combining TF–IDF ranking with a small set of handcrafted retrieval features. Although simple compared to modern deep learning systems, this approach captures many of the core ideas behind document retrieval and similarity scoring. At this stage, the system does not truly “understand” language. Instead, it identifies patterns in word usage and combines them with manually designed scoring heuristics.
The foundation of the method is TF–IDF (Term Frequency–Inverse Document Frequency), a classic technique for converting text into numerical vectors. TF–IDF increases the importance of words that appear frequently within a document but remain relatively uncommon across the larger collection. Common words such as “the” or “painting” receive very little weight, while more distinctive terms such as “composition,” “contrast,” or “symbolism” become more influential.
After fitting the TF–IDF vectorizer on the expert critiques, the system produces a sparse document-term matrix stored in self.matrix. Each row corresponds to a critique, each column corresponds to a learned term or phrase, and the numerical values represent TF–IDF weights.
Once the critiques have been vectorized, cosine similarity can be used to measure document similarity. Cosine similarity is the cosine of the angle between two vectors in high-dimensional space, so it depends on their direction rather than their length. When two critiques use similar vocabulary in similar proportions, they produce vectors pointing in similar directions and therefore receive higher similarity scores.
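To make the mechanics concrete, here is a minimal sketch of TF–IDF weighting and cosine similarity written with the Python standard library only. It is not the code used in the article, which relies on a fitted vectorizer and self.matrix. The helper names (tokenize, tfidf_vectors, cosine_similarity) and the three toy sentences are assumptions made for illustration, and for simplicity the sketch computes document frequencies over all three texts at once rather than fitting on the expert critiques alone.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text, split on whitespace, and strip basic punctuation."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: terms found in fewer documents get a larger weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy texts (hypothetical, not taken from the article's dataset).
docs = [
    "soft light and quiet stillness",
    "soft light creates a quiet mood",
    "historical symbolism and religious allegory",
]
vecs = tfidf_vectors(docs)
sim_close = cosine_similarity(vecs[0], vecs[1])  # shares "soft", "light", "quiet"
sim_far = cosine_similarity(vecs[0], vecs[2])    # shares only "and"
```

The first pair scores higher than the second because it shares more distinctive terms, which is exactly the lexical behavior described above.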
In practice, however, TF–IDF similarity alone is often not enough. Two critiques may describe similar artistic ideas with very different wording, while others may appear artificially similar simply because they share technical terminology. To improve retrieval quality, we combine TF–IDF similarity with several additional heuristic features.
The heuristic scoring system includes:
- Keyword overlap — measures how many important words are shared between critiques
- Length normalization — rewards critiques that contain a meaningful level of descriptive detail without favoring excessively long text
- Recency weighting — gently favors newer critiques using exponential temporal decay
The final ranking score is computed as:
(Equation 1)
Each feature is intentionally constrained between 0 and 1. We still apply clipping as a simple safety check:
np.clip(value, 0.0, 1.0)
In our case, clipping works well because the features are already naturally bounded. In larger production systems, however, features with wider numerical ranges, such as popularity statistics or citation counts, would typically require normalization instead.
The length normalization feature rewards critiques that provide sufficient descriptive detail. If the target length is 250 words, the score becomes:
$$\text{length\_score} = \min\left(1,\ \frac{\text{word\_count}}{250}\right) \qquad \text{(Equation 2)}$$
For example, a critique with 125 words receives a score of 0.5. Critiques with 250 words or more receive the maximum score of 1.0.
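The length rule described above can be sketched in a few lines. The function name length_score and the whitespace-based word count are assumptions; the article's own implementation may count words differently.

```python
def length_score(text: str, target_words: int = 250) -> float:
    """Grow linearly with word count and saturate at 1.0 once the target is reached."""
    word_count = len(text.split())
    return min(1.0, word_count / target_words)


# Synthetic texts with a known number of words.
half_length = length_score(" ".join(["word"] * 125))
full_length = length_score(" ".join(["word"] * 250))
very_long = length_score(" ".join(["word"] * 400))
```

Capping the score at 1.0 is what prevents very long critiques from being favored over critiques that simply reach the target length.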
The recency feature introduces a preference for newer critiques, but it still allows older reviews to stay relevant:
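$$\text{recency\_score} = 0.5^{\,\text{age} / \text{half\_life}} \qquad \text{(Equation 3)}$$
Here, age is the time elapsed since the critique was published and half_life is the decay half-life, both measured in years.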
Using a half-life of roughly 10 years:
- A critique written today receives a score close to 1.0
- A critique written 10 years ago receives approximately 0.5
- A critique written 20 years ago receives approximately 0.25
This creates a smooth notion of “freshness” similar to strategies historically used in search engines and recommendation systems.
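A minimal sketch of this half-life decay follows. The function name recency_score, the explicit now argument (passed in so the result is reproducible), and the 365.25-day year are assumptions made for the example.

```python
from datetime import datetime


def recency_score(published_at: datetime, now: datetime,
                  half_life_years: float = 10.0) -> float:
    """Exponential decay: the score halves every `half_life_years`."""
    age_years = max(0.0, (now - published_at).days / 365.25)
    return 0.5 ** (age_years / half_life_years)


reference_date = datetime(2025, 1, 1)  # arbitrary "today" for the example
score_today = recency_score(datetime(2025, 1, 1), reference_date)
score_10_years = recency_score(datetime(2015, 1, 1), reference_date)
score_20_years = recency_score(datetime(2005, 1, 1), reference_date)
```

The score never reaches zero, which is why older critiques can still rank well when their other features are strong.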
One of the biggest strengths of this approach is interpretability. Every part of the ranking process is visible and understandable. We can inspect exactly why one critique ranked above another simply by examining the contribution of each feature.
To test the method, we construct a small synthetic dataset of expert critiques discussing the same painting. We then submit a new student critique and ask the system to retrieve the most similar expert analyses. The new student critique is:
student_critique_text = """
The painting creates a quiet emotional atmosphere, yet very powerful.
The soft light and restrained color palette
make the central figure feel isolated yet dignified. The background
does not compete with the subject; instead, it deepens the mood of
reflection and stillness. Overall, the work feels intimate,
psychological, and carefully composed.
"""
At the end, the program computes a similarity score between the student critique and the expert critiques, as shown below in Table 1.
| CRITIQUE TITLE | EXPERT NAME | SCORE |
|---|---|---|
| Light and Stillness | Expert A | 0.531 |
| Psychological Interior | Expert D | 0.297 |
| Narrative and Gesture | Expert E | 0.224 |
| Color and Surface | Expert B | 0.212 |
| Historical Symbolism | Expert C | 0.096 |
The ranking makes sense. The student critique emphasizes soft lighting, emotional restraint, and psychological atmosphere. These themes overlap strongly with the language used in the two top-ranked expert critiques, Light and Stillness and Psychological Interior. Critiques focused primarily on symbolism, technical brushwork, or historical interpretation received lower scores because they shared fewer lexical and heuristic similarities.
At the same time, the limitations of TF–IDF are already becoming visible. The method primarily captures surface-level vocabulary patterns rather than deeper semantic meaning. For example, phrases such as “dramatic use of light” and “strong chiaroscuro effects” may refer to very similar artistic ideas while sharing few exact words. Classical retrieval systems often struggle in these situations because they depend heavily on lexical overlap.
These limitations motivate the next stage in the evolution of semantic search: machine learning models that learn ranking behavior directly from data rather than relying mainly on manually engineered scoring rules.
C.2 Method 2-Classical Machine Learning with TF-IDF Features
The next evolutionary step in semantic search replaces manually designed scoring rules with supervised machine learning. Instead of explicitly deciding how much importance to assign to TF-IDF similarity, keyword overlap, or other heuristic features, we allow a model to learn useful patterns directly from labeled examples.
For this method, we use a different collection of painting critiques than the one introduced in the previous method. In this dataset, some critiques are labeled as “expert-like,” while others are labeled as more novice or beginner-level analyses. Rather than ranking critiques by similarity, the goal here is to train a classifier that can predict whether a critique resembles expert analysis.
As before, the first thing we do is TF-IDF vectorization. Each critique is converted into a high-dimensional numerical vector whose values represent the importance of words and phrases within the document collection. However, instead of comparing vectors directly using cosine similarity, we feed these TF-IDF features into a supervised learning model such as Logistic Regression.
Logistic Regression is one of the classic machine learning methods for classification. Instead of using manually designed rules, the model learns patterns directly from examples. It learns which words and writing styles are more common in expert critiques and then uses these patterns to evaluate new critiques automatically. This is an important shift because the system now learns from data rather than relying on hand-crafted rules.
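The code snippet below shows the pipeline, which chains a TfidfVectorizer with a LogisticRegression classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),    # use single words and two-word phrases
        lowercase=True,
        min_df=1,              # keep terms that appear in at least one document
        stop_words="english",  # drop common English stop words
    )),
    ("classifier", LogisticRegression()),
])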
After training, the model can analyze a new student critique and produce both:
- a predicted class label
- a probability score indicating how likely the critique is to be expert-like
A probability close to 1 indicates strong similarity to expert critiques, while a probability near 0 suggests more novice-level writing. By default, probabilities greater than or equal to 0.5 are assigned label 1 (“expert-like”), while probabilities below 0.5 are assigned label 0. Our new critique received a label of 1 and had a probability of 0.672.
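The step from learned coefficients to a probability and a label can be sketched as follows. This is a hand-written illustration of the logistic function and the 0.5 threshold rule stated above, not the scikit-learn internals. The coefficients, intercept, and feature values are hypothetical and are not meant to reproduce the 0.672 reported for the student critique.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features: dict[str, float], coefficients: dict[str, float],
            intercept: float, threshold: float = 0.5) -> tuple[int, float]:
    """Return (label, probability of the expert-like class)."""
    # Weighted sum of feature values; unseen terms contribute nothing.
    z = intercept + sum(
        coefficients.get(term, 0.0) * value for term, value in features.items()
    )
    probability = sigmoid(z)
    label = 1 if probability >= threshold else 0
    return label, probability


# Hypothetical coefficients: positive pushes toward "expert-like", negative away.
toy_coefficients = {"depth": 1.2, "beautiful": -0.8}

label_a, prob_a = predict({"depth": 0.5, "beautiful": 0.2}, toy_coefficients, 0.0)
label_b, prob_b = predict({"beautiful": 1.0}, toy_coefficients, 0.0)
```

Each coefficient acts as a vote whose strength is scaled by the TF-IDF value of its term, which is what makes the model's decisions easy to inspect.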
One of the most interesting aspects of Logistic Regression is interpretability. Because the model learns numerical coefficients for each TF-IDF feature, we can directly inspect which words and phrases influence the classification decisions.
In this experiment, the classifier gave higher weights to terms like “placement,” “emotional,” “depth,” “psychological,” “intensity,” and “shadow.” When we read the critiques themselves, this outcome feels reasonable because these expressions usually appear in expert-like critiques that discuss structure, symbolism, interpretation, or spatial organization in more detail. By comparison, phrases such as “beautiful,” “artist wanted,” and “think” received lower weights. These phrases are more common in novice-like critiques, which focus on general impressions rather than detailed analysis. After training, we can inspect the learned coefficients and see which words influenced the predictions.
| FEATURE | LOGISTIC REGRESSION COEFFICIENT |
|---|---|
| emotional | 0.150719 |
| placement | 0.148277 |
| depth | 0.146912 |
| contrast | 0.146912 |
At the same time, we should be careful not to overstate what the model is doing. The model is not actually interpreting the artwork or appreciating its symbolism the way a human expert would. It is only identifying patterns in the language used in the critiques. If experts consistently use terms such as “depth” and “psychological tension,” the model learns that these patterns correlate with expert-level writing.
This limitation becomes easier to see when two critiques express similar ideas using very different wording. Logistic Regression works best when similar ideas are expressed with similar words. If the vocabulary changes too much, the model can miss the connection between the critiques. This problem led researchers toward embedding-based methods that try to capture meaning instead of just matching words.
C.3 Method 3-Embedding-Based Semantic Search
The next major step in semantic search goes beyond TF–IDF and simple word counting. Instead of representing text as word frequencies, modern systems use dense semantic embeddings generated by transformer-based language models.
This is the stage where the system starts moving beyond simple vocabulary and begins capturing actual meaning. Two critiques can use very different language to describe an artistic idea, and yet they are still recognized as similar.
To create the embeddings, we use a Sentence Transformer model from the Hugging Face ecosystem. Sentence Transformers transform entire sentences or documents into dense numerical vectors. These vectors are designed to capture the meaning of the text and the relationships between different pieces of writing.
For example, phrases such as:
- “dramatic use of light”
- “careful illumination”
- “strong chiaroscuro effects”
look very different, but they express closely related artistic ideas. Unlike TF-IDF, embedding models can often recognize these semantic relationships. Unlike the Logistic Regression model from Method 2, the embedding model does not assign explicit coefficients to individual words such as “contrast” or “psychological.” Instead, semantic information becomes distributed across many dimensions of the embedding space. This makes the representations harder to interpret directly, but also much more flexible semantically.
For Method 3, we introduce a new set of critiques designed to find semantic similarity at a deeper level. Some critiques use highly technical language, while others describe similar artistic ideas in a more natural or indirect way. This creates a more difficult retrieval problem because critiques may express related concepts without sharing many of the same keywords.
After generating embeddings for all critiques, we compute cosine similarity directly in the embedding space. Each critique embedding generated by the Sentence Transformer is represented as a dense numerical vector of 384 dimensions, corresponding to the number of learned features.
Similarity is computed in two ways: (a) between every student critique and every expert critique, and (b) between each student critique and an expert centroid (Table 2). The centroid vector is computed by averaging the corresponding components of all expert critique embeddings, so it also has 384 dimensions. Conceptually, this centroid represents the approximate semantic “center” of expert-level critiques and can be used to measure how closely a student critique aligns with expert writing in embedding space.
| STUDENT CRITIQUE NAME AND TITLE | EXPERT CENTROID-LIKENESS SCORE |
|---|---|
| S1-Drama Through Light and Response | 0.802 |
| S4-Emotional Response | 0.618 |
| S5-Formal Analysis Attempt | 0.765 |
| S6-General Impression | 0.75 |
| S7-Symbolic Interpretation | 0.73 |
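The centroid comparison behind these scores can be sketched with plain Python lists. The three-dimensional vectors below are made-up stand-ins for the 384-dimensional Sentence Transformer embeddings, and the function names centroid and cosine are assumptions for this illustration.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Average the vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy embeddings (hypothetical values, 3 dimensions instead of 384).
expert_embeddings = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
expert_centroid = centroid(expert_embeddings)

student_near = [0.8, 0.2, 0.1]  # points in roughly the same direction as the experts
student_far = [0.0, 0.2, 0.9]   # points in a very different direction

score_near = cosine(student_near, expert_centroid)
score_far = cosine(student_far, expert_centroid)
```

A single centroid summarizes all experts in one vector, which is convenient but can hide cases where a student is very close to one particular expert and far from the others.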
To understand the embedding space, we also visualize the embeddings using PCA (Figure 2). PCA projects the 384-dimensional embeddings onto the two directions of greatest variance, so the plot preserves as much of the overall structure as two dimensions allow, although some detail is inevitably lost.

The PCA plot reveals several interesting relationships. Student Critique S1 appears close to Expert Critiques E1 and E2. This makes sense because they discuss similar ideas such as light, shadow, mood, and dramatic meaning.
Student Critique S7 also appears close to Expert Critique E3. Both critiques discuss symbolism, emotion, and deeper meaning in the painting. Even though they use different words, they express similar ideas.
The PCA plot also shows that student and expert critiques are not separated into perfectly isolated clusters. Some student critiques appear surprisingly close to expert critiques, especially when they discuss similar artistic concepts. At the same time, weaker or more generic critiques tend to appear farther away from the expert region of the embedding space.
The Expert-Likeness Scores (Table 2) also agree with the PCA plot. S1 has the highest score (0.802) and appears close to expert critiques E1 and E2. This suggests that S1 is most similar to the expert critiques. S5 (0.765) and S6 (0.75) also have fairly high scores. In the plot, they appear close to each other and somewhat close to the expert critiques.
S7 has a moderate score (0.73), but it appears very close to E3. Both critiques discuss symbolism, emotion, and deeper meaning. S4 has the lowest score (0.618). In the plot, it also appears farther away from the expert critiques. This critique focuses more on personal feelings than on detailed artistic analysis.
At this stage, despite the move from simple keyword matching to representations of meaning, the embedding model itself stays fixed: its weights are never updated on our critiques. The next stage fine-tunes a transformer on labeled critiques so that the model adapts directly to the task.
C.4 Method 4-Fine-Tuned Transformer Models
The final stage introduces fine-tuned transformer models. In Method 3, we used a Sentence Transformer to compare critiques based on semantic similarity. Here, we go a step further by training the model directly on labeled expert and novice critiques.
Specifically, we fine-tune a pretrained DistilBERT model from the Hugging Face Transformers library. DistilBERT is a smaller and faster version of BERT. It was trained to learn many of the same language patterns as the original BERT model while using fewer parameters. DistilBERT was created through a process known as knowledge distillation. Even though it is lighter and easier to train, it still performs very well on many NLP tasks.
In Method 4, instead of learning the language from scratch, the model (DistilBERT) starts with knowledge from large amounts of text and then adapts to our critique-classification task. This process is called transfer learning. Transformers also use attention mechanisms that help the model understand relationships between words in a sentence.
The training pipeline involves:
- tokenizing critiques into transformer-compatible inputs
- fine-tuning the pretrained model on labeled critiques
- generating class probabilities for each critique
Let us discuss the code snippet from Method 4, shown below.
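from transformers import AutoTokenizer

# Load the tokenizer that matches the pretrained checkpoint
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


# Tokenize one example: truncate or pad every critique to 128 tokens
def tokenize_function(example):
    return tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )


# `dataset` is the labeled critique dataset, defined elsewhere in the full script
tokenized_dataset = dataset.map(tokenize_function)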
The tokenizer created with AutoTokenizer.from_pretrained() is used inside tokenize_function() through the line tokenizer(example["text"], ...).
In transformer-based NLP, the tokenizer is not simply a tokenizer. It performs several preprocessing steps at once:
- it splits the text into tokens
- converts the tokens into numerical token IDs using the model’s vocabulary
- adds special transformer tokens
- truncates long sequences
- pads shorter sequences to a fixed length
- creates attention masks
The resulting numerical representation is what the transformer model later uses as input for training and prediction.
The argument truncation=True ensures that critiques longer than max_length are cut to that length. The argument padding="max_length" pads shorter critiques with a padding token (ID 0 in this model’s vocabulary) so that all input sequences have the same fixed length (128 tokens). Finally, dataset.map(tokenize_function) applies this tokenization process to every example in the dataset, producing a transformer-ready dataset for training.
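To show what truncation, padding, and attention masks do to a sequence, here is a toy re-implementation that works on token IDs that have already been looked up. It is only an illustration and not the Hugging Face tokenizer. The function name encode is hypothetical, and the special-token IDs (101 for the start token, 102 for the end token, 0 for padding) are treated here as illustrative assumptions.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token IDs for this sketch


def encode(token_ids: list[int], max_length: int = 128) -> dict[str, list[int]]:
    """Add special tokens, truncate, pad to max_length, and build the attention mask."""
    # Leave room for the two special tokens, then truncate the content.
    content = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + content + [SEP_ID]
    # Real tokens get mask value 1; padding positions get 0 so the model ignores them.
    attention_mask = [1] * len(input_ids)
    n_padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_padding,
        "attention_mask": attention_mask + [0] * n_padding,
    }


short = encode([5, 6, 7], max_length=6)            # shorter than the limit: padded
long = encode(list(range(1, 11)), max_length=6)    # longer than the limit: truncated
```

With max_length=6, the short input becomes [101, 5, 6, 7, 102, 0] with mask [1, 1, 1, 1, 1, 0], while the long input is cut so that the end token still fits.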
Unlike the embedding-based approach of Method 3, this method performs explicit supervised classification. For each critique, the model predicts both:
- a class label
- a confidence score for each class
For example, consider the following critique:
“The arrangement of the figures and the careful use of shadow create psychological tension and symbolic ambiguity throughout the composition.”
At first glance, this critique sounds relatively sophisticated because it uses advanced artistic language, such as:
- “psychological tension”
- “symbolic ambiguity”
- “composition”
A simpler method, such as TF–IDF, might heavily reward these keywords because they frequently appear in expert critiques. In other words, TF–IDF mainly notices that the critique contains important vocabulary associated with art analysis.
However, the transformer model looks beyond isolated keywords. It analyzes how ideas are connected across the sentence and whether the critique shows deeper reasoning. Although the critique uses sophisticated terms, the analysis is brief and somewhat general. It discusses psychological tension and symbolism, but it does not explain them in much detail. Compared with the expert critiques, its reasoning is less developed.
After fine-tuning for 100 epochs, the transformer correctly classified the critique as novice-like:
Predicted label: 0
Confidence: 0.685
Probability novice-like: 0.685
Probability expert-like: 0.315
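Class probabilities like the ones above are typically obtained by applying a softmax to the model's two output scores (logits). The sketch below shows that step in isolation. The logit values are hypothetical and are not the actual outputs of the fine-tuned model.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits in the order [novice-like, expert-like].
logits = [0.4, -0.4]
probabilities = softmax(logits)

# The predicted label is the index of the largest probability,
# and the confidence is that probability.
predicted_label = max(range(len(probabilities)), key=probabilities.__getitem__)
confidence = probabilities[predicted_label]
```

Because the two probabilities always sum to 1, the reported confidence equals the probability of the predicted class, as in the output shown above.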
It is interesting to note that, when the model was trained for only 30 epochs, the same critique was classified as expert-like. This suggests that earlier in training, the model may have relied more heavily on fancy vocabulary. Additional training helped it place greater emphasis on broader contextual and analytical patterns rather than keywords alone.
It is important to note one of the main challenges of transformer fine-tuning: transformers usually require large amounts of training data. Our educational dataset contains only a small number of critiques. Because transformer models contain millions of trainable parameters, they generally need much larger datasets to generalize reliably.
As training continues over many epochs, the model becomes increasingly confident in its predictions. With a small dataset, however, some of this confidence may reflect memorization of stylistic patterns seen during training rather than genuine language understanding. This phenomenon is known as overfitting and is especially common when large transformer models are trained on limited data.
This example highlights both the strengths and limitations of transformer models. They can capture meaning beyond simple keyword matching, but they can also become overly confident when training data is scarce.
This final stage completes the progression from:
- transparent heuristic scoring
- classical machine learning
- semantic embeddings
- contextual transformer-based language understanding
Together, these four methods illustrate the broader evolution of semantic search and modern NLP: from manually engineered features toward increasingly sophisticated learned representations of meaning and context.
D. Discussion
The four methods in this article show how semantic search has evolved from simple keyword matching to contextual language understanding.
The first method, TF-IDF with rule-based scoring, was simple and highly interpretable. We could easily see why one critique ranked higher than another. However, the method depended heavily on exact word usage and often missed the deeper meaning.
The second method used Logistic Regression on TF-IDF features. Instead of manually defining rules, the model learned patterns from labeled critiques. By examining the learned coefficients, we can see which words are more common in expert critiques and which are more common in novice critiques. Logistic Regression learns these patterns from the TF-IDF word vectors. As we discussed, the model does not truly understand context or meaning. Despite that, it can still perform surprisingly well when certain words or phrases strongly correlate with particular writing styles.
The third method introduced embeddings through Sentence Transformers. This was a major shift because critiques could now be compared based on semantic meaning rather than exact vocabulary. Critiques discussing similar artistic ideas often appeared close together in embedding space, even when different wording was used.
An important observation from Method 3 was that critique quality is not always clear-cut. Some student critiques appeared semantically close to expert critiques despite still being labeled as novice-like. In this method, the Sentence Transformer acts primarily as a pretrained semantic embedding model. We do not retrain the transformer itself. Instead, each critique is converted into a dense semantic vector, and similarity is measured using cosine similarity in embedding space.
Finally, in Method 4, we presented the fine-tuned transformer model. This model introduced contextual language understanding through DistilBERT. Both Method 2 and Method 4 are supervised learning approaches because they learn from labeled examples. However, they learn very differently. Logistic Regression operates on fixed TF-IDF features, computed from word and phrase frequencies. On the other hand, transformers learn contextual representations by analyzing relationships among words, sentence structure, and meaning.
An important distinction is that although both Method 3 and Method 4 use transformer architectures, they use them in different ways. In Method 3, the transformer is used mainly as a pretrained embedding generator for semantic similarity. In Method 4, the transformer itself is fine-tuned directly on the labeled critique dataset. During training, the model updates its internal weights in order to learn how to distinguish expert-like critiques from novice critiques. Rather than serving mainly as a feature extractor, the transformer itself becomes the classifier. This represents an important conceptual shift from semantic similarity matching to supervised task-specific learning.
The experiments also showed one of the main challenges of transformer fine-tuning: large models usually need much more training data. When the dataset is small, the model can memorize the training examples too closely and may not be able to generalize well to new data.
Overall, walking through the methods in order shows that different NLP models represent meaning in different ways. Specifically, TF-IDF focuses mainly on important words, embedding models focus on semantic similarity, and transformers try to understand language through context and relationships between words.
E. Conclusion
In this article, we explored four practical approaches to semantic search, moving from classical TF-IDF retrieval to modern transformer models. Using the example of student and expert painting critiques, we examined how different NLP methods represent language and measure similarity.
The experiments showed that each method has strengths and limitations. Classical methods remain simple, fast, and interpretable. Embedding models capture semantic similarity effectively even with smaller datasets. Transformers provide deeper contextual understanding but typically require more labeled data to generalize reliably.
One of the most important observations was that semantic understanding exists on a continuum. Some student critiques were similar to expert critiques, even if they were not fully expert-level.
Modern NLP systems are becoming better at understanding meaning, context, and relationships between ideas. However, the main goal remains the same: helping machines better understand human language.
The code for the methods described above can be found at:
The synthetic data (critiques) can be found inside the code.
Note: All figures and plots were created by the author.
Thank you for reading!



