Machine Learning “Advent Calendar” Day 22: Embeddings in Excel

In this article of the series, we will talk about deep learning.
And when people talk about deep learning, we immediately think of these architectural images of deep neural networks, with many layers, neurons, and parameters.
Actually, the real change introduced by deep learning is elsewhere.
It's about learning data representations.
In this article, we focus on text embeddings, describe their role in the machine learning landscape, and show how they can be understood and evaluated in Excel.
1. Classical Machine Learning vs. Deep Learning
In this part, we will discuss why embeddings were introduced.
1.1 Where does deep learning come in?
To understand embedding, we first need to define a deep learning environment.
We will use the term classical machine learning to describe methods that do not rely on deep architectures.
All the previous articles cover classical machine learning, which can be organized in two parallel ways.
Learning paradigms
- Supervised learning
- Unsupervised learning
Model families
- Distance-based models
- Tree-based models
- Weight-based models
Throughout this series, we have already learned the learning algorithms behind these models. In particular, we have seen that gradient descent applies to all weight-based models, from linear regression to neural networks.
Deep learning is often reduced to multi-layered neural networks.
But this explanation is not complete.
From a developmental perspective, deep learning does not introduce a new law of learning.
So what does it introduce?
1.2 Deep learning as learning of data representation
Deep learning is about how features are created.
Instead of designing features manually, deep learning learns representations automatically, often through many successive transformations.
This also raises an important conceptual question:
Where is the boundary between feature engineering and model learning?
Some examples make this clear:
- Polynomial regression is still a linear model, but the features are polynomial
- Kernel methods project data into a high-dimensional feature space
- Density-based methods fully transform the data before learning
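The first bullet can be made concrete with a small sketch. The data, noise level, and coefficients below are invented for illustration: we engineer x and x² as features by hand, then fit an ordinary linear least-squares model on them, showing that polynomial regression is "still a linear model".

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data generated from y = 1 + 2x + 3x^2 plus small noise.
x = rng.uniform(-2, 2, size=50)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.1, size=50)

# Manual feature engineering: constant, x, x^2.
X = np.column_stack([np.ones_like(x), x, x**2])

# A plain linear fit on the engineered features recovers the coefficients.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The model itself stays linear; all the non-linearity lives in the hand-designed features. Deep learning moves exactly this step inside the model.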
Deep learning continues this idea, but takes it much further.
In this view, deep learning is about:
- representation learning, as its feature engineering philosophy
- weight-based models, as its model family
1.3 Images and convolutional neural networks
Images are represented as pixels.
From a technical point of view, image data is already numerical and structured: a grid of numbers. However, the information contained in these pixels is not structured in a way that classical models can easily use.
Pixels do not explicitly encode: edges, shapes, textures, or objects.
Convolutional Neural Networks (CNNs) are designed to extract the information hidden in pixels. They use filters to detect local patterns, and gradually combine them into high-level representations.
In a previous article, I showed how CNNs can be implemented in Excel to make this process transparent.
For images, the challenge is not to turn the data into numbers, but to build meaningful representations from data that is already numerical.
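The idea that filters make pixel information explicit can be sketched in a few lines. The image and filter below are invented toy values: a hand-designed vertical-edge filter is slid over a tiny image, which is the operation a CNN performs, except that a CNN learns the filter weights instead of having them designed by hand.

```python
import numpy as np

# A tiny 6x6 "image": a bright vertical bar on a dark background.
image = np.zeros((6, 6))
image[:, 2:4] = 1.0

# A hand-designed vertical-edge filter. CNNs learn filters like this
# automatically instead of relying on manual design.
kernel = np.array([[1.0, 0.0, -1.0]] * 3)

def convolve2d(img, k):
    """Valid cross-correlation: slide the filter over the image."""
    kh, kw = k.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)
    return out

response = convolve2d(image, kernel)
# The filter responds with opposite signs on the two edges of the bar:
# edge information, implicit in the pixels, is now explicit.
```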
1.4 Text data: a different problem
Text presents a very different challenge.
Unlike images, text is not numerical in nature.
Before modeling context or order, the first problem is very basic:
How do we represent words in terms of numbers?
Creating a numerical representation of the text is the first step.
In deep learning for text, this step is handled by embeddings.
Embeddings convert discrete symbols (words) into vectors that models can work with. Once the embeddings are in place, we can then model context, order, and relationships between words.
In this article, we focus on this first and most important step:
how embeddings create numerical representations of text, and how this process can be examined in Excel.
2. Two ways to learn text embedding
In this article, we will use the IMDB movie review dataset to illustrate both approaches. The dataset is distributed under the Apache License 2.0.
There are two main ways to learn text embeddings, and we'll do both with this dataset:
- supervised: we will train an embedding to predict the sentiment
- unsupervised (self-supervised): we will use the word2vec algorithm
In both cases, the goal is the same:
converting words into numeric vectors that can be used by machine learning models.
Before comparing the two approaches, we first need to define what an embedding is and how it relates to classical machine learning.

2.1 Embedding and classical machine learning
In classical machine learning, categorical data is usually handled with:
- label encoding, which gives compact numbers but introduces an artificial order
- one-hot encoding, which removes order but produces high-dimensional sparse vectors
Which encoding is appropriate depends on the model family.
Distance-based models cannot effectively use one-hot encoding, because all categories end up equidistant. Label encoding only works if we can assign meaningful numerical values to the categories, which is rarely the case.
Weight-based models can use one-hot encoding, because the model learns a weight for each category. In contrast, with label encoding, the numerical values are fixed and cannot be adjusted to represent meaningful relationships.
Tree-based models use variables through splits rather than numerical magnitudes, making label encoding acceptable in practice. However, many implementations, including scikit-learn, still require numerical input. As a result, categories must be converted to numbers, either by label encoding or one-hot encoding. If the numerical values carried semantic meaning, that would be beneficial as well.
Overall, this highlights the limitation of the old methods:
category values are fixed and not learned.
Embeddings extend this idea by learning the representation itself.
Each word is associated with a trainable vector, turning the categorical representation into a learning problem instead of a preprocessing step.
2.2 Supervised embedding
In supervised learning, the embedding is learned as part of the prediction task.
For example, the IMDB dataset comes with sentiment labels, so we can use a very simple architecture: each word is mapped to a one-dimensional embedding, and the word embeddings are averaged and passed through a logistic regression layer.
This is sufficient because the task is binary sentiment classification.
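The training loop behind this architecture can be sketched with plain gradient descent. The toy corpus below stands in for the IMDB reviews, and the architecture is simplified further by dropping the logistic layer's own weight (the sentence score is just the mean embedding); this is an illustrative assumption, not the exact setup of the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus standing in for IMDB reviews (1 = positive sentiment).
reviews = [("great movie love it", 1),
           ("worst movie waste of time", 0),
           ("great acting great story", 1),
           ("terrible waste love lost", 0)]

vocab = sorted({w for text, _ in reviews for w in text.split()})
emb = {w: rng.normal(scale=0.01) for w in vocab}  # one scalar per word

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(500):
    for text, y in reviews:
        words = text.split()
        z = np.mean([emb[w] for w in words])  # sentence = mean embedding
        p = sigmoid(z)                        # predicted sentiment probability
        # Gradient of the log loss w.r.t. each word's embedding.
        for w in words:
            emb[w] -= lr * (p - y) / len(words)

# After training, words from positive reviews ("great") end up with
# positive embeddings, words from negative ones ("worst", "waste")
# with negative embeddings: the representation itself was learned.
```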

Once the training is complete, we can extract the embeddings and inspect them in Excel.
When plotting the embeddings on the x-axis and word frequency on the y-axis, a clear pattern emerges:
- positive values are associated with words such as great or wonderful,
- negative values are associated with words such as worst or waste.
Depending on the implementation, the sign can be reversed, as the logistic regression layer also has parameters that influence the final prediction.

Finally, in Excel, we recreate the full pipeline with the embeddings learned earlier.
Input column
The input text (review) is split into words, and each row corresponds to one word.
Embedding lookup
Using a lookup function, the embedding value associated with each word is retrieved from the embedding table learned during training.
Running average
The running average embedding is computed by averaging the embeddings of all words observed so far. This corresponds to a very simple sentence representation: the mean of the word vectors.
Probability prediction
The average embedding is then passed through a logistic (sigmoid) function to generate a sentiment probability.
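The four Excel columns above map directly onto a few lines of code. The embedding values below are hypothetical stand-ins for an exported embedding table; each row of the output mirrors one spreadsheet row: word, running average, sentiment probability.

```python
import math

# Hypothetical embedding table, as it might be exported after training.
emb = {"great": 2.1, "love": 1.5, "plot": 0.1, "worst": -2.4, "waste": -1.9}

def predict(words):
    """Replicate the Excel columns: lookup, running average, sigmoid."""
    total, rows = 0.0, []
    for n, w in enumerate(words, start=1):
        total += emb.get(w, 0.0)                  # embedding lookup (0 if unknown)
        running_mean = total / n                  # running average of embeddings
        prob = 1 / (1 + math.exp(-running_mean))  # sentiment probability
        rows.append((w, running_mean, prob))
    return rows

rows = predict("great plot love".split())
# Each tuple in `rows` corresponds to one spreadsheet row; the last
# probability is the prediction for the full review so far.
```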

What we see
- Words with strong positive embeddings (e.g. great, love, pleasure) push the average up.
- Words with strong negative embeddings (e.g. worst, terrible, waste) pull the prediction down.
- Neutral or weakly weighted words have little influence.
As more words are added, the running average embedding stabilizes, and the sentiment prediction becomes more confident.
2.3 Word2Vec: embedding from co-occurrence
In Word2Vec, similarity does not mean that two words have the same meaning.
It means that they appear in similar contexts.
Word2Vec learns word embeddings by looking at which words tend to occur together within a fixed window in the text. Two words are considered similar if they frequently appear surrounded by similar neighboring words, even if their meanings are opposite.
As shown in the Excel sheet below, we compute the cosine similarity to the word good and return the most similar words.

From the model's perspective, the surrounding words are almost identical. The only thing that changes is the adjective itself.
As a result, Word2Vec learns that good and bad play the same role in language, although their meanings are opposite.
So, Word2Vec captures distributional similarity, not semantic polarity.
A useful way to think about it is:
Words are close when they are used in similar places.
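The cosine-similarity computation from the Excel sheet can be sketched as follows. The 3-dimensional vectors are invented toy values, chosen so that good and bad are close (they appear in the same contexts, e.g. "the movie was ___") while movie lives elsewhere; real word2vec vectors typically have hundreds of dimensions.

```python
import numpy as np

# Hypothetical word2vec-style vectors (toy values for illustration).
vectors = {
    "good":  np.array([0.9, 0.8, 0.1]),
    "bad":   np.array([0.8, 0.9, 0.1]),
    "movie": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: dot product of the normalized vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rank every other word by its similarity to "good".
sims = {w: cosine(vectors["good"], v) for w, v in vectors.items() if w != "good"}
# "bad" comes out as the word most similar to "good": distributional
# similarity, not semantic polarity.
```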
2.4 How embeddings are used
In modern systems such as RAG (Retrieval-Augmented Generation), embeddings are often used to retrieve documents or passages for answering questions.
However, this approach has limitations.
The most commonly used embeddings are trained in a self-supervised way, based on co-occurrence or context-prediction objectives. As a result, they capture general linguistic similarity, not task-specific meaning.
This means that:
- embeddings may return text that is linguistically similar but not relevant
- semantic proximity does not guarantee an accurate answer
Other embedding techniques can be used, including task-oriented or supervised embeddings, but the most commonly used ones remain self-supervised at their core.
Understanding how embeddings are built, what they capture, and what they don't, is important before using them in downstream systems like RAG.
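RAG-style retrieval reduces to the same cosine-similarity ranking, applied to passages instead of words. Everything below is hypothetical: the passages, their 2-dimensional embeddings, and the query vector are invented toy values standing in for the output of a real sentence encoder.

```python
import numpy as np

# Hypothetical passage embeddings, as a sentence encoder might produce.
passages = {
    "The film has a great plot.":      np.array([0.9, 0.1]),
    "The movie was a waste of time.":  np.array([0.8, 0.3]),
    "Tickets can be refunded online.": np.array([0.1, 0.9]),
}
query = np.array([0.85, 0.2])  # e.g. an embedded question about the movie

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rank passages by similarity to the query, most similar first.
ranked = sorted(passages, key=lambda p: cosine(query, passages[p]), reverse=True)
# The top passages are linguistically close to the query, but closeness
# alone does not guarantee they actually answer the question.
```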
Conclusion
Embeddings are learned numerical representations of words that make similarity quantifiable.
Whether learned through supervision or co-occurrence, embeddings map words into vectors based on how they are used in the data. By exporting them to Excel, we can directly examine these representations, compute similarities, and understand what they capture and what they don't.
This demystifies embeddings and clarifies their role as the foundation of more complex systems such as retrieval and RAG.


