
Machine Learning “Advent Calendar” Day 23: CNN in Excel

CNNs are usually introduced with pictures, because pictures are often easier to understand.

The filter slides over the pixels and detects edges, shapes, or textures. You can read this article I wrote earlier to understand how CNNs work on images with Excel.

For text, the idea is the same.

Instead of pixels, we slide the filters over words.

Instead of visual patterns, we find linguistic patterns.

And many of the important patterns in text are very local. Let's take some simple examples:

  • “good” is positive
  • “bad” is negative
  • “wrong” is negative
  • “not bad” is often positive

In my previous article, we saw how to represent words as numbers using embeddings.

We also noticed an important limitation: with a global average, word order is completely ignored.

From the model's perspective, “not good” and “good not” look exactly the same.

So the next challenge is clear: we want the model to take word order into account.

A 1D Convolutional Neural Network (Conv1D) is a natural tool for this, because it scans the sentence in small sliding windows and responds when it sees local patterns.

1. Understanding 1D CNN for text: Structure and depth

1.1. Creating a 1D CNN for text in Excel

In this article, we build a 1D CNN architecture in Excel with the following components:

  • An embedding dictionary
    We use 2-dimensional embeddings, because one dimension is not enough for this job.
    One dimension encodes sentiment and the second encodes negation.
  • Conv1D layer
    This is the core part of the CNN architecture.
    It contains filters that slide over the sentence with a window of 2 words. We choose 2 words for simplicity.
  • ReLU and global max pooling
    These steps retain only the strongest matches found by the filters.
    We will also discuss the fact that ReLU is optional.
  • Logistic regression
    This is the final classification layer, which combines the detected patterns into a probability.
1D CNN in Excel – all images by the author

This pipeline corresponds to a standard CNN text classifier.
The only difference here is that we write out and visualize every step in Excel.

1.2. What is meant by “deep learning” in this diagram

Before continuing, let's take a step back.
Yes, I know, I do this often, but having a global view of the models really helps to understand them.

The definition of deep learning tends to be fuzzy.
For many people, deep learning means “multiple layers”.

Here, I will take a slightly different view.

What really marks deep learning is not the number of layers, but the depth of the transformation applied to the input data.

With this definition:

  • Even a model with a single convolution layer can be considered deep learning,
  • because the input is transformed into a new, more abstract representation.

On the other hand, taking raw input data, applying one-hot encoding, and stacking multiple fully connected layers does not make the model deep in a meaningful sense.
In theory, without non-linearities, stacked fully connected layers collapse into a single layer anyway.

For CNNs, stacking multiple layers has a very practical motivation.

Consider a sentence like:

This movie is not very good at all

With a single convolution layer and a small window, we can find simple local patterns like: “very + good”

But we cannot detect higher-level patterns like: “not + (very good)”

This is why CNN layers are often stacked:

  • the first layer finds simple local patterns,
  • the second layer combines them into more complex ones.

In this article, we deliberately focus on a single convolution layer.
This makes each step visible and easy to follow in Excel, while keeping the logic similar to a deep CNN architecture.

2. Converting words into embeddings

Let's start with simple words. We will try to detect negation, so we will use these terms, plus any other words (which we will not model):

  • “good”
  • “bad”
  • “not good”
  • “not bad”

We keep the presentation small on purpose so that every step can be seen.

We will only use a dictionary of three words: good, bad, and not.

All other words will have 0 as the embedding.

2.1 Why one dimension is not enough

In the previous article about sentiment detection, we used a single feature.
That worked for “good” versus “bad”.

But now we want to host carelessness.

One dimension can represent one concept well.
So we need two dimensions:

  • cent: sentiment polarity
  • neg: negation marker

2.2 The embedding dictionary

Each word becomes a 2D vector:

  • good → (cent = +1, neg = 0)
  • bad → (cent = -1, neg = 0)
  • not → (cent = 0, neg = +1)
  • any other word → (0, 0)

This is not what real embeddings look like. Real embeddings are learned, high-dimensional, and not directly interpretable.

But to understand how Conv1D works, this toy embedding is fine.

In Excel, this is just a lookup table.
In a real neural network, this embedding matrix can be trained.
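As a minimal sketch outside Excel (in Python; the names `EMB` and `embed` are my own, not from the article), the lookup table is just a dictionary:

```python
# Toy embedding dictionary: each word maps to a (cent, neg) pair.
EMB = {"good": (1, 0), "bad": (-1, 0), "not": (0, 1)}

def embed(sentence):
    """Map each word to its 2-d vector; any other word gets (0, 0)."""
    return [EMB.get(word, (0, 0)) for word in sentence.lower().split()]

print(embed("not bad"))  # [(0, 1), (-1, 0)]
```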

3. Conv1D filters as sliding pattern detectors

Now we come to the core idea of a 1D CNN.

A Conv1D filter is nothing mysterious. It is just a small set of weights and a bias that slides over the sentence.

Because:

  • each word embedding has 2 values (cent, neg)
  • our window contains 2 words

each filter has:

  • 4 weights (2 dimensions × 2 positions)
  • 1 bias

That is all.

You can think of a filter as repeatedly asking the same question at every position:

“Do these two neighboring words match the pattern I care about?”
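To make this concrete, here is a sketch of one filter in Python (the function name and weight layout are my own assumptions, not the article's):

```python
def apply_filter(window, weights, bias):
    """Score one 2-word window of (cent, neg) vectors with 4 weights + 1 bias."""
    (c0, n0), (c1, n1) = window            # previous word, current word
    w_c0, w_n0, w_c1, w_n1 = weights
    return w_c0 * c0 + w_n0 * n0 + w_c1 * c1 + w_n1 * n1 + bias

# The "not good" pattern from section 3.2: z = neg(previous) + cent(current) - 1
weights, bias = (0, 1, 1, 0), -1
print(apply_filter([(0, 1), (1, 0)], weights, bias))  # window "not good" -> 1
```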

3.1 Sliding windows: how Conv1D sees a sentence

Consider this sentence:

it's not bad at all

We choose a window size of 2 words.

That means the model looks for all adjacent pairs:

  • (it, 's)
  • ('s, not)
  • (not, bad)
  • (bad, at)
  • (at, all)

Important point:
Filters slide over every position, even if both words are neutral (all zeros).

3.2 Four hand-designed filters

To make it easier to understand behavior, we use four filters.

Filter 1 – “I SEE GOOD”

This filter only looks at the sentiment of the current word.

A plain text equation for one window:

z = cent(current_word)

If the word is “good”, z = 1
If the word is “bad”, z = -1
If the word is neutral, z = 0

After ReLU, negative values become 0 (we will see later that this step is optional).

Filter 2 – “I SEE BAD”

This one is symmetric.

z = -cent(current_word)

So:

  • “bad” → z = 1
  • “good” → z = -1 → ReLU → 0

Filter 3 – “I SEE NOT GOOD”

This filter looks for two things at once:

  • neg(previous_word)
  • cent(current_word)

Formula:

z = neg(previous_word) + cent(current_word) – 1

Why “-1”?
It acts as a threshold so that both conditions must be true at the same time.

Results:

  • “not good” → 1 + 1 – 1 = 1 → activated
  • “good” alone → 0 + 1 – 1 = 0 → not activated
  • “not bad” → 1 – 1 – 1 = -1 → ReLU → 0

Filter 4 – “I SEE NOT BAD”

Same idea, slightly different sign:

z = neg(previous_word) – cent(current_word) – 1

Results:

  • “not bad” → 1 + 1 – 1 = 1
  • “not good” → 1 – 1 – 1 = -1 → 0

This is a very important intuition:

A CNN filter can act like a small local logical rule, learned from the data.

3.3 The final effect of sliding windows

Here are the outputs of these 4 filters on every window.
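As a sketch of the whole convolution step (Python rather than Excel; the filter table and function names are mine, but the weights follow the four formulas above):

```python
# Toy embedding from section 2.
EMB = {"good": (1, 0), "bad": (-1, 0), "not": (0, 1)}

# Each filter: weights for (cent_prev, neg_prev, cent_cur, neg_cur), then a bias.
FILTERS = {
    "good":     ((0, 0,  1, 0),  0),   # cent(current)
    "bad":      ((0, 0, -1, 0),  0),   # -cent(current)
    "not_good": ((0, 1,  1, 0), -1),   # neg(previous) + cent(current) - 1
    "not_bad":  ((0, 1, -1, 0), -1),   # neg(previous) - cent(current) - 1
}

def conv1d(words):
    """Slide every filter over all 2-word windows; return raw scores per filter."""
    vecs = [EMB.get(w, (0, 0)) for w in words]
    return {
        name: [w[0] * vecs[i][0] + w[1] * vecs[i][1]
               + w[2] * vecs[i + 1][0] + w[3] * vecs[i + 1][1] + b
               for i in range(len(vecs) - 1)]
        for name, (w, b) in FILTERS.items()
    }

scores = conv1d(["it", "'s", "not", "bad", "at", "all"])
print(scores["not_bad"])  # [-1, -1, 1, -1, -1] -> fires only on (not, bad)
```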

4. ReLU and global max pooling: from local to global

4.1 ReLU

After computing z for all windows, we apply ReLU:

ReLU(z) = max(0, z)

Interpretation:

  • evidence against the pattern is ignored
  • evidence for the pattern is kept

Each filter becomes a presence detector.

By the way, ReLU is just the activation function of this neural network. See, neural networks are not that difficult.

4.2 Global max pooling

Then comes global max pooling.

For each filter, we only store:

the maximum activation over all windows

Interpretation:
“I don't care where the pattern occurs, only whether it appears strongly somewhere.”

At this point, the entire sentence is reduced to 4 numbers:

  • the strongest “good” signal
  • the strongest “bad” signal
  • the strongest “not good” signal
  • the strongest “not bad” signal
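A minimal sketch of these two steps in Python (the example scores are the “not bad” filter's outputs on the 5 windows of “it's not bad at all”; the function names are mine):

```python
def relu(z):
    # Keep positive evidence, drop evidence against the pattern.
    return max(0, z)

def global_max_pool(window_scores):
    # Keep only the strongest post-ReLU activation over all windows.
    return max(relu(z) for z in window_scores)

# Raw "not bad" filter scores over the 5 windows of "it's not bad at all":
print(global_max_pool([-1, -1, 1, -1, -1]))  # 1
```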

4.3 What happens if we remove ReLU?

Without ReLU:

  • negative values pass through unchanged
  • max pooling can pick up misleading values

This conflates two different situations:

  • the absence of the pattern
  • the opposite of the pattern

The filter ceases to be a pure presence detector and becomes a signed score.

The model still works mathematically, but interpretation becomes difficult.

5. The final layer: logistic regression

Now we combine these signals.

We calculate the score using a linear combination:

score = 2 × F_good – 2 × F_bad – 3 × F_not_good + 3 × F_not_bad + bias

Then we convert the score to probability:

probability = 1 / (1 + exp(-score))

That is exactly logistic regression.

So yes:

  • the CNN extracts the features: this step can be seen as learned feature engineering
  • logistic regression makes the final decision: a classic machine learning model we know well
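A sketch of this last step in Python (the bias of 0 is my assumption, and I use a positive weight on F_not_bad so that a detected “not bad” pushes the score toward positive):

```python
import math

def predict(F_good, F_bad, F_not_good, F_not_bad, bias=0.0):
    """Linear combination of the 4 pooled signals, then a sigmoid."""
    score = 2 * F_good - 2 * F_bad - 3 * F_not_good + 3 * F_not_bad + bias
    return 1 / (1 + math.exp(-score))   # probability of positive sentiment

# "not bad" triggers both the "bad" and the "not bad" detectors:
print(predict(F_good=0, F_bad=1, F_not_good=0, F_not_bad=1))  # ~0.73, positive
```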

6. Full examples with sliding filters

Example 1

“It's bad, so it's not good at all”

The sentence contains “bad”, “good”, and the pair “not good”.

After ReLU and global max pooling:

  • F_good = 1 (because “good” exists)
  • F_bad = 1
  • F_not_good = 1
  • F_not_bad = 0

The final score is negative.
Prediction: negative sentiment.

Example 2

“Good. Yeah, not bad.”

The sentence contains “good”, “bad”, and the pair “not bad”.

After ReLU and global max pooling:

  • F_good = 1
  • F_bad = 1 (because the word “bad” appears)
  • F_not_good = 0
  • F_not_bad = 1

The final linear layer has learned that “not” placed before “bad” flips the meaning of “bad”.

Prediction: positive sentiment.

This also shows something important: max pooling keeps every strong signal.
The final layer decides how to combine them.

Example 3: a limitation that explains why CNNs are deep

Try this sentence:

“not too bad”

With a window of size 2, the model sees the pairs (not, too) and (too, bad).

It never sees the pair (not, bad), so the “not bad” filter never fires.

This explains why real models use:

  • larger windows
  • multiple convolution layers
  • or other architectures that capture long-range dependencies
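This limitation is easy to check by listing the windows directly (a Python sketch; the helper name is mine):

```python
def windows(words, size=2):
    """All contiguous windows of the given size."""
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]

pairs = windows(["not", "too", "bad"])
print(pairs)                    # [('not', 'too'), ('too', 'bad')]
print(("not", "bad") in pairs)  # False -> the "not bad" filter never fires
```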

Conclusion

Excel's strength is visibility.

You can see:

  • embedding dictionary
  • all filter weights and biases
  • all sliding windows
  • all ReLU activations
  • the global max pooling results
  • parameters of logistic regression

Training is simply the process of adjusting these numbers.

Once you realize that, CNNs cease to be a mystery.

They become what they really are: structured, trainable pattern detectors that move over data.
