Learning Word Vectors for Sentiment Analysis: A Python Reproduction

The idea for this article came to me when I tried to reproduce the paper “Learning Word Vectors for Sentiment Analysis” by Maas et al. (2011). We automated the analysis and made the code available on GitHub.
At the time, I was still in my final year of engineering school. The goal was to reproduce the paper, challenge the authors’ methods, and, if possible, compare them with other word representations, including LLM-based approaches.
What struck me was how simple and elegant the method was. In a way, it reminded me of logistic regression in credit scoring: simple, interpretable, and still powerful when used correctly.
I enjoyed reading this paper so much that I decided to share what I learned from it.
I strongly recommend reading the original paper. It will help you understand what is at stake in word representation, especially how to analyze the proximity between two words from both a semantic perspective and a sentiment polarity perspective, given the specific contexts in which those words are used.
At first, the model seems simple: build a vocabulary, learn word vectors, incorporate sentiment information, and evaluate the results on IMDb reviews.
But when I started implementing it, I realized that several details matter a lot: how the vocabulary is built, how document vectors are represented, how the semantic objective is optimized, and how the sentiment signal is injected into the word vectors.
In this article, we will reproduce the main ideas of the paper using Python.
We will first explain the intuition behind the model. Then we will present the structure of the data used in the paper, construct the vocabulary, implement the semantic component, add the sentiment objective, and finally evaluate the learned representations using a linear SVM classifier.
The SVM will allow us to measure the classification accuracy and compare our results with those reported in the paper.
What problem does the paper solve?
Traditional Bag of Words models are useful for classification, but they do not learn meaningful relationships between words. For example, the words wonderful and amazing should be close because they express similar meaning and similar sentiment. On the other hand, wonderful and terrible may appear in similar movie review contexts, but they express opposite sentiments.
The goal of the paper is to learn word vectors that capture both semantic similarity and sentiment orientation.
Data structure
The dataset contains:
- 25,000 labeled training reviews or documents
- 50,000 unlabeled training reviews
- 25,000 labeled test reviews
The labeled reviews are polarized:
- Negative reviews have ratings from 1 to 4
- Positive reviews have ratings from 7 to 10
The ratings are linearly mapped to the interval [0, 1], which allows the model to treat sentiment as a continuous probability of positive polarity.
aclImdb/
├── train/
│   ├── pos/    "0_10.txt"  -> review #0, 10 stars, very positive
│   │           "1_7.txt"   -> review #1, 7 stars, positive
│   ├── neg/    "10_2.txt"  -> review #10, 2 stars, very negative
│   │           "25_4.txt"  -> review #25, 4 stars, negative
│   └── unsup/  "938_0.txt" -> review #938, 0 stars, unlabeled
└── test/
    ├── pos/    positive reviews, never seen during training
    └── neg/    negative reviews, never seen during training
We can therefore store each document in a Review class with the following attributes: text, stars, label, and bucket.
Of course, it does not have to be a class specifically named Review. Any object can be used as long as it provides at least these attributes.
from dataclasses import dataclass

@dataclass
class Review:
    text: str    # raw (later cleaned) review text
    stars: int   # star rating parsed from the file name (0 for unlabeled reviews)
    label: str   # "pos", "neg", or "unsup"
    bucket: str  # "train" or "test"
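For illustration, here is a minimal loader sketch built around this Review class. The function names, the use of the folder name as the label, and the linear mapping of star ratings onto [0, 1] are our own choices for this reproduction, not something prescribed by the paper.

import glob
import os

def sentiment_label(stars: int) -> float:
    """Map a 1-10 star rating linearly onto [0, 1]."""
    return (stars - 1) / 9.0

def load_reviews(root: str, bucket: str) -> list:
    """Walk aclImdb/<bucket>/{pos,neg,unsup} and build Review objects.
    File names look like '<id>_<stars>.txt', e.g. '0_10.txt'."""
    reviews = []
    for label in ("pos", "neg", "unsup"):
        folder = os.path.join(root, bucket, label)
        if not os.path.isdir(folder):
            continue  # the test split has no unsup/ folder
        for path in glob.glob(os.path.join(folder, "*.txt")):
            stars = int(os.path.basename(path).split("_")[1].split(".")[0])
            with open(path, encoding="utf-8") as f:
                reviews.append(Review(text=f.read(), stars=stars,
                                      label=label, bucket=bucket))
    return reviews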
Vocabulary construction
The paper builds a fixed vocabulary by first ignoring the 50 most frequent terms, then keeping the next 5,000 most frequent tokens.
No stemming is applied. No standard stopword removal is used. This is important because some stopwords, especially negations, can carry sentiment information.
Before building this vocabulary, we first need to look at the raw data.
We noticed that the reviews are not fully cleaned. Some documents contain HTML tags, so we remove them during the data loading step. We also remove punctuation attached to words, such as ".", ",", "!", or "?".
This is a slight difference from the original paper. The authors keep some non-word tokens because they may help capture sentiment. For example, "!" or ":-)" can carry emotional information. In our implementation, we choose to remove this punctuation and later evaluate how much this decision affects the final model performance.
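As a sketch of this cleaning step (the regular expressions and the lowercasing are our own choices; the paper keeps non-word tokens):

import re

_TAG_RE = re.compile(r"<[^>]+>")        # HTML tags such as <br />
_PUNCT_RE = re.compile(r'[.,!?;:"()]')  # punctuation removed in our variant

def clean_text(text: str) -> str:
    """Strip HTML tags and punctuation, lowercase, and collapse whitespace."""
    text = _TAG_RE.sub(" ", text)
    text = _PUNCT_RE.sub(" ", text.lower())
    return " ".join(text.split())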
When working with text data, the next question is always the same:
How should we represent documents and words numerically?
The authors start by collecting all tokens from the training set, including both labeled and unlabeled reviews. We can think of this as putting all words from the training documents into one large basket.
Then, to represent words in a space where we can train a model, they build a set of words called the vocabulary.
The authors build a dictionary that maps each token, which we will loosely call a word, to its frequency. This frequency is simply the number of times the token appears in the full training set, including both labeled and unlabeled reviews.
Then they select the 5,000 most frequent words, after removing the 50 most frequent terms.
These 5,000 words form the vocabulary V.
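A possible implementation of this vocabulary step, reusing the clean_text helper sketched above (the function name and signature are ours):

from collections import Counter

def build_vocabulary(reviews, skip_top: int = 50, size: int = 5000) -> dict:
    """Count token frequencies over all training reviews (labeled and unlabeled),
    drop the `skip_top` most frequent terms, keep the next `size` tokens,
    and map each kept word to a column index."""
    counts = Counter()
    for review in reviews:
        counts.update(clean_text(review.text).split())
    ranked = [word for word, _ in counts.most_common(skip_top + size)]
    kept = ranked[skip_top:]  # ignore the 50 most frequent terms
    return {word: j for j, word in enumerate(kept)}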
Each word in V will correspond to one column of the representation matrix R. The authors choose to represent each word in a 50-dimensional space, so R has shape β × |V|, that is, 50 × 5,000.
Each column of R is the vector representation φ_w of one word w.
The goal of the model is to learn this matrix R so that the word vectors capture two things at the same time:
- Semantic information, meaning words used in similar contexts should be close;
- Sentiment information, meaning words carrying similar polarity should also be close.
This is the central idea of the paper.
Once the data is loaded, cleaned, and the vocabulary is built, we can move to the construction of the model itself.
The first part of the model is unsupervised. It learns semantic word representations from both labeled and unlabeled reviews.
Then, the second part adds supervision by using the star ratings to inject sentiment into the same vector space.
Semantic component
The semantic component defines a probabilistic model of a document.
Each document is associated with a latent vector theta. This vector represents the semantic direction of the document.
Each word w has a vector representation φ_w, stored as a column of the matrix R.
The probability of observing a word w in a document is given by a softmax model:
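$$
p(w \mid \theta; R, b) = \frac{\exp(\theta^\top \phi_w + b_w)}{\sum_{w' \in V} \exp(\theta^\top \phi_{w'} + b_{w'})}
$$

Here φ_w denotes the column of R associated with word w, and b_w is a word-specific bias term.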
Intuitively, a word becomes likely when its vector is well aligned with the document vector theta.
MAP estimation of theta
The model alternates between two steps.
First, it fixes R and b and estimates one theta vector for each document.
Then, it fixes theta and updates R and b.
The theta vectors are not stored as final parameters. They are temporary document-specific variables used to update the word representations.
To estimate the parameters of the model, the authors use maximum likelihood.
The idea is simple: we want to find the parameters R and b that make the observed documents as likely as possible under the model.
Starting from the probabilistic formulation of a document, they introduce a MAP estimate θ̂ₖ for each document dₖ. Then, by taking the logarithm of the likelihood and adding regularization terms, they obtain the objective function used to learn the word representation matrix R and the bias vector b:
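$$
\sum_{k=1}^{|D|} \Big[ \sum_{i=1}^{N_k} \log p(w_i \mid \hat{\theta}_k; R, b) - \lambda \lVert \hat{\theta}_k \rVert_2^2 \Big] \;-\; \nu \lVert R \rVert_F^2
$$

Here Nₖ is the number of tokens in document dₖ and |D| is the number of training documents; we write the Gaussian prior on θ̂ₖ and the regularizer on R as explicit penalty terms, with the corresponding constants absorbed into λ and ν.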
This objective is maximized with respect to R and b. The hyperparameters in the model are the regularization weights (λ and ν) and the word vector dimensionality β.
In this step, we learn the semantic representation matrix. This matrix captures how words relate to each other based on the contexts in which they appear.
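To make the alternating scheme concrete, here is a minimal NumPy sketch of one pass. It is only an illustration: the paper's actual optimizer, learning rates, and stopping criteria differ, and the names (softmax_probs, semantic_step, doc_counts) are ours.

import numpy as np

def softmax_probs(theta, R, b):
    """p(w | theta) over the whole vocabulary for one document vector theta."""
    scores = R.T @ theta + b   # shape (|V|,)
    scores -= scores.max()     # numerical stability
    e = np.exp(scores)
    return e / e.sum()

def semantic_step(doc_counts, R, b, lam=1.0, nu=1e-4, lr=1e-3, inner_iters=20):
    """One alternating pass: estimate a MAP theta per document with R and b fixed,
    then take one gradient step on R and b with the thetas fixed.
    doc_counts is a list of word-count vectors of shape (|V|,), one per document."""
    beta = R.shape[0]
    grad_R = np.zeros_like(R)
    grad_b = np.zeros_like(b)
    for counts in doc_counts:
        theta = np.zeros(beta)
        n_words = counts.sum()
        for _ in range(inner_iters):  # gradient ascent on the regularized log-likelihood
            p = softmax_probs(theta, R, b)
            theta += lr * (R @ (counts - n_words * p) - 2.0 * lam * theta)
        p = softmax_probs(theta, R, b)
        grad_R += np.outer(theta, counts - n_words * p)  # gradient w.r.t. word vectors
        grad_b += counts - n_words * p                   # gradient w.r.t. biases
    R += lr * (grad_R - 2.0 * nu * R)
    b += lr * grad_b
    return R, b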
Sentiment component
The semantic model alone can learn that words occur in similar contexts. But this is not enough to capture sentiment.
For example, wonderful and terrible may both occur in movie reviews, but they express opposite opinions.
To solve this, the paper adds a supervised sentiment objective:
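Each word of a labeled document is asked to predict that document's sentiment label s through a logistic model:

$$
p(s = 1 \mid w; R, \psi, b_c) = \sigma(\psi^\top \phi_w + b_c),
$$

and the sentiment objective sums the log-likelihood of the label over all words of all labeled documents:

$$
\sum_{k=1}^{|D|} \sum_{i=1}^{N_k} \log p(s_k \mid w_i; R, \psi, b_c).
$$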
The vector ψ defines a sentiment direction in the word vector space. Here, only the labeled data are used.
If a word vector lies on one side of the hyperplane, it is considered positive. If it lies on the other side, it is considered negative.
They combine the semantic objective and the sentiment objective to build the full learning objective:
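Up to the sign convention used for the regularizers, the combined objective has the form:

$$
\sum_{k=1}^{|D|} \Big[ \sum_{i=1}^{N_k} \log p(w_i \mid \hat{\theta}_k; R, b) - \lambda \lVert \hat{\theta}_k \rVert_2^2 \Big] \;+\; \sum_{k \in \text{labeled}} \frac{1}{|S_k|} \sum_{i=1}^{N_k} \log p(s_k \mid w_i; R, \psi, b_c) \;-\; \nu \lVert R \rVert_F^2
$$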
The first part learns semantic similarity. The second part injects sentiment information. The regularization terms prevent the vectors from growing too large.
|Sₖ| denotes the number of documents in the dataset with the same rounded value of sₖ. This weighting is introduced to combat the well-known imbalance in ratings present in review collections.
Classification and results
Once the word representation matrix R has been learned, we can use it to build document-level features.
The objective is now to classify each movie review as positive or negative.
To do this, the authors train a linear SVM on the 25,000 labeled training reviews and evaluate it on the 25,000 labeled test reviews.
The important question is not only whether the word vectors are meaningful, but whether they help improve sentiment classification.
To answer this question, we evaluate several document representations and compare them with the results reported in Table 2 of the paper.
The only thing that changes from one configuration to another is the way each review is represented before being passed to the classifier.
1. Bag of Words baseline
The first representation is a standard Bag of Words. In the paper, this baseline is reported as Bag of Words (bnc). The notation means:
- b = binary weighting
- n = no IDF weighting
- c = cosine normalization
A review or document is represented by a vector v of size 5000, because the vocabulary contains 5,000 words.
For each word j in the vocabulary, we set vⱼ = 1 if word j appears at least once in the document, and vⱼ = 0 otherwise.
So this representation only records whether a word appears at least once. It does not count how many times it appears.
Then the vector is normalized by its Euclidean norm: v ← v / ‖v‖₂.
This gives the Bag of Words baseline used to train the SVM.
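A minimal version of this baseline, assuming the vocabulary mapping and clean_text helper sketched earlier:

import numpy as np

def bow_bnc(doc_tokens, vocab) -> np.ndarray:
    """Binary bag of words with cosine normalization (the paper's 'bnc' weighting)."""
    v = np.zeros(len(vocab))
    for token in doc_tokens:
        j = vocab.get(token)
        if j is not None:
            v[j] = 1.0                 # binary: presence only, no counts
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v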
This baseline is strong because sentiment classification often relies on direct lexical clues. Words such as excellent, boring, awful, or great already carry useful sentiment information.
2. Semantic-only word vector representation
The second representation uses the word vectors learned by the semantic-only model.
The authors first represent a document as a Bag of Words vector v. Then they compute a dense document representation by multiplying this vector by the learned matrix: d = Rv, where R here denotes the matrix learned with the semantic objective only.
This vector can be interpreted as a weighted combination of the word vectors that appear in the review.
In the paper, when generating document features through the product Rv, the authors use bnn weighting for v. This means:
- b = binary weighting
- n = no IDF weighting
- n = no cosine normalization before projection
Then, after computing Rv, they apply cosine normalization to the final dense vector.
So the final representation is Rv / ‖Rv‖₂.
This representation uses semantic information learned from the training reviews, including both labeled and unlabeled documents.
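A sketch of this projection, assuming R is the learned β × |V| matrix and vocab is the word-to-column mapping from earlier:

import numpy as np

def dense_features(doc_tokens, vocab, R) -> np.ndarray:
    """Project a binary (bnn) bag-of-words vector through R, then cosine-normalize."""
    v = np.zeros(R.shape[1])
    for token in doc_tokens:
        j = vocab.get(token)
        if j is not None:
            v[j] = 1.0                 # bnn: binary, no IDF, no normalization
    d = R @ v                          # weighted combination of word vectors
    norm = np.linalg.norm(d)
    return d / norm if norm > 0 else d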
3. Full semantic + sentiment representation
The third representation follows the same construction, but uses the full matrix Rfull.
This matrix is learned with both components of the model:
- the semantic objective, which learns contextual similarity between words;
- The sentiment objective, which injects polarity information from the star ratings.
For each document, we compute d = Rfull v.
Then we normalize: d ← d / ‖d‖₂.
The intuition is that Rfull should produce document features that capture both what the review is about and whether the language is positive or negative.
This is the main contribution of the paper: learning word vectors that combine semantic similarity and sentiment orientation.
4. Full representation + Bag of Words
The final configuration combines the learned dense representation with the original Bag of Words representation.
We concatenate the two representations into a single feature vector: the cosine-normalized Rfull v vector followed by the Bag of Words vector v.
This gives the classifier two complementary sources of information:
- a dense 50-dimensional representation learned by the model;
- a sparse lexical representation that preserves exact word-presence information.
This combination is useful because word vectors can generalize across similar words, while Bag of Words features keep precise lexical evidence.
For example, the dense representation may learn that wonderful and amazing are close, while the Bag of Words representation still preserves the exact presence of each word.
We then train a linear SVM on the labeled training set and evaluate it on the test set.
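A small evaluation helper along these lines, using scikit-learn's LinearSVC (the C value and the function names are our own choices):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def concat_features(dense: np.ndarray, bow: np.ndarray) -> np.ndarray:
    """Concatenate the dense Rv features with the sparse bag-of-words features."""
    return np.hstack([dense, bow])

def evaluate(train_X, train_y, test_X, test_y, C: float = 1.0) -> float:
    """Train a linear SVM on one document representation and return test accuracy."""
    clf = LinearSVC(C=C)
    clf.fit(train_X, train_y)
    return accuracy_score(test_y, clf.predict(test_X))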
This allows us to answer two questions.
First, do the learned word vectors improve sentiment classification?
Second, does adding sentiment information to the word vectors help beyond semantic information alone?
Implementation in Python
We implement the model in five steps:
- Load and clean the IMDb dataset
- Build the vocabulary
- Train the semantic component
- Train the full semantic + sentiment model
- Evaluate the learned representations using SVM
The table below shows the nearest neighbors of selected target words in the learned vector space.
For each target word, we report the five most similar words according to cosine similarity. The full model, which combines the semantic and sentiment objectives, tends to retrieve words that are close both in meaning and in sentiment orientation. The semantic-only model captures contextual and lexical similarity, but it does not explicitly use sentiment labels during training.
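Cosine-similarity neighbors can be computed directly from the columns of R, for example (the function name is ours):

import numpy as np

def nearest_neighbors(word, vocab, R, k: int = 5):
    """Return the k words whose vectors are closest to `word` by cosine similarity."""
    index_to_word = {j: w for w, j in vocab.items()}
    cols = R / (np.linalg.norm(R, axis=0, keepdims=True) + 1e-12)  # unit-norm columns
    sims = cols.T @ cols[:, vocab[word]]
    order = np.argsort(-sims)
    return [(index_to_word[j], float(sims[j])) for j in order if j != vocab[word]][:k]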
The table below compares our results with the results reported in the paper. For each representation, we train a linear SVM on the labeled training reviews and report the classification accuracy on the test set. This allows us to evaluate how well each document representation performs on the IMDb sentiment classification task.

The full model is very close to the result reported in the paper. This suggests that the sentiment objective is implemented correctly.
The largest gap appears in the semantic-only model. This may come from optimization details, preprocessing, or the way document-level features are constructed for classification.
Conclusion
In this article, we reproduced the main components of the model proposed by Maas et al. (2011).
We implemented the semantic objective, added the sentiment objective, and evaluated the learned word vectors on IMDb sentiment classification.
The model shows how unlabeled data can help learn semantic structure, while labeled data can inject sentiment information into the same vector space.
This is a simple but powerful idea: word vectors should not only capture what words mean, but also how they feel.
While this post does not cover every detail of the paper, we highly recommend reading the authors’ original work. Our goal was to share the ideas that inspired us and the enjoyment we found both in reading the paper and writing this post.
We hope you enjoy it as much as we did.
Image Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
References
[1] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
Dataset: IMDb Large Movie Review Dataset (CC BY 4.0).



