
Modern Topic Modeling in Python

Topic modeling uncovers hidden themes in large collections of documents. Traditional methods such as Latent Dirichlet Allocation rely on word frequency and treat text as bags of words, often missing deeper context and meaning.

BERTopic takes a different route, combining transformer embeddings, clustering, and c-TF-IDF to capture semantic relationships between documents. The result is coherent, context-aware topics suitable for real-world data. In this article, we break down how BERTopic works and how you can use it step by step.

What is BERTopic?

BERTopic is a topic modeling framework that treats topic discovery as a pipeline of independent but connected steps. It combines deep learning and classical natural language processing techniques to generate relevant and interpretable topics.

The main idea is to convert documents into semantic embeddings, group them based on similarity, and extract words that represent each group. This approach allows BERTopic to capture both meaning and structure within textual data.

At a high level, BERTopic follows this process:

  1. Embed each document with a transformer model
  2. Reduce the embedding dimensionality with UMAP
  3. Cluster the reduced embeddings with HDBSCAN
  4. Extract representative keywords for each cluster with c-TF-IDF

Each part of this pipeline can be modified or replaced, making BERTopic very flexible for different applications.

Key Components of the BERTopic Pipeline

1. Preprocessing

The first step involves preparing the raw text data. Unlike traditional NLP pipelines, BERTopic does not require heavy preprocessing. Minor cleanups, such as lowercasing text, removing extra spaces, and filtering out very short texts, are usually sufficient.

2. Embedding of Documents

Each document is transformed into a dense vector using transformer-based models such as SentenceTransformers. This allows the model to capture semantic relationships between documents.
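To make "semantic relationships" concrete, here is a stdlib-only sketch of how closeness in embedding space is measured with cosine similarity. The three vectors are made-up toy values, not real transformer outputs; real sentence-transformer models produce vectors with hundreds of dimensions.

```python
import math

# Toy 3-dimensional "embeddings" -- made-up values for illustration only.
# Real sentence-transformer models output vectors with hundreds of dimensions.
v_satellite = [0.9, 0.1, 0.2]
v_space     = [0.8, 0.2, 0.3]
v_religion  = [0.1, 0.9, 0.4]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Semantically related documents should sit closer in embedding space.
print(cosine(v_satellite, v_space) > cosine(v_satellite, v_religion))  # prints True
```

Downstream steps (UMAP, HDBSCAN) rely on exactly this kind of geometric closeness between document vectors.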

Formally:

v_i = Embed(d_i)

where d_i is a document and v_i is its vector representation.

3. Dimensionality Reduction

High-dimensional embeddings are difficult to cluster effectively. BERTopic uses UMAP to reduce dimensionality while preserving the data's structure.

u_i = UMAP(v_i)

This step improves clustering performance and computational efficiency.
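UMAP itself requires the umap-learn package; as a stdlib-only sketch of the general idea of dimensionality reduction (a plain random projection, not UMAP's actual algorithm), the following maps a high-dimensional vector down to a handful of dimensions:

```python
import random

random.seed(42)

# A plain random projection, NOT UMAP: it only illustrates mapping vectors
# from a high-dimensional space into a lower-dimensional one.
HIGH_DIM, LOW_DIM = 384, 5
projection = [[random.gauss(0, 1) for _ in range(HIGH_DIM)]
              for _ in range(LOW_DIM)]

def reduce_dim(vector):
    """Multiply the projection matrix by a HIGH_DIM vector -> LOW_DIM vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in projection]

embedding = [random.random() for _ in range(HIGH_DIM)]  # stand-in document vector
print(len(reduce_dim(embedding)))  # prints 5
```

Unlike this naive projection, UMAP optimizes the low-dimensional layout so that neighboring points stay neighbors, which is why it preserves cluster structure better.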

4. Clustering

After dimensionality reduction, clustering is performed using HDBSCAN. This algorithm groups similar documents into clusters and identifies outliers.

z_i = HDBSCAN(u_i)

where z_i is the cluster (topic) label assigned to document i. Documents labeled -1 are considered outliers.

5. c-TF-IDF Topic Representation

Once the clusters are created, BERTopic generates topic representations using c-TF-IDF.

Term Frequency:

tf_{t,c} = frequency of term t in class c (all documents of a cluster treated as one document)

Inverse Class Frequency:

idf_t = log(1 + A / f_t)

where A is the average number of words per class and f_t is the frequency of term t across all classes.

Final c-TF-IDF:

W_{t,c} = tf_{t,c} · log(1 + A / f_t)

This approach highlights words that are unique within a cluster while reducing the importance of common words across clusters.
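The weighting above can be sketched with only the standard library. The two toy clusters below are illustrative, but the score follows the stated c-TF-IDF definition, tf(t, c) · log(1 + A / f(t)):

```python
import math
from collections import Counter

# Toy clusters: each cluster's documents are concatenated into one class.
clusters = {
    0: "nasa launched a satellite space exploration is growing".split(),
    1: "philosophy is closely related to religion".split(),
}

# tf(t, c): frequency of term t within class c.
tf = {c: Counter(words) for c, words in clusters.items()}

# A: average number of words per class; f(t): frequency of t across classes.
A = sum(len(words) for words in clusters.values()) / len(clusters)
f = Counter(w for words in clusters.values() for w in words)

def ctfidf(term, cluster):
    """c-TF-IDF weight: tf(t, c) * log(1 + A / f(t))."""
    return tf[cluster][term] * math.log(1 + A / f[term])

# A cluster-specific word ("satellite") outweighs one shared by both
# clusters ("is"), which is what makes topic keywords distinctive.
print(ctfidf("satellite", 0) > ctfidf("is", 0))  # prints True
```

Ranking every term of a cluster by this score yields that cluster's keyword list, which is exactly what get_topic() returns later in the article.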

Hands-On Implementation

This section shows a simple implementation of BERTopic using a very small dataset. The goal here is not to build a production-scale topic model, but to understand how BERTopic works step by step. In this example, we preprocess the text, configure UMAP and HDBSCAN, train the BERTopic model, and inspect the generated topics.

Step 1: Import the Libraries and Prepare the Dataset

import re
import umap
import hdbscan
from bertopic import BERTopic

docs = [
"NASA launched a satellite",
"Philosophy and religion are related",
"Space exploration is growing"
] 

In this first step, the required libraries are imported. The re module is used for basic text processing, while umap and hdbscan are used for dimensionality reduction and clustering. BERTopic is the core library that ties these components together into a topic model.

A small list of sample documents is created. These texts belong to different themes, such as space and philosophy, which makes them useful for showing how BERTopic separates text into distinct topics.

Step 2: Preprocess the Text

def preprocess(text):
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    return text.strip()

docs = [preprocess(doc) for doc in docs]

This step performs basic text cleaning. Each document is converted to lowercase so that words like “NASA” and “nasa” are treated as the same token. Extra whitespace is also collapsed to normalize formatting.

Preprocessing is important because it reduces noise in the input. Although BERTopic uses transformer embeddings that are less dependent on heavy text sanitization, this light cleanup still improves consistency and makes the input cleaner for downstream processing.

Step 3: Configure UMAP

umap_model = umap.UMAP(
    n_neighbors=2,
    n_components=2,
    min_dist=0.0,
    metric="cosine",
    random_state=42,
    init="random"
)

UMAP is used here to reduce the dimensionality of document embeddings before clustering. Since embeddings are high dimensional, clustering them directly is often difficult. UMAP helps to express them in a low-dimensional space while preserving their semantic relationships.

The parameter init="random" is important in this example because the dataset is very small. With only three documents, UMAP's default spectral initialization may fail, so random initialization is used to avoid that error. The settings n_neighbors=2 and n_components=2 are chosen to fit this small dataset.

Step 4: Configure HDBSCAN

hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=2,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True
)

HDBSCAN is the clustering algorithm used by BERTopic. Its role is to group similar documents together after dimensionality reduction. Unlike methods such as K-Means, HDBSCAN does not require the number of clusters to be specified in advance.

Here, min_cluster_size=2 means that at least two documents are needed to form a cluster, which is appropriate for such a small example. The prediction_data=True argument allows the model to store information useful for later inference and probability estimation.

Step 5: Create a BERTopic Model

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    calculate_probabilities=True,
    verbose=True
) 

In this step, a BERTopic model is created by passing the custom UMAP and HDBSCAN configurations. This shows the strength of BERTopic: it is modular, so individual components can be customized according to the dataset and the use case.

The calculate_probabilities=True option enables the model to estimate the probability of a topic in each document. The verbose=True option is useful during testing because it shows the progress and internal processing steps while the model is running.

Step 6: Fit the BERTopic Model

topics, probs = topic_model.fit_transform(docs) 

This is the main training step. BERTopic now runs the full pipeline internally:

  1. Converts documents to embeddings
  2. Reduces the embedding dimensionality using UMAP
  3. Clusters the reduced embeddings using HDBSCAN
  4. Extracts topic keywords using c-TF-IDF

The result is stored in two outputs:

  • topics, which contains the topic label assigned to each document
  • probs, which contains probability distributions or confidence values for the assignments

This is the point where the raw documents are converted into a topic-based structure.

Step 7: View Topic Assignments and Topic Information

print("Topics:", topics)
print(topic_model.get_topic_info())

for topic_id in sorted(set(topics)):
    if topic_id != -1:
        print(f"\nTopic {topic_id}:")
        print(topic_model.get_topic(topic_id))

This last step is used to check the output of the model.

  • print("Topics:", topics) shows the topic label assigned to each document.
  • get_topic_info() displays a summary table of all topics, including topic IDs and the number of documents in each topic.
  • get_topic(topic_id) returns the top keywords for the given topic.

The condition if topic_id != -1 excludes outliers. In BERTopic, a topic label of -1 means that the document has not been confidently assigned to any cluster. This is normal behavior in density-based clustering and helps avoid forcing unrelated documents into incorrect topics.
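The same outlier-filtering pattern can be sketched without BERTopic at all. The labels below are hypothetical stand-ins for what fit_transform returns; the point is only that documents labeled -1 are skipped when grouping:

```python
from collections import defaultdict

# Hypothetical labels standing in for fit_transform output; -1 marks outliers.
docs = ["nasa launched a satellite",
        "space exploration is growing",
        "philosophy and religion are related"]
topics = [0, 0, -1]

grouped = defaultdict(list)
for doc, topic_id in zip(docs, topics):
    if topic_id != -1:  # skip documents that were not confidently assigned
        grouped[topic_id].append(doc)

print(dict(grouped))
# {0: ['nasa launched a satellite', 'space exploration is growing']}
```

The outlier document is simply left out of every group rather than being forced into the nearest topic.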

Benefits of BERTopic

Here are the main benefits of using BERTopic:

  • It captures semantic meaning using embeddings
    BERTopic uses transformer-based embeddings to understand text context rather than just word frequency. This allows it to group documents with similar meanings even if they use different words.
  • It automatically determines the number of topics
    Using HDBSCAN, BERTopic does not require a predefined number of topics. It discovers the natural structure of the data, making it suitable for unknown or dynamic datasets.
  • It handles noise and outliers well
    Documents that do not belong to any cluster are labeled as outliers instead of being forced into the wrong topics. This improves the overall quality and clarity of the topics.
  • It generates interpretable topic representations
    With c-TF-IDF, BERTopic extracts keywords that clearly represent each topic. These keywords are distinctive and easy to understand, which makes interpretation straightforward.
  • It is highly modular and customizable
    Each part of the pipeline can be swapped or tuned, such as the embedding model, the clustering algorithm, or the vectorizer. This flexibility allows it to adapt to different datasets and use cases.

Conclusion

BERTopic represents a major advance in topic modeling by combining semantic embeddings, dimensionality reduction, clustering, and class-based TF-IDF. This hybrid approach allows it to produce meaningful and interpretable topics that are closely aligned with human understanding.

Rather than relying solely on word frequency, BERTopic leverages the structure of the semantic space to identify patterns in text data. Its modular design also makes it suitable for many different applications, from analyzing customer feedback to organizing research documents.

In practice, the effectiveness of BERTopic depends on careful selection of embeddings, tuning of clustering parameters, and thoughtful analysis of results. When used correctly, it provides a powerful and efficient solution for modern topic modeling tasks.

Frequently Asked Questions

Q1. What makes BERTopic different from traditional topic modeling methods?

A. It uses semantic embedding instead of word frequency, allowing it to capture context and meaning more effectively.

Q2. How does BERTopic determine the number of topics?

A. It uses HDBSCAN clustering, which automatically finds the natural number of topics without any predefined input.

Q3. What is the main limitation of BERTopic?

A. It is computationally expensive due to generating embeddings, especially for large datasets.

Janvi Kumari

Hi, I'm Janvi, a data science enthusiast currently working at Analytics Vidhya. My journey into the world of data began with a deep curiosity about how to extract valuable insights from complex datasets.
