Fine-tune Topic Modeling for Your Use Case with BERTopic

Topic modeling remains a critical tool in the AI and NLP toolbox. While large language models (LLMs) handle text well, extracting high-quality topics from large datasets still calls for dedicated topic modeling strategies. The typical pipeline consists of four basic steps: embeddings, dimensionality reduction, clustering, and topic representation.
This is where BERTopic shines: each step of the pipeline is modular, and its API makes every component easy to swap and tune. In this post, I will walk through practical changes you can make to improve clustering and representation quality, using the 20 Newsgroups dataset, distributed under the Creative Commons Attribution 4.0 license.
We will start with the default settings recommended by BERTopic and iteratively update the configuration, comparing the results at each step. Along the way, I will explain the purpose of each module and how it can inform your decisions when tuning on your own data.
Preparing the dataset
We load a random sample of 500 documents from the 20 Newsgroups dataset.
import random
from datasets import load_dataset
dataset = load_dataset("SetFit/20_newsgroups")
random.seed(42)
text_label = list(zip(dataset["train"]["text"], dataset["train"]["label_text"]))
text_label_500 = random.sample(text_label, 500)
Since the documents come from Usenet discussions, we apply cleaning steps to strip headers and quoted replies, delete clutter, and keep only informative sentences. This ensures high-quality embeddings and a smoother downstream process.
import re
def clean_for_embedding(text, max_sentences=5):
    lines = text.split("\n")
    # Drop quoted replies and newsgroup header lines
    lines = [line for line in lines if not line.strip().startswith(">")]
    lines = [line for line in lines if not re.match(
        r"^\s*(from|subject|organization|lines|writes|article)\s*:", line, re.IGNORECASE)]
    text = " ".join(lines)
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"[!?]{3,}", "", text)  # strip runs of !!! / ???
    sentence_split = re.split(r"(?<=[.!?]) +", text)
    sentence_split = [
        s for s in sentence_split
        if len(s.strip()) > 15 and not s.strip().isupper()
    ]
    return " ".join(sentence_split[:max_sentences])
texts_clean = [clean_for_embedding(text) for text,_ in text_label_500]
labels = [label for _, label in text_label_500]
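To see the sentence-level filtering in isolation, here is a small self-contained sketch of the splitting and filtering rules used inside the cleaner (with a made-up sample string):

```python
import re

# Split after ., ! or ? followed by a space, then drop short or all-caps
# sentences, mirroring the filter inside clean_for_embedding.
text = "First sentence here. SECOND IN CAPS! Short. A third informative sentence?"
parts = re.split(r"(?<=[.!?]) +", text)
kept = [s for s in parts if len(s.strip()) > 15 and not s.strip().isupper()]
print(kept)
# keeps only the first and last sentences
```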
First BERTopic pipeline
Using BERTopic's modular design, we configure each component: SentenceTransformer for embeddings, UMAP for dimensionality reduction, HDBSCAN for clustering, and CountVectorizer plus a KeyBERTInspired model for the textual representation.
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=10, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")
# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()
# Step 6 - (Optional) Fine-tune topic representations with
# a `bertopic.representation` model
representation_model = KeyBERTInspired()
# All steps together
topic_model = BERTopic(
    embedding_model=embedding_model,          # Step 1 - Extract embeddings
    umap_model=umap_model,                    # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
    representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)
topics, probs = topic_model.fit_transform(texts_clean)
This setup yields only a few broad topics with noisy keywords. The outcome highlights the need for parameter tuning to obtain granular, interpretable results.
Parameter tuning for granular topics
n_neighbors in the UMAP module
UMAP is the module that reduces the original high-dimensional embeddings to a lower-dimensional space before clustering. The n_neighbors parameter controls how much local versus global structure UMAP preserves during the reduction. Lowering this number emphasizes local structure, which tends to produce smaller, tighter clusters and more distinct topics.
umap_model_new = UMAP(n_neighbors=5, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model.umap_model = umap_model_new
topics, probs = topic_model.fit_transform(texts_clean)
topic_model.get_topic_info()

min_cluster_size and cluster_selection_method in the HDBSCAN module
HDBSCAN is the default clustering module in BERTopic. Lowering min_cluster_size and switching cluster_selection_method to "leaf" reveals smaller, more focused clusters and a more balanced distribution of documents across topics.
hdbscan_model_leaf = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='leaf', prediction_data=True)
topic_model.hdbscan_model = hdbscan_model_leaf
topics, _ = topic_model.fit_transform(texts_clean)
topic_model.get_topic_info()
The number of clusters rises to 30 after setting cluster_selection_method to "leaf" and min_cluster_size to 5.

Controlling randomness for reproducibility
UMAP is not deterministic by default, meaning it can produce different results on each run unless you explicitly set random_state. This parameter is often left out of example code, so be sure to set it to make your runs reproducible.
Similarly, be careful if you use a third-party embedding API (such as OpenAI): some APIs can return slightly different embeddings across repeated calls. For reproducibility, cache the embeddings and feed them into BERTopic directly.
from bertopic.backend import BaseEmbedder
import numpy as np

class CustomEmbedder(BaseEmbedder):
    """Light-weight wrapper to call NVIDIA's embedding endpoint via the OpenAI SDK."""
    def __init__(self, embedding_model, client):
        super().__init__()
        self.embedding_model = embedding_model
        self.client = client

    def encode(self, documents):  # type: ignore[override]
        response = self.client.embeddings.create(
            input=documents,
            model=self.embedding_model,
            encoding_format="float",
            extra_body={"input_type": "passage", "truncate": "NONE"},
        )
        embeddings = np.array([embed.embedding for embed in response.data])
        return embeddings

# Assumes `client` is an OpenAI-compatible client pointed at the embedding endpoint
custom_embedder = CustomEmbedder(embedding_model="<your-embedding-model>", client=client)
# Compute the embeddings once, then pass the cached array to every fit
embeddings = custom_embedder.encode(texts_clean)
topic_model.embedding_model = custom_embedder
topics, probs = topic_model.fit_transform(texts_clean, embeddings=embeddings)
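To make the caching idea concrete in isolation, here is a minimal sketch: compute the embeddings once, save them to disk, and reload the exact same array on later runs. Random vectors stand in for real model output so the snippet runs on its own.

```python
import numpy as np

# Stand-in for real embeddings (e.g. from SentenceTransformer or an API);
# random vectors are used here only so the snippet is self-contained.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(500, 384)).astype(np.float32)

np.save("embeddings.npy", embeddings)  # first run: compute and cache
cached = np.load("embeddings.npy")     # later runs: reuse the exact same vectors

# The cached array is bit-for-bit identical, so clustering sees the same input
assert np.array_equal(embeddings, cached)
# topics, probs = topic_model.fit_transform(texts_clean, embeddings=cached)
```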
Each time your data domain changes, consider re-running these diagnostics and adjusting the configuration accordingly. For the rest of this tutorial, we will use the configuration that sets n_neighbors to 5, min_cluster_size to 5, and cluster_selection_method to "eom". This combination strikes a balance between cluster cohesion and coverage.
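Putting the chosen settings together, the tuned configuration looks like this (a sketch that reuses the `topic_model` and `texts_clean` objects defined earlier):

```python
from umap import UMAP
from hdbscan import HDBSCAN

# Combined tuned configuration: local structure (n_neighbors=5),
# small clusters allowed (min_cluster_size=5), "eom" cluster selection.
umap_tuned = UMAP(n_neighbors=5, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)
hdbscan_tuned = HDBSCAN(min_cluster_size=5, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)

topic_model.umap_model = umap_tuned
topic_model.hdbscan_model = hdbscan_tuned
topics, probs = topic_model.fit_transform(texts_clean)
```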
Improving topic representations
Representation plays an important role in making clusters interpretable. By default, BERTopic builds unigram-based representations, which often lack context. In the following sections, we examine several techniques to enrich these keywords and improve topic interpretability.
N-Gram range
In BERTopic, CountVectorizer is the default tool for converting text into bag-of-words representations. Instead of relying on plain unigrams, switch to bigrams or trigrams using the ngram_range parameter of CountVectorizer. This simple change adds much-needed context.
Since we are revising only the representation, BERTopic provides update_topics so we can avoid refitting the whole model.
topic_model.update_topics(texts_clean, vectorizer_model=CountVectorizer(stop_words="english", ngram_range=(2,3)))
topic_model.get_topic_info()

Custom tokenizer
Some bigrams are still hard to interpret, e.g. "486dx 50", "ak uk", "dxf doc". For finer control, a custom tokenizer can filter n-grams based on part-of-speech patterns. This removes noisy combinations and improves the quality of your keywords.
import spacy
from typing import List
class ImprovedTokenizer:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
        # Keep only the most meaningful syntactic bigram patterns
        self.MEANINGFUL_BIGRAMS = {
            ("ADJ", "NOUN"),
            ("NOUN", "NOUN"),
            ("VERB", "NOUN"),
        }

    def __call__(self, text: str, max_tokens=200) -> List[str]:
        doc = self.nlp(text[:3000])  # truncate long docs for speed
        tokens = [(t.text, t.lemma_.lower(), t.pos_) for t in doc if t.is_alpha]
        tokens = tokens[:max_tokens]
        bigrams = []
        for i in range(len(tokens) - 1):
            word1, lemma1, pos1 = tokens[i]
            word2, lemma2, pos2 = tokens[i + 1]
            if (pos1, pos2) in self.MEANINGFUL_BIGRAMS:
                # Lowercased lemmas normalize the vocabulary
                bigrams.append(f"{lemma1} {lemma2}")
        return bigrams
topic_model.update_topics(docs=texts_clean,vectorizer_model=CountVectorizer(tokenizer=ImprovedTokenizer()))
topic_model.get_topic_info()

LLM-based representations
Finally, you can bring in LLMs to generate titles or concise summaries for each topic. BERTopic supports OpenAI models directly through its representation module. These LLM-based labels are far easier to interpret.
import os
import openai
from bertopic.representation import OpenAI
client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
topic_model.update_topics(texts_clean, representation_model=OpenAI(client, model="gpt-4o-mini", delay_in_seconds=5))
topic_model.get_topic_info()
The topic representations are now short natural-language sentences instead of keyword lists.

You can also write your own function to generate an LLM title for each topic and write it back into the model using the set_topic_labels function. See the sample code below.
import os
import openai
from typing import Dict, List, Tuple
from tqdm import tqdm

def generate_topic_titles_with_llm(
    topic_model,
    docs: List[str],
    api_key: str,
    model: str = "gpt-4o"
) -> Dict[int, Tuple[str, str]]:
    client = openai.OpenAI(api_key=api_key)
    topic_info = topic_model.get_topic_info()
    topic_repr = {}
    topics = topic_info[topic_info.Topic != -1].Topic.tolist()
    for topic in tqdm(topics, desc="Generating titles"):
        indices = [i for i, t in enumerate(topic_model.topics_) if t == topic]
        if not indices:
            continue
        top_doc = docs[indices[0]]
        prompt = f"""You are a helpful summarizer for topic clustering.
Given the following text that represents a topic, generate:
1. A short **title** for the topic (2–6 words)
2. A one or two sentence **summary** of the topic.
Text:
{top_doc}
"""
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant for summarizing topics."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.5
            )
            output = response.choices[0].message.content.strip()
            lines = output.split("\n")
            title = lines[0].replace("Title:", "").strip()
            summary = lines[1].replace("Summary:", "").strip() if len(lines) > 1 else ""
            topic_repr[topic] = (title, summary)
        except Exception as e:
            print(f"Error with topic {topic}: {e}")
            topic_repr[topic] = ("[Error]", str(e))
    return topic_repr
topic_repr = generate_topic_titles_with_llm(topic_model, texts_clean, os.environ["OPENAI_API_KEY"])
# set_topic_labels expects a {topic_id: label} mapping of strings,
# so keep the generated title and fall back to a default label
topic_label_dict = {
    topic: topic_repr.get(topic, ("Topic", ""))[0]
    for topic in topic_model.get_topic_info()["Topic"]
}
topic_model.set_topic_labels(topic_label_dict)
Summary
This guide covered practical strategies for improving topic modeling results with BERTopic. By understanding the purpose of each module and tuning its parameters for your domain, you can obtain more focused, stable, and interpretable topics.
The quality of the representation matters just as much. Whether through n-grams, part-of-speech filtering, or LLM-generated labels, better representations make your topics easier to understand and act on.
BERTopic also offers advanced modeling strategies beyond the basics covered here. In the next post, we will explore those in more depth. Stay tuned!