Machine Learning

Topic Model Labelling with LLMs | Towards Data Science

By: Petr Koráb*, Martin Feldkircher**, Viktoriya Teliha***

Labelling the topics produced by topic models requires domain experience and can become a bottleneck, especially when the number of topics grows large, so it is convenient to generate human-readable labels with an LLM. Simply copying and pasting the model output into UIs like chatgpt.com, however, is a black box and hard to reproduce. The better choice is to integrate the labeller directly in the code, which gives the engineer additional control over the results and ensures reproducibility. This tutorial will examine in detail:

  • How to train a topic model with the new Turftopic Python package
  • How to label the results of the topic model with GPT-4o-mini.

We will train the cutting-edge FASTopic model by Xiaobao Wu et al. [3], presented at last year's NeurIPS. This model outperforms competing models, such as BERTopic, in several metrics (e.g., topic diversity) and has broad applications in business intelligence.

1. Topic Labelling in the Topic Modeling Pipeline

Labelling is an integral part of the topic modeling pipeline because it connects the model results with real-world decisions. The model assigns a number to each topic, but business decisions rely on a human-readable label summarizing the common words in each topic. Labels are generally written by (1) human labellers with domain experience, usually following a well-defined labelling strategy, (2) LLMs, and (3) commercial tools. The path from raw data to decision-making with a topic model is displayed in Figure 1.

Figure 1. Steps of the topic modeling pipeline.
Source: adapted and extended from Kardos et al. [2].

The pipeline begins with raw, unprocessed data that is cleaned and passed to the topic model. The model returns topics identified by numbers, each containing its top words (unigrams or bigrams). The labelling layer then replaces the numeric topic IDs with text labels. The model user (product manager, customer care department, etc.) then works with the labelled topics in decision-making. The following sections walk through this pipeline step by step.

2. Data

We will use FASTopic to classify customer care email data into 10 topics. The example use case relies on a synthetic dataset of customer care emails, licensed under the GPL-3 license. The dataset includes 692 incoming emails to a customer care department and looks like this:

Figure 2. Customer care email data. Image by authors.

2.1. Data preprocessing

The text data is cleaned in six steps. Numbers are removed first, followed by emojis. English stopwords are removed next, followed by punctuation. Additional tokens (such as company and personal names) are removed in the next step, before lemmatization. Learn more about text preprocessing for topic models in our previous tutorial.
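The six cleaning steps above can be sketched in plain Python. This is a minimal illustration, not the authors' actual pipeline: the stopword list, custom-token list, and lemma map below are toy stand-ins for real resources such as NLTK stopwords or spaCy lemmatization.

```python
import re
import string

# Toy stand-ins for real cleaning resources (hypothetical values)
STOPWORDS = {"the", "a", "an", "is", "was", "to", "my", "i"}
CUSTOM_TOKENS = {"acme", "john"}                        # company / personal names
LEMMAS = {"orders": "order", "deliveries": "delivery"}  # tiny lemma map

def clean(text: str) -> str:
    text = re.sub(r"\d+", " ", text)                    # 1. remove numbers
    text = re.sub(r"[^\x00-\x7F]+", " ", text)          # 2. remove emojis / non-ASCII
    tokens = [t for t in text.lower().split()
              if t.strip(string.punctuation) not in STOPWORDS]   # 3. stopwords
    tokens = [t.translate(str.maketrans("", "", string.punctuation))
              for t in tokens]                          # 4. remove punctuation
    tokens = [t for t in tokens if t and t not in CUSTOM_TOKENS]  # 5. custom tokens
    tokens = [LEMMAS.get(t, t) for t in tokens]         # 6. lemmatize
    return " ".join(tokens)

print(clean("My 2 orders from ACME were late!"))
```

A real pipeline would apply these steps with proper NLP tooling, but the order of operations is the point here.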

First, we read the cleaned data and create the corpus:

import pandas as pd

# Read data
data = pd.read_csv("data.csv", usecols=['message_clean'])

# Create corpus list
docs = data["message_clean"].tolist()
Figure 3. Recommended cleaning pipeline for topic models. Image by authors.

2.2. Bigram vectorization

Next, we build a bigram vectorizer to process the tokens as bigrams during model training. Bigram models provide more contextual information and identify qualities important for business decisions better than single-word models ("delivery" vs. "bad delivery", etc.).

from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(
    ngram_range=(2, 2),               # only bigrams
    max_features=1000                 # top 1000 bigrams by frequency
)

3. Model training

The FASTopic model is currently implemented in two Python packages:

  • FASTopic: the official package by Xiaobao Wu
  • Turftopic: a new Python package bringing many state-of-the-art topic models, including labelling with LLMs [2]

We will use the Turftopic implementation because of its direct link between the model and the namer that provides the LLM labelling.

Let's import the model and fit it to the data. It is important to set a random state to keep the training reproducible.

from turftopic import FASTopic

# Model specification
topic_size  = 10
model = FASTopic(n_components = topic_size,       # train for 10 topics
                 vectorizer = bigram_vectorizer,  # generate bigrams in topics
                 random_state = 32).fit(docs)     # set random state 

# Fit model to corpus
topic_data = model.prepare_topic_data(docs)

Now, let's prepare a dataframe with the 10 topics and the bigrams with the highest probability found by the model (code here).
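A dataframe like the one in Figure 4 can be assembled along these lines. The topic-term matrix and vocabulary below are toy stand-ins: in practice they come from the fitted model and vectorizer, and the names `topic_term` and `vocab` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the fitted model's topic-term probabilities and vocabulary
rng = np.random.default_rng(32)
vocab = np.array(["late delivery", "refund request", "wrong item", "order status"])
topic_term = rng.random((2, len(vocab)))   # 2 toy topics over 4 bigrams

top_k = 3
rows = []
for topic_id, probs in enumerate(topic_term):
    top_idx = np.argsort(probs)[::-1][:top_k]      # indices of the top-k bigrams
    for rank, i in enumerate(top_idx, start=1):
        rows.append({"topic_id": topic_id, "word_rank": rank, "bigram": vocab[i]})

topics_df = pd.DataFrame(rows)
print(topics_df)
```

The result is a long-format table of the highest-probability bigrams per topic, ready to pivot into the wide layout shown in the figure.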

Figure 4. Topics discovered by FASTopic. Image by authors.

4. Topic labelling

In the next step, we add text labels to the topic IDs with GPT-4o-mini.

In this code, we name the topics and add a new column topic_name to the dataframe.

from turftopic.namers import OpenAITopicNamer
import os

# OpenAI API key to access GPT-4o-mini
os.environ["OPENAI_API_KEY"] = ""   

# use Namer to label topic model with LLM
namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)

# create a dataframe with labelled topics
topics_df = model.topics_df()
topics_df.columns = ['topic_id', 'topic_name', 'topic_words']

# split and explode
topics_df['topic_word'] = topics_df['topic_words'].str.split(',')
topics_df = topics_df.explode('topic_word')
topics_df['topic_word'] = topics_df['topic_word'].str.strip()

# add a rank for each word within a topic
topics_df['word_rank'] = topics_df.groupby('topic_id').cumcount() + 1

# pivot to wide format
wide = topics_df.pivot(index='word_rank', 
                       columns=['topic_id', 'topic_name'], values='topic_word')

Here is the table with the topics labelled by GPT-4o-mini after some additional transformation. It would be interesting to compare the LLM's labels with those of a human labeller familiar with the company's processes and customer support agenda. We don't have data for such an exercise, so let's rely on GPT-4o-mini.

Figure 5. Topics labelled with GPT-4o-mini. Image by authors.

We can also visualize the labelled topics for better presentation. A bigram word cloud visualization of the topics produced by the model is here.

Figure 6. Bigram word clouds of the labelled topics. Image by authors.

Summary

  • The new Turftopic Python package connects state-of-the-art topic models with LLMs that generate human-readable topic labels.
  • The main benefits are: (1) independence from domain experts, (2) scalability to models with a large number of topics, which human labellers may struggle with, and (3) full control over the labelling step in code.
  • Topic labelling with LLMs has a variety of applications across domains. Read our latest paper on topic modeling of central bank communication, where GPT-4 labelled a FASTopic model [1].
  • Labels differ slightly in each training run, even with a random state set. This is not caused by the namer, but by randomness in the model: the output bigrams have probabilities that differ by tiny amounts between runs, so each training places a few different words in the top 10, which in turn affects the LLM labeller.
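A toy illustration of the last point: two nearly identical probability vectors, differing by 0.01 on two bigrams, already disagree on the top-2 set (all values are made up).

```python
import numpy as np

vocab = ["late delivery", "refund request", "wrong item", "order status"]

# Two training runs produce almost identical bigram probabilities;
# the tiny perturbation is enough to swap one top-2 slot.
run_a = np.array([0.40, 0.31, 0.30, 0.20])
run_b = np.array([0.40, 0.30, 0.31, 0.20])   # 0.01 shift between two bigrams

top2_a = [vocab[i] for i in np.argsort(run_a)[::-1][:2]]
top2_b = [vocab[i] for i in np.argsort(run_b)[::-1][:2]]

print(top2_a)
print(top2_b)
```

Since the LLM labeller only sees the top words, a change in that set can change the generated label, even though the underlying model is essentially the same.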

Details and the complete code for this tutorial are here.

Petr Koráb is a senior analyst with more than eight years of experience in Business Intelligence and NLP.

Subscribe to our blog to get the latest news from the NLP industry!

References

[1] Feldkircher, M., Koráb, P., Teliha, V. (2025). What Is on the Mind of Central Banks? Evidence from the BIS

[2] Kardos, M., Enevoldsen, K. C., Kostkan, J., Kristensen-McLachlan, R. D., Rocca, R. (2025). Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers. Journal of Open Source Software, 10(111), 8183.

[3] Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., Luu, A. T. (2024). FASTopic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model. arXiv preprint arXiv:2405.17978.
