How to Fine-tune SLM for Emotion Recognition

1. Introduction
Many small language models (SLMs) are fine-tuned to classify sentiment as a single output that captures the overall emotional tone of a text. In many use cases, the positive-negative distinction does not tell the complete story that a company needs. Emotion-aware models go a step further, splitting sentiment into emotion categories (“anger”, “approval”, “disappointment”, etc.) and assigning probabilities to a set of emotions in a text. This makes it possible to model the emotional content of the data a company receives (customer tickets, emails, product-related conversations) and to react quickly to changing situations.
In one of our latest projects, modeling emotions in online media, we needed an open-weight emotion recognition model with a permissive license, which keeps a high level of transparency and, of course, benefits from the low costs associated with open models. We have a modest preference for European models, but Hugging Face did not offer a Mistral-based alternative with an up-to-date model card. One possible reason is that the most detailed training set for emotion recognition, the GoEmotions dataset with 28 labels (27 emotions plus “neutral”), is highly class-imbalanced. Fine-tuning an SLM on it to obtain a high-quality model that performs well in evaluation requires careful attention.
We tackled the class imbalance problem through a combination of three strategies: (1) undersampling the most represented emotion category, (2) synthetically oversampling the small classes with the ISMOTE algorithm published in 2025 in Scientific Reports [1], and (3) weighting the loss function. With this combination of techniques, MistralSmall-3.1.GoEmotions, now released on Hugging Face, recognizes most of the target emotions relevant to our project with F1 > 0.7.
This article describes in detail how to fine-tune an open-weight SLM. You will also learn:
- How to preprocess imbalanced data for LLM fine-tuning with the 2025 ISMOTE algorithm.
- How to split sentiment into emotion categories by adapting a small language model to recognize emotions in text data.
2. Data
GoEmotions is a human-annotated dataset of 58k Reddit comments extracted from English-language subreddits and labeled with 27 emotion categories plus the “neutral” label. It is a multi-label classification dataset in which each comment can be labeled TRUE for multiple emotions (e.g., the comment “Hitting me. That just added some fun to it even though I wasn't trying to hit him” is TRUE for both “amusement” and “anger”).
The dataset is released on TensorFlow Datasets under the Apache 2.0 License and contains 54,263 labeled documents. Here's how it looks:
After a quick inspection, we can see a high degree of imbalance in the data, with the neutral category dominating:
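A quick inspection of this kind can be sketched in a few lines. The snippet below is a minimal illustration, not the code used in the project: the label names and the tiny multi-hot matrix `y` are hypothetical stand-ins for the real GoEmotions label columns.

```python
import numpy as np

# Hypothetical label names and multi-hot matrix (rows = comments, columns = labels)
labels = ["amusement", "anger", "neutral"]
y = np.array([
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
])

def label_counts(y, names):
    """Return (name, count, share of rows) tuples sorted by count, descending."""
    counts = y.sum(axis=0)
    order = np.argsort(-counts, kind="stable")
    return [(names[i], int(counts[i]), float(counts[i] / len(y))) for i in order]

for name, count, share in label_counts(y, labels):
    print(f"{name:<10} {count:>3} {share:.0%}")
```

Because the data is multi-label, the shares are computed per label over all rows and do not need to sum to 100%.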

3. Training set preprocessing
Our goal is to build a classifier that recognizes 15 emotions in plain-language text. Training on class-imbalanced data may introduce bias, as the fine-tuned model tends to favor the majority class and perform poorly on the minority classes, so preprocessing is important.
We applied a combination of methods to the training set only (the validation and test sets remain unchanged) to address class imbalance and increase performance on the target emotions (fear, sadness, disgust, disapproval, annoyance, anger, disappointment, optimism, amusement, surprise, admiration, excitement, confusion, joy, love):
- We undersampled the data by randomly dropping “neutral” rows.
- We performed synthetic oversampling of the most underrepresented emotion categories using ISMOTE (Improved Synthetic Minority Over-sampling Technique).
The ISMOTE algorithm extends the standard SMOTE method by (1) expanding the sample generation space and (2) optimizing the sample distribution. The synthetically generated samples then have a more realistic data distribution than those generated by the original method.

By undersampling the majority category and synthetically oversampling the minority categories to 4000 samples, we created a relatively balanced fine-tuning set. More sample ISMOTE code is here.
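To make the two preprocessing steps concrete, here is a minimal sketch on toy data. It is an illustration under stated assumptions: it uses the standard SMOTE interpolation rather than ISMOTE (whose expanded generation space is not reproduced here), it treats the problem as single-label for simplicity, and the function names `undersample` and `smote` are hypothetical.

```python
import numpy as np

def undersample(X, y, majority, n_keep, rng):
    """Randomly keep at most n_keep rows of the majority class."""
    maj_idx = np.flatnonzero(y == majority)
    keep_maj = rng.choice(maj_idx, size=min(n_keep, len(maj_idx)), replace=False)
    keep = np.sort(np.concatenate([np.flatnonzero(y != majority), keep_maj]))
    return X[keep], y[keep]

def smote(X_min, n_new, k, rng):
    """Standard SMOTE (not ISMOTE): interpolate between a minority sample
    and one of its k nearest minority-class neighbours."""
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                  # position along the connecting segment
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy data: 50 "neutral" rows and 10 "fear" rows with 4-dimensional features
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.array(["neutral"] * 50 + ["fear"] * 10)

X_u, y_u = undersample(X, y, "neutral", n_keep=20, rng=rng)
X_fear = X_u[y_u == "fear"]
X_new = smote(X_fear, n_new=10, k=3, rng=rng)

X_bal = np.vstack([X_u, X_new])
y_bal = np.concatenate([y_u, np.array(["fear"] * len(X_new))])
print({c: int((y_bal == c).sum()) for c in ("neutral", "fear")})
```

Each synthetic point lies on the segment between two real minority samples, which is the restriction that ISMOTE relaxes by expanding the sample generation space.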

4. SLM fine-tuning
Among the Mistral models, we chose one from the Small class (Mistral-Small-3.1-24B-Instruct-2503), which fits our GPU and provides the multilingual capabilities we need for the classifier. The Unsloth framework makes the fine-tuning steps easier and faster than plain Transformers:
1. Data loading – loading the preprocessed training, validation, and test sets. We use a 60:20:20 split.
2. Loading the base model – loading Mistral-Small-3.1-24B-Instruct-2503 into memory.
3. Applying LoRA – reducing hardware requirements.
4. Multilabel wrapper with focal loss function – adapts the model for multi-label classification. It also adds a weighted focal loss function that up-weights the loss for a selected set of emotions, prioritizing their performance.
5. Evaluation metrics and training args – specifying evaluation metrics and model training parameters.
6. Model training – creating and running the trainer.
7. Testing – evaluating the performance of the best model on the test set.
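The 60:20:20 split from step 1 can be sketched as follows. This is a minimal illustration with a hypothetical helper `split_indices` and an arbitrary seed; the split is done before any resampling so that the validation and test sets stay untouched.

```python
import numpy as np

def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle row indices and cut them into train / validation / test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Only the rows selected by `train_idx` would then go through undersampling and oversampling.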
4.1. Coding
Here is the code implementation.
4.1.1. Data loading
import json

from datasets import Dataset

# Load the preprocessed train, validation and test sets
BASE = r"augmented"

def load_split(path: str) -> Dataset:
    # Each JSON file holds the input embeddings ("X") and the label vectors ("y")
    with open(path, encoding="utf-8") as f:
        d = json.load(f)
    return Dataset.from_dict({"input_embeds": d["X"], "labels": d["y"]})

train_dataset = load_split(f"{BASE}/train.json")
val_dataset = load_split(f"{BASE}/val.json")
test_dataset = load_split(f"{BASE}/test.json")

# Infer the embedding dimension from the first training example
EMBED_DIM = len(train_dataset[0]["input_embeds"])

# Return PyTorch tensors
train_dataset.set_format("torch")
val_dataset.set_format("torch")
test_dataset.set_format("torch")
4.1.2. Loading the base model
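import torch
from unsloth import FastLanguageModel

# Load the base model with Unsloth's FastLanguageModel
MODEL_NAME = "unsloth/Mistral-Small-3.1-24B-Instruct-2503"
base_model, _ = FastLanguageModel.from_pretrained(  # the tokenizer is not needed
    model_name=MODEL_NAME,
    max_seq_length=2,      # the wrapper feeds one projected embedding per example
    load_in_4bit=True,     # 4-bit quantization to fit the model on the GPU
    dtype=torch.bfloat16,
)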
4.1.3. Applying LoRA
# Apply low-rank adaptation (LoRA)
base_model = FastLanguageModel.get_peft_model(
    base_model,
    r=16,                  # rank of the low-rank update matrices
    lora_alpha=32,         # scaling numerator (update is scaled by lora_alpha / r)
    lora_dropout=0,
    bias="none",
    target_modules=[       # attention and MLP projections that receive adapters
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
)
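To see why LoRA reduces hardware requirements, it helps to count parameters. For one linear layer with a weight of shape d_out × d_in, a rank-r adapter adds two small matrices with r × (d_in + d_out) trainable parameters in total, and with use_rslora=False the update is scaled by lora_alpha / r. The sketch below works through this arithmetic; the 4096 × 4096 layer shape is a hypothetical example, not the actual dimension of the Mistral layers.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scaling applied to the low-rank update (rank-stabilized LoRA divides by sqrt(r))."""
    return lora_alpha / (r ** 0.5 if use_rslora else r)

d_in, d_out = 4096, 4096          # hypothetical layer shape, for illustration only
full = d_in * d_out               # parameters in the frozen weight matrix
extra = lora_extra_params(d_in, d_out, r=16)

print(f"frozen: {full:,}  trainable: {extra:,}  share: {extra / full:.2%}")
print("scaling:", lora_scaling(lora_alpha=32, r=16))
```

With r=16 and lora_alpha=32 as in the configuration above, the scaling factor is 2.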
4.1.4. Multilabel wrapper with focal loss function
import torch
from torch import nn

# EMOTION_LABELS is assumed to be the ordered list of label names matching the
# columns of `labels`; its definition is not shown in this article.

# Focal loss weights for preferred labels
FOCAL_ALPHA_DEFAULT = 0.25
FOCAL_ALPHA_PREFERRED = 0.75
PREFERRED_LABELS = {
    "fear", "sadness", "disgust", "disapproval", "annoyance",
    "anger", "disappointment", "optimism", "amusement", "surprise",
    "admiration", "excitement", "confusion", "joy", "love",
}
FOCAL_ALPHA_PER_LABEL: list[float] = [
    FOCAL_ALPHA_PREFERRED if lbl in PREFERRED_LABELS else FOCAL_ALPHA_DEFAULT
    for lbl in EMOTION_LABELS
]

class FocalLossWithAlpha(nn.Module):
    """Per-label weighted focal binary cross-entropy for multi-label problems."""

    def __init__(self, alpha: list[float], gamma: float = 2.0):
        super().__init__()
        self.register_buffer("alpha", torch.tensor(alpha, dtype=torch.float32))
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(logits)
        # p_t: probability assigned to the true outcome of each label
        p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
        # alpha_t: alpha for positive targets, (1 - alpha) for negative targets
        alpha_t = self.alpha * targets + (1.0 - self.alpha) * (1.0 - targets)
        # Down-weight easy examples (p_t close to 1)
        focal_w = alpha_t * (1.0 - p_t) ** self.gamma
        bce = nn.functional.binary_cross_entropy_with_logits(
            logits, targets, reduction="none"
        )
        return (focal_w * bce).mean()

# Multilabel classification wrapper with focal loss class weighting
class MistralForMultiLabel(nn.Module):
    is_loaded_in_4bit = True

    def __init__(self, backbone: nn.Module, num_labels: int,
                 hidden_size: int, embed_dim: int):
        super().__init__()
        self.backbone = backbone
        _device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # Map input embeddings to the backbone's hidden size
        self.projection = nn.Sequential(
            nn.Linear(embed_dim, hidden_size // 2),
            nn.GELU(),
            nn.Linear(hidden_size // 2, hidden_size),
        ).to(_device)
        self.dropout = nn.Dropout(0.1).to(_device)
        self.classifier = nn.Linear(hidden_size, num_labels).to(_device)
        self.focal_loss = FocalLossWithAlpha(FOCAL_ALPHA_PER_LABEL).to(_device)

    def gradient_checkpointing_enable(self, gradient_checkpointing_kwargs=None):
        self.backbone.gradient_checkpointing_enable(gradient_checkpointing_kwargs)

    def gradient_checkpointing_disable(self):
        self.backbone.gradient_checkpointing_disable()

    def forward(
        self,
        input_embeds: torch.Tensor,
        labels: torch.Tensor | None = None,
        **kwargs,
    ):
        B = input_embeds.size(0)
        # Treat each projected embedding as a one-token sequence
        projected = self.projection(input_embeds).unsqueeze(1)
        attn_mask = torch.ones(B, 1, device=input_embeds.device)
        outputs = self.backbone.base_model.model.model(
            inputs_embeds=projected,
            attention_mask=attn_mask,
            output_hidden_states=True,
        )
        # Last-layer hidden state of the single position is the pooled representation
        pooled = outputs.hidden_states[-1][:, 0, :]
        logits = self.classifier(self.dropout(pooled))
        loss = self.focal_loss(logits, labels.float()) if labels is not None else None
        return {"loss": loss, "logits": logits}
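A small worked example may help to read the focal loss above. The sketch below recomputes the same formula for a single label in plain Python (the helper name `focal_bce` is hypothetical) and compares an easy and a hard positive example under the preferred and default alpha values.

```python
import math

def focal_bce(logit: float, target: int, alpha: float, gamma: float = 2.0) -> float:
    """Focal binary cross-entropy for one label, mirroring FocalLossWithAlpha."""
    p = 1.0 / (1.0 + math.exp(-logit))           # sigmoid
    p_t = p if target == 1 else 1.0 - p          # probability of the true outcome
    alpha_t = alpha if target == 1 else 1.0 - alpha
    bce = -math.log(p_t)
    return alpha_t * (1.0 - p_t) ** gamma * bce

easy = focal_bce(logit=3.0, target=1, alpha=0.75)           # confident and correct
hard = focal_bce(logit=-3.0, target=1, alpha=0.75)          # confident and wrong
hard_default = focal_bce(logit=-3.0, target=1, alpha=0.25)  # same error, default alpha

print(f"easy positive: {easy:.6f}")
print(f"hard positive: {hard:.6f}")
print(f"preferred / default weight on a positive: {hard / hard_default:.1f}")
```

The (1 - p_t) ** gamma factor shrinks the contribution of examples the model already gets right, while alpha = 0.75 makes a missed positive of a preferred emotion count three times as much as the same miss under the default alpha = 0.25.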
4.1.5. Evaluation metrics and training args
import torch
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from transformers import TrainingArguments

# Specify the evaluation function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Independent sigmoid per label, thresholded at 0.5
    probs = torch.sigmoid(torch.tensor(logits)).numpy()
    preds = (probs >= 0.5).astype(int)
    labels = labels.astype(int)

    # Aggregate metrics (exact accuracy requires all labels of a row to match)
    exact_accuracy = accuracy_score(labels, preds)
    macro_f1 = f1_score(labels, preds, average="macro", zero_division=0)
    micro_f1 = f1_score(labels, preds, average="micro", zero_division=0)
    macro_precision = precision_score(labels, preds, average="macro", zero_division=0)
    macro_recall = recall_score(labels, preds, average="macro", zero_division=0)

    # Per-class metrics
    per_class_f1 = f1_score(labels, preds, average=None, zero_division=0)
    per_class_recall = recall_score(labels, preds, average=None, zero_division=0)
    per_class_precision = precision_score(labels, preds, average=None, zero_division=0)
    per_class_accuracy = (preds == labels).mean(axis=0)

    per_class_metrics = {}
    for i, emotion in enumerate(EMOTION_LABELS):
        per_class_metrics[f"f1_{emotion}"] = float(per_class_f1[i])
        per_class_metrics[f"recall_{emotion}"] = float(per_class_recall[i])
        per_class_metrics[f"precision_{emotion}"] = float(per_class_precision[i])
        per_class_metrics[f"accuracy_{emotion}"] = float(per_class_accuracy[i])

    return {
        "exact_accuracy": exact_accuracy,
        "macro_f1": macro_f1,
        "micro_f1": micro_f1,
        "macro_precision": macro_precision,
        "macro_recall": macro_recall,
        **per_class_metrics,
    }

OUTPUT_DIR = "outputs"  # placeholder path (assumption); the original value is not shown

# Specify hyperparameters
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,               # where checkpoints and logs are written
    eval_strategy="epoch",               # run evaluation once per epoch
    save_strategy="epoch",               # save checkpoint once per epoch
    per_device_train_batch_size=8,       # samples per GPU per step
    per_device_eval_batch_size=16,       # larger batch is fine, no gradients
    gradient_accumulation_steps=4,       # effective batch = 8 × 4 = 32
    num_train_epochs=15,                 # total passes over the training data
    learning_rate=1e-4,                  # peak LR after warmup
    bf16=True,                           # bfloat16 mixed precision
    optim="adamw_8bit",                  # 8-bit AdamW
    warmup_ratio=0.05,                   # first 5 % of steps ramp LR from 0 to peak
    lr_scheduler_type="cosine",          # cosine decay from peak LR to ~0
    logging_steps=25,                    # print loss/LR to console every 25 steps
    logging_first_step=True,             # also log step 1 to catch early instability
    load_best_model_at_end=True,         # restore best checkpoint after training ends
    metric_for_best_model="macro_f1",    # criterion used to select the best checkpoint
    greater_is_better=True,              # higher macro_f1 is better in evaluation
    gradient_checkpointing=False,        # Trainer-level checkpointing off (Unsloth's is set in get_peft_model)
    remove_unused_columns=False,         # keep input_embeds column
    save_total_limit=15,                 # keep all checkpoints on disk to load the best model
    weight_decay=0.01,                   # L2 regularisation on all trainable parameters
)
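The difference between the aggregate metrics is easiest to see on a toy example. The sketch below is a NumPy-only illustration on made-up logits (the helper `f1_scores` is hypothetical and is not part of the training code): it applies the same sigmoid and 0.5 threshold as compute_metrics and derives per-class, macro and micro F1 together with exact-match accuracy.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f1_scores(logits, labels, threshold=0.5):
    """Return per-class F1, macro F1, micro F1 and exact-match accuracy."""
    preds = (sigmoid(logits) >= threshold).astype(int)
    tp = ((preds == 1) & (labels == 1)).sum(axis=0)
    fp = ((preds == 1) & (labels == 0)).sum(axis=0)
    fn = ((preds == 0) & (labels == 1)).sum(axis=0)
    denom = 2 * tp + fp + fn
    # F1 is set to 0 for classes with no positives and no predictions
    per_class = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    micro_denom = 2 * tp.sum() + fp.sum() + fn.sum()
    micro = 2 * tp.sum() / micro_denom if micro_denom > 0 else 0.0
    exact = (preds == labels).all(axis=1).mean()
    return per_class, float(per_class.mean()), float(micro), float(exact)

# Made-up logits and multi-hot labels for 4 texts and 2 emotions
logits = np.array([[2.0, -1.0], [1.5, 0.5], [-2.0, -0.5], [0.3, -3.0]])
labels = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])

per_class, macro_f1, micro_f1, exact_acc = f1_scores(logits, labels)
print(per_class, macro_f1, micro_f1, exact_acc)
```

Macro F1 averages the per-class scores, so every emotion counts equally regardless of its frequency, which is why it is a reasonable selection criterion for imbalanced data; micro F1 pools all label decisions and is dominated by the frequent classes.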
4.1.6. Model training
import os

import torch
from transformers import Trainer
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR

# Set up the trainer for multilabel fine-tuning
class MultiLabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # The wrapper computes the focal loss itself, so only pass the labels through
        labels = inputs.pop("labels")
        outputs = model(**inputs, labels=labels)
        loss = outputs["loss"]
        return (loss, outputs) if return_outputs else loss

    def _save_checkpoint(self, model, trial, metrics=None):
        super()._save_checkpoint(model, trial)
        # _get_output_dir() returns the run directory; the checkpoint lives in its
        # "checkpoint-<step>" subfolder, which is where _load_best_model() looks
        ckpt_dir = os.path.join(
            self._get_output_dir(trial),
            f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}",
        )
        os.makedirs(ckpt_dir, exist_ok=True)
        # Save head
        torch.save({
            "projection": model.projection.state_dict(),
            "classifier": model.classifier.state_dict(),
        }, os.path.join(ckpt_dir, "head_weights.pt"))
        # Save LoRA adapter explicitly (bypasses bitsandbytes serialization issues)
        model.backbone.save_pretrained(os.path.join(ckpt_dir, "lora_adapter"))

    def _load_best_model(self):
        best_ckpt = self.state.best_model_checkpoint
        if not best_ckpt:
            return
        # Restore head
        head_path = os.path.join(best_ckpt, "head_weights.pt")
        if os.path.exists(head_path):
            head = torch.load(head_path, map_location="cpu")
            self.model.projection.load_state_dict(head["projection"])
            self.model.classifier.load_state_dict(head["classifier"])
            print(f"Head restored from: {best_ckpt}")
        else:
            print(f"WARNING: head_weights.pt not found in {best_ckpt}")
        # Restore LoRA adapter
        lora_path = os.path.join(best_ckpt, "lora_adapter")
        if os.path.exists(lora_path):
            self.model.backbone.load_adapter(lora_path, adapter_name="default")
            print(f"LoRA restored from: {best_ckpt}")
        else:
            print(f"WARNING: lora_adapter/ not found in {best_ckpt}")

# `model` is assumed to be a MistralForMultiLabel instance wrapping `base_model`;
# its construction is not shown in this article.

# Create the trainer
trainer = MultiLabelTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

# Launch training
trainer.train()
The 15-epoch fine-tuning took 9 hours and 30 minutes on a machine with an NVIDIA RTX 6000 GPU and 192 GB of VRAM, with the best model loaded at the end.
4.1.7. Model testing
Let's demonstrate the performance on the test dataset. The standard test metrics for each class are F1, precision, and recall. We can see good performance on the target emotions, with F1 scores above 0.7 in most cases. Full results are on the model card.
| Emotion | Precision | Recall | F1 | N |
| admiration | 0.7415 | 0.6354 | 0.6844 | 993 |
| amusement | 0.7810 | 0.7422 | 0.7611 | 543 |
| anger | 0.7423 | 0.7367 | 0.7395 | 395 |
| annoyance | 0.7049 | 0.5452 | 0.6148 | 609 |
| confusion | 0.7576 | 0.8251 | 0.7899 | 303 |
| disappointment | 0.8487 | 0.8459 | 0.8473 | 305 |
| disapproval | 0.7208 | 0.5841 | 0.6453 | 517 |
| disgust | 0.8396 | 0.9368 | 0.8856 | 190 |
| excitement | 0.8240 | 0.9366 | 0.8767 | 205 |
| fear | 0.9112 | 0.9686 | 0.9390 | 159 |
| joy | 0.7577 | 0.8024 | 0.7794 | 339 |
| love | 0.7424 | 0.7903 | 0.7656 | 496 |
| optimism | 0.8145 | 0.7636 | 0.7882 | 368 |
| sadness | 0.8534 | 0.8899 | 0.8713 | 327 |
| surprise | 0.8456 | 0.8555 | 0.8505 | 256 |
| Macro precision | 0.8295 | | | |
| Macro recall | 0.8184 | | | |
| Micro F1 | 0.7527 | | | |
| Macro F1 | 0.8215 | | | |
5. Summary
Now let's summarize the key points of the article. Requirements and full code are in this repo.
- Emotion recognition modeling extends sentiment analysis by splitting the overall sentiment score into its component emotions.
- MistralSmall-3.1.GoEmotions is openly available on Hugging Face under the Apache 2.0 license. The repo also includes an inference guide.
- Use cases include brand and social media monitoring and email segmentation.
Petr Koráb is the founder of Text Mining Stories, a Prague-based development and consulting company. Read more about advanced NLP on our blog.
AI statement. Some parts of the code were updated with Sonnet 4.6 (Cursor). No text was generated using AI.
Acknowledgments. The National Bank of Slovakia Foundation supported this development. I thank Martin Feldkircher, Václav Jež, and Michala Moravcová for comments and suggestions.
References
[1] Ying Li, Yali Yang, Peihua Song, Lian Duan, Rui Ren. 2025. An improved SMOTE algorithm for enhanced imbalanced data classification by expanding sample generation space. Scientific Reports, 15 (23521).
[2] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 2020. Multilingual Denoising Pre-training for Neural Machine Translation. Transactions of the Association for Computational Linguistics, 8, pp. 726-742.



