How to Fine-Tune an SLM for Emotion Recognition

1. Introduction

Small language models (SLMs) for sentiment analysis are fine-tuned to classify sentiment as a single output that captures the overall emotional tone of a text. In many use cases, the positive-negative distinction does not tell the complete story a company needs. Emotion-aware models go a step further: they break sentiment down into emotion categories (“anger”, “approval”, “disappointment”, etc.) and assign probabilities to a set of emotions in a text. This makes it possible to model the emotional content of the data a company receives (customer tickets, emails, product-related conversations) and to react quickly to changing situations.

In one of our latest projects, modeling emotions in online media, we needed an open-weight emotion recognition model with a flexible license that offers a high level of transparency and, of course, benefits from the low costs associated with open models. We have a modest preference for European models, but Hugging Face did not offer a Mistral-based option with an up-to-date model card. One possible reason is that the most detailed training set for emotion recognition, the GoEmotions dataset with 28 labels, is highly class-imbalanced. Fine-tuning an SLM on this benchmark dataset so that it performs well in evaluation requires careful attention.

We tackled the problem of class imbalance through a combination of three strategies: (1) undersampling the most represented emotion category, (2) synthetically oversampling small classes with the ISMOTE algorithm published in 2025 in Nature's Scientific Reports [1], and (3) weighting the loss function. With this combination of techniques, MistralSmall-3.1.GoEmotions, now released on Hugging Face, recognizes most of the target emotions relevant to our project with F1 > 0.7.

This article describes in detail how to fine-tune an open-weight SLM. You will also learn:

  • How to pre-process imbalanced data for an LLM classification task with the 2025 ISMOTE algorithm.
  • How to break sentiment down into emotion categories by adapting a small language model to recognize emotions in text data.

2. Data

GoEmotions is a human-annotated dataset of 58k Reddit comments extracted from English-language subreddits and labeled with 27 emotion categories plus a “neutral” label. It is a multi-label classification dataset in which each comment can be labeled TRUE for multiple emotions (e.g., “Hitting me. That just added some fun to it even though I wasn't trying to hit him” is true for “amusement” and “anger”).
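To make the multi-label setup concrete, here is a minimal sketch of how one comment's emotions can be encoded as a 0/1 vector. The shortened `LABELS` list and the helper `to_multi_hot` are illustrative assumptions, not part of the dataset tooling.

```python
# Hypothetical, shortened label list; the real dataset has 27 emotions plus "neutral".
LABELS = ["amusement", "anger", "joy", "neutral"]


def to_multi_hot(active_labels, all_labels=LABELS):
    """Return a 0/1 vector with a 1 for every label attached to the comment."""
    active = set(active_labels)
    unknown = active - set(all_labels)
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    return [1 if label in active else 0 for label in all_labels]


# A comment tagged with two emotions at once
vector = to_multi_hot(["amusement", "anger"])
print(vector)  # [1, 1, 0, 0]
```

Each position in the vector is an independent yes/no decision, which is why the classifier later uses one sigmoid per label rather than a single softmax.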

The dataset is available from TensorFlow Datasets under the Apache 2.0 License and contains 54,263 labeled documents. Here's how it looks:

Figure 1. GoEmotions dataset. Image by the author.

After a quick inspection, we can see a high degree of imbalance in the data, where the neutral category dominates:

Figure 2. Class imbalance in the GoEmotions dataset. Image by the author.
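A quick inspection of this kind can be sketched as follows. The toy rows are made-up values that only illustrate the computation; `label_frequencies` is a hypothetical helper, not code from the project.

```python
def label_frequencies(rows, label_names):
    """Share of rows in which each label is active (rows are multi-hot lists)."""
    n = len(rows)
    if n == 0:
        raise ValueError("rows must not be empty")
    counts = [sum(row[j] for row in rows) for j in range(len(label_names))]
    return {name: count / n for name, count in zip(label_names, counts)}


# Toy data with made-up values, only to illustrate the computation
label_names = ["anger", "joy", "neutral"]
rows = [[0, 0, 1], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

freqs = label_frequencies(rows, label_names)
# Ratio between the most and the least frequent label
imbalance_ratio = max(freqs.values()) / min(freqs.values())
print(freqs, round(imbalance_ratio, 2))
```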

3. Training set preprocessing

Our goal is to build a classifier that recognizes 15 emotions in plain-language text. Training on class-imbalanced data may introduce bias, as a fine-tuned model tends to favor the majority class and perform poorly on the minority classes, so pre-processing is important.

We applied a combination of methods to the training set (the validation and test sets remain unchanged) to address class imbalance and improve performance on the target emotions (fear, sadness, disgust, disapproval, annoyance, anger, disappointment, optimism, amusement, surprise, admiration, excitement, confusion, joy, love):

  • We undersampled the data by randomly removing “neutral” rows.
  • We performed synthetic oversampling of the most underrepresented emotion categories using ISMOTE (Improved Synthetic Minority Over-sampling Technique).
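The undersampling step can be sketched as below. This is a minimal illustration under stated assumptions: samples are dictionaries with a multi-hot `labels` list, and the function name, the `keep` budget, and the seed are hypothetical rather than taken from the project code.

```python
import random


def undersample_majority(samples, majority_index, keep, seed=42):
    """Randomly keep at most `keep` samples that carry the majority label."""
    rng = random.Random(seed)
    majority = [s for s in samples if s["labels"][majority_index] == 1]
    others = [s for s in samples if s["labels"][majority_index] == 0]
    if len(majority) > keep:
        majority = rng.sample(majority, keep)
    result = others + majority
    rng.shuffle(result)  # avoid blocks of same-class rows
    return result


# Toy data: label order is [anger, neutral]; 10 neutral rows, 3 anger rows
toy = [{"text": f"n{i}", "labels": [0, 1]} for i in range(10)]
toy += [{"text": f"a{i}", "labels": [1, 0]} for i in range(3)]

reduced = undersample_majority(toy, majority_index=1, keep=4)
print(len(reduced))  # 7
```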

The ISMOTE algorithm extends the standard SMOTE method by (1) expanding the sample generation space and (2) optimizing the sample distribution. The synthetically generated samples then follow a more realistic data distribution than those generated by the original method.

Figure 3. Flowchart of the ISMOTE algorithm. Source: Scientific Reports.
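For orientation, the sketch below shows the interpolation step of standard SMOTE, which ISMOTE builds on. It is not an implementation of ISMOTE: the expanded generation space and the distribution optimization described in [1] are omitted, and the function name and parameters are illustrative assumptions.

```python
import random


def smote_sample(minority, k=3, rng=None):
    """Standard SMOTE step: interpolate between a minority point and a near neighbour."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Sort the other minority points by squared Euclidean distance to `base`
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Toy 2-D embeddings of a minority class (made-up values)
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sample(minority_points, k=2, rng=random.Random(1))
print(synthetic)
```

Because the synthetic point always lies on the segment between two existing minority points, standard SMOTE never leaves the region already covered by the class, which is the limitation the expanded generation space in ISMOTE is meant to address.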

By undersampling the majority category and synthetically oversampling the minority categories to 4000 samples, we created a relatively balanced fine-tuning set. Sample ISMOTE code is available here.

Figure 4. Relative label frequency in the train (augmented), validation, and test sets. Image by the author.

4. SLM fine-tuning

Among the Mistral models, we chose one from the Small class (Small-3.1-24B-Instruct-2503), which fits our GPU and provides the multilingual capabilities we need for the classifier. The Unsloth framework makes the fine-tuning steps easier and faster than Transformers:

1. Data loading: loading the pre-processed training, validation, and test sets. We use a 60:20:20 split.

2. Loading the base model: loading Small-3.1-24B-Instruct-2503.

3. Applying LoRA: reducing hardware requirements.

4. Multilabel wrapper with focal loss function: adapts the model for multi-label classification. It also adds a weighted focal loss function that puts more weight on the loss of a selected set of emotions, prioritizing performance on them.

5. Evaluation metrics and training args: specifying evaluation metrics and model training parameters.

6. Model training: creating and running the trainer.

7. Testing: evaluating the performance of the best model on the test set.
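The 60:20:20 split from step 1 can be sketched as follows. This only shows the split itself; the helper name and the seed are assumptions, and the project's actual splitting code may differ.

```python
import random


def split_60_20_20(items, seed=42):
    """Shuffle and split items into train, validation, and test parts (60:20:20)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding at the boundaries
    n_train = (len(items) * 6) // 10
    n_val = (len(items) * 2) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


train_part, val_part, test_part = split_60_20_20(range(100))
print(len(train_part), len(val_part), len(test_part))  # 60 20 20
```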

4.1. Coding

Here is the code implementation.

4.1.1. Data loading

import json
from datasets import Dataset

# Load the augmented train, validation, and test sets
BASE = r"augmented"

def load_split(path: str) -> Dataset:
    with open(path, encoding="utf-8") as f:
        d = json.load(f)
    return Dataset.from_dict({"input_embeds": d["X"], "labels": d["y"]})

train_dataset = load_split(f"{BASE}/train.json")
val_dataset   = load_split(f"{BASE}/val.json")
test_dataset  = load_split(f"{BASE}/test.json")

# Infer the embedding dimension from the first training sample
EMBED_DIM = len(train_dataset[0]["input_embeds"])

# Return PyTorch tensors
train_dataset.set_format("torch")
val_dataset.set_format("torch")
test_dataset.set_format("torch")

4.1.2. Loading the base model

import torch
from unsloth import FastLanguageModel

# Load the base model with Unsloth FastLanguageModel
MODEL_NAME = "unsloth/Mistral-Small-3.1-24B-Instruct-2503"

base_model, _ = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=2,
    load_in_4bit=True,
    dtype=torch.bfloat16,
)

4.1.3. Applying LoRA

# Apply low-rank adaptation (LoRA)
base_model = FastLanguageModel.get_peft_model(
    base_model,
    r=16,
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
)
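To see why LoRA reduces hardware requirements, the sketch below counts the trainable parameters it adds to a single weight matrix. For a weight of shape d_out × d_in, LoRA trains two low-rank factors with r · (d_in + d_out) parameters in total, and with `use_rslora=False` the update is scaled by `lora_alpha / r`. The layer shape is a hypothetical value chosen for illustration, not read from the Mistral configuration.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters LoRA adds to one d_out x d_in weight: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Hypothetical layer shape for illustration; r and lora_alpha match the settings above
d_in, d_out, r, lora_alpha = 4096, 4096, 16, 32

full = d_in * d_out                         # parameters in the frozen weight
added = lora_param_count(d_in, d_out, r)    # trainable LoRA parameters
scaling = lora_alpha / r                    # factor applied to the low-rank update

print(added, full, scaling)  # 131072 16777216 2.0
```

In this toy case the trainable adapter is under 1% of the size of the frozen weight it adapts.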

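4.1.4. Multilabel wrapper with focal loss function

import torch
from torch import nn

# Focal loss weights for preferred labels
# (EMOTION_LABELS, the ordered list of label names, must be defined beforehand)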
FOCAL_ALPHA_DEFAULT   = 0.25
FOCAL_ALPHA_PREFERRED = 0.75

PREFERRED_LABELS = {
    "fear", "sadness", "disgust", "disapproval", "annoyance",
    "anger", "disappointment", "optimism", "amusement", "surprise",
    "admiration", "excitement", "confusion","joy","love"
}

FOCAL_ALPHA_PER_LABEL: list[float] = [
    FOCAL_ALPHA_PREFERRED if lbl in PREFERRED_LABELS else FOCAL_ALPHA_DEFAULT
    for lbl in EMOTION_LABELS
]

"Per-label weighted focal binary cross-entropy for multi-label problems"
class FocalLossWithAlpha(nn.Module):
        def __init__(self, alpha: list[float], gamma: float = 2.0):
        super().__init__()
        self.register_buffer("alpha", torch.tensor(alpha, dtype=torch.float32))
        self.gamma = gamma
    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        probs   = torch.sigmoid(logits)
        p_t     = probs * targets + (1.0 - probs) * (1.0 - targets)
        alpha_t = self.alpha * targets + (1.0 - self.alpha) * (1.0 - targets)
        focal_w = alpha_t * (1.0 - p_t) ** self.gamma
        bce     = nn.functional.binary_cross_entropy_with_logits(
            logits, targets, reduction="none"
        )
        return (focal_w * bce).mean()
# Multilabel classification wrapper with focal loss class weighting
class MistralForMultiLabel(nn.Module):
    is_loaded_in_4bit = True

    def __init__(self, backbone: nn.Module, num_labels: int,
                 hidden_size: int, embed_dim: int):
        super().__init__()
        self.backbone = backbone
        _device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.projection = nn.Sequential(
            nn.Linear(embed_dim, hidden_size // 2),
            nn.GELU(),
            nn.Linear(hidden_size // 2, hidden_size),
        ).to(_device)
        self.dropout    = nn.Dropout(0.1).to(_device)
        self.classifier = nn.Linear(hidden_size, num_labels).to(_device)
        self.focal_loss = FocalLossWithAlpha(FOCAL_ALPHA_PER_LABEL).to(_device)

    def gradient_checkpointing_enable(self, gradient_checkpointing_kwargs=None):
        self.backbone.gradient_checkpointing_enable(gradient_checkpointing_kwargs)

    def gradient_checkpointing_disable(self):
        self.backbone.gradient_checkpointing_disable()

    def forward(
        self,
        input_embeds: torch.Tensor,
        labels: torch.Tensor | None = None,
        **kwargs,
    ):
        B = input_embeds.size(0)
        projected = self.projection(input_embeds).unsqueeze(1)
        attn_mask = torch.ones(B, 1, device=input_embeds.device)

        outputs = self.backbone.base_model.model.model(
            inputs_embeds=projected,
            attention_mask=attn_mask,
            output_hidden_states=True,
        )
        pooled = outputs.hidden_states[-1][:, 0, :]
        logits = self.classifier(self.dropout(pooled))

        loss = self.focal_loss(logits, labels.float()) if labels is not None else None
        return {"loss": loss, "logits": logits}
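The effect of the per-label focal weighting can be checked on single values. The sketch below re-expresses the same formula in plain Python for one label at a time; `focal_bce` is an illustrative helper, and the logits are made-up inputs.

```python
import math


def focal_bce(logit, target, alpha, gamma=2.0):
    """Focal binary cross-entropy for one label, mirroring the formula used above."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p          # probability of the true value
    alpha_t = alpha if target == 1 else 1.0 - alpha
    bce = -math.log(p_t)
    return alpha_t * (1.0 - p_t) ** gamma * bce


# A confidently correct positive (logit 3) versus a missed positive (logit -1)
easy = focal_bce(3.0, 1, alpha=0.75)
hard = focal_bce(-1.0, 1, alpha=0.75)

# The same missed positive for a non-preferred label (alpha 0.25)
hard_default = focal_bce(-1.0, 1, alpha=0.25)

print(easy, hard, hard_default)
```

Two properties follow from the formula: well-classified examples contribute far less than misclassified ones, and a missed positive for a preferred label (alpha 0.75) is weighted three times as heavily as the same miss for a non-preferred label (alpha 0.25). Note that for negative targets the weight is 1 - alpha, so the ordering is reversed there.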

4.1.5. Evaluation metrics and training args

# Specify the evaluation function
import torch
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Independent sigmoid per label, thresholded at 0.5
    probs = torch.sigmoid(torch.tensor(logits)).numpy()
    preds = (probs >= 0.5).astype(int)
    labels = labels.astype(int)

    exact_accuracy  = accuracy_score(labels, preds)
    macro_f1        = f1_score(labels, preds, average="macro", zero_division=0)
    micro_f1        = f1_score(labels, preds, average="micro", zero_division=0)
    macro_precision = precision_score(labels, preds, average="macro", zero_division=0)
    macro_recall    = recall_score(labels, preds, average="macro", zero_division=0)

    per_class_f1        = f1_score(labels, preds, average=None, zero_division=0)
    per_class_recall    = recall_score(labels, preds, average=None, zero_division=0)
    per_class_precision = precision_score(labels, preds, average=None, zero_division=0)
    per_class_accuracy  = (preds == labels).mean(axis=0)

    per_class_metrics = {}
    for i, emotion in enumerate(EMOTION_LABELS):
        per_class_metrics[f"f1_{emotion}"]        = float(per_class_f1[i])
        per_class_metrics[f"recall_{emotion}"]    = float(per_class_recall[i])
        per_class_metrics[f"precision_{emotion}"] = float(per_class_precision[i])
        per_class_metrics[f"accuracy_{emotion}"]  = float(per_class_accuracy[i])

    return {
        "exact_accuracy":   exact_accuracy,
        "macro_f1":         macro_f1,
        "micro_f1":         micro_f1,
        "macro_precision":  macro_precision,
        "macro_recall":     macro_recall,
        **per_class_metrics,
    }

# Specify hyperparameters (OUTPUT_DIR must be defined beforehand)
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,            # where checkpoints and logs are written
    eval_strategy="epoch",            # run evaluation once per epoch
    save_strategy="epoch",            # save checkpoint once per epoch
    per_device_train_batch_size=8,    # samples per GPU per step
    per_device_eval_batch_size=16,    # larger batch is fine — no gradients
    gradient_accumulation_steps=4,    # effective batch = 8 × 4 = 32
    num_train_epochs=15,              # total passes over the training data
    learning_rate=1e-4,               # peak LR after warmup
    bf16=True,                        # bfloat16 mixed precision
    optim="adamw_8bit",               # 8-bit AdamW
    warmup_ratio=0.05,                # first 5 % of steps ramp LR from 0 to peak
    lr_scheduler_type="cosine",       # cosine decay from peak LR to ~0
    logging_steps=25,                 # print loss/LR to console every 25 steps
    logging_first_step=True,          # also log step 1 to catch early instability
    load_best_model_at_end=True,      # restore best checkpoint after training ends
    metric_for_best_model="macro_f1", # criterion used to select the best checkpoint
    greater_is_better=True,           # higher macro_f1 is better in evaluation
    gradient_checkpointing=False,     # already configured via Unsloth in get_peft_model
    remove_unused_columns=False,      # keep input_embeds column
    save_total_limit=15,              # keep all checkpoints on disk to load the best model
    weight_decay=0.01,                # L2 regularisation on all trainable parameters
)
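The learning-rate settings above (5% linear warmup followed by cosine decay) can be sketched as a simple function of the optimizer step. This is a simplified illustration of a warmup-plus-cosine schedule, not the library's own implementation, and the total number of steps is a hypothetical value.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_ratio=0.05):
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical number of optimizer steps
schedule = [lr_at_step(s, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]

# Each optimizer step sees 8 samples x 4 accumulation steps per device
effective_batch = 8 * 4
print(schedule[0], max(schedule), effective_batch)
```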

4.1.6. Model training

import os
import torch
from transformers import Trainer

# Custom trainer for multi-label fine-tuning
class MultiLabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs, labels=labels)
        loss = outputs["loss"]
        return (loss, outputs) if return_outputs else loss

    def _save_checkpoint(self, model, trial, metrics=None):
        super()._save_checkpoint(model, trial)
        ckpt_dir = self._get_output_dir(trial)
        # Save head
        torch.save({
            "projection": model.projection.state_dict(),
            "classifier":  model.classifier.state_dict(),
        }, os.path.join(ckpt_dir, "head_weights.pt"))
        # Save LoRA adapter explicitly (bypasses bitsandbytes serialization issues)
        model.backbone.save_pretrained(os.path.join(ckpt_dir, "lora_adapter"))

    def _load_best_model(self):
        best_ckpt = self.state.best_model_checkpoint
        if not best_ckpt:
            return
        # Restore head
        head_path = os.path.join(best_ckpt, "head_weights.pt")
        if os.path.exists(head_path):
            head = torch.load(head_path, map_location="cpu")
            self.model.projection.load_state_dict(head["projection"])
            self.model.classifier.load_state_dict(head["classifier"])
            print(f"Head restored from: {best_ckpt}")
        else:
            print(f"WARNING: head_weights.pt not found in {best_ckpt}")
        # Restore LoRA adapter
        lora_path = os.path.join(best_ckpt, "lora_adapter")
        if os.path.exists(lora_path):
            self.model.backbone.load_adapter(lora_path, adapter_name="default")
            print(f"LoRA restored from: {best_ckpt}")
        else:
            print(f"WARNING: lora_adapter/ not found in {best_ckpt}")

# Set up the trainer (`model` is assumed to be a MistralForMultiLabel instance
# wrapping the LoRA-adapted base model)
trainer = MultiLabelTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

# Launch training
trainer.train()

The 15-epoch fine-tuning took 9 hours and 30 minutes on a machine with an NVIDIA RTX 6000 GPU and 192 GB of VRAM, with the best model loaded at the end.

4.1.7. Model testing

Let's demonstrate the performance on the test dataset. The standard evaluation metrics for each class are precision, recall, and F1. We can see the best performance on the target emotions, with F1 scores above 0.7 in most cases. Full results are on the model card.

Emotion          Precision   Recall   F1       N
admiration       0.7415      0.6354   0.6844   993
amusement        0.7810      0.7422   0.7611   543
anger            0.7423      0.7367   0.7395   395
annoyance        0.7049      0.5452   0.6148   609
confusion        0.7576      0.8251   0.7899   303
disappointment   0.8487      0.8459   0.8473   305
disapproval      0.7208      0.5841   0.6453   517
disgust          0.8396      0.9368   0.8856   190
excitement       0.8240      0.9366   0.8767   205
fear             0.9112      0.9686   0.9390   159
joy              0.7577      0.8024   0.7794   339
love             0.7424      0.7903   0.7656   496
optimism         0.8145      0.7636   0.7882   368
sadness          0.8534      0.8899   0.8713   327
surprise         0.8456      0.8555   0.8505   256

Macro precision  0.8295
Macro recall     0.8184
Micro F1         0.7527
Macro F1         0.8215
Table 1: Performance of Mistral Small 3.1-GoEmotions on the test set
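As a sanity check on Table 1, each F1 value should be the harmonic mean of the two columns before it, read here as precision and recall. The sketch below recomputes F1 for three rows copied from the table; small differences are expected because the table values are rounded to four decimals.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table 1
table_rows = {
    "disgust": (0.8396, 0.9368, 0.8856),
    "fear": (0.9112, 0.9686, 0.9390),
    "sadness": (0.8534, 0.8899, 0.8713),
}

for emotion, (p, r, reported) in table_rows.items():
    recomputed = f1_from_precision_recall(p, r)
    print(f"{emotion}: recomputed={recomputed:.4f} reported={reported:.4f}")
```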

5. Summary

Now let's summarize the key points in the article. Requirements and full code are in this repo.

  • Emotion recognition modeling extends sentiment analysis by breaking the overall sentiment score down into its component emotions.
  • MistralSmall-3.1.GoEmotions is openly available on Hugging Face under the Apache 2.0 license. The repo also includes an inference guide.
  • Use cases include brand and social media monitoring, and email segmentation.

Petr Koráb is the founder of Text Mining Stories, a Prague-based development and consulting company. Read more about advanced NLP on our blog.

AI statement. Some parts of the code were updated by Sonnet 4.6 (Cursor). No text was generated using AI.

Acknowledgments. The National Bank of Slovakia Foundation supported this development. I thank Martin Feldkircher, Václav Jež, and Michala Moravcová for comments and suggestions.

References

[1] Ying Li, Yali Yang, Peihua Song, Lian Duan, Rui Ren. 2025. An improved SMOTE algorithm for enhanced imbalanced data classification by expanding sample generation space. Scientific Reports, 15 (23521).

[2] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 2020. Multilingual Denoising Pre-training for Neural Machine Translation. Transactions of the Association for Computational Linguistics, 8, pp. 726-742.
