Machine Learning

RED: Scaling Text Classification with Expert Delegation

With the new age of problem-solving augmented by large language models (LLMs), only a handful of problems remain that still have subpar solutions. Most classification problems (at a proof-of-concept level) can be solved by prompting an LLM to 70–90% accuracy/F1 with good prompt-engineering strategies and in-context learning (ICL) examples.

What happens when you need to consistently achieve higher performance than that — when prompt engineering is no longer enough?

The classification conundrum

Text classification is one of the oldest and best-understood examples of supervised learning. Given that foundation, it really should not be hard to build robust, well-performing classifiers that handle a large number of classes, right…?

Welp. It is.

In fact, it is the combination of constraints under which such a classifier is usually expected to work that makes the problem hard:

  • A low amount of training data per class
  • High classification accuracy (which plummets as you add more classes)
  • The possibility of adding new classes to an existing set of classes
  • Fast training/inference
  • Cost-effectiveness
  • A (potentially) very large number of training classes
  • (Potentially) endless retraining of some classes due to data drift, etc.

Have you ever tried building a classifier beyond a few dozen classes under these conditions? (I mean, even GPT can do a decent job for up to ~30 text classes with just a few samples each…)

And thinking of the GPT route — if you have more than a couple dozen classes, or a sizeable volume of data to classify, you will quickly reach deep into your pockets, given the system prompt, user prompt, and few-shot example tokens needed to classify each sample. That is after coming to terms with the throughput and rate limits of the API, even if you run your queries asynchronously.

In applied ML, problems like this are generally tricky to solve because they do not fully satisfy the requirements of either supervised or unsupervised learning. This pain point is exactly what the RED algorithm addresses: semi-supervised learning, where the training data per class is not enough to build (quasi-)traditional classifiers.

The RED algorithm

RED: Recursive Expert Delegation is a novel framework that changes how we approach text classification. It is an applied ML paradigm — i.e., there is no fundamentally new architecture here, but rather a curation of ideas that work best together to build something effective and scalable.

In this post, we will work through a specific scenario: a large number of text classes (100–1,000), only a few samples per class (30–100), and a non-trivial number of samples needing classification (10,000–100,000). We approach this as a semi-supervised learning problem via RED.

Let's dive in.

How it works

A simple representation of what RED does

Instead of having a single classifier handle a large number of classes, RED intelligently:

  1. Divides and conquers — breaks the label space (the large set of input labels) into multiple subsets of labels, using a greedy label-subset formation approach.
  2. Learns efficiently — trains specialized classifiers for each subset. This step focuses on building a classifier that generalizes well over noise, where the "noise" intelligently represents data from the other subsets.
  3. Delegates to an expert — employs LLMs as expert oracles for label validation and correction only, similar to having a team of domain experts. Using an LLM as a proxy empirically "mimics" how a human expert validates an output.
  4. Retrains recursively — keeps retraining with newly validated samples added back, until there are no more samples left to add, or the information gained from the available data plateaus.

The intuition behind it is not hard to grasp: active learning employs a human as a domain expert to consistently "correct" or "validate" the outputs of an ML model, with continuous retraining, and stops once the model reaches acceptable performance. We build on the same intuition with a few twists; the finer details will be covered in a pre-print publication to follow.

Let's dig deeper…

Greedy subset selection

When the number of input labels (classes) is high, the complexity of learning a clean decision boundary between those classes increases. As a result, classifier quality deteriorates as the number of classes grows. This is especially true when the classifier does not have enough samples to learn from — i.e., when each training class has only a few samples.

This is very reflective of a real-world scenario, and the primary motivation behind the creation of RED.

Some ways of improving a classifier's performance under these constraints:

  • Restrict the number of classes the classifier needs to distinguish between
  • Make the decision boundary between classes clearer, i.e., train the classifier on classes that are highly dissimilar to one another

Greedy subset selection does exactly this: it breaks the full label space into subsets, each containing at most n training labels. We build the subsets greedily, ensuring that every label added to a subset is the one least similar to the labels already in it:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def avg_embedding(candidate_embeddings):
    # Mean embedding of the current subset, used as its "centroid"
    return np.mean(candidate_embeddings, axis=0)


def get_least_similar_embedding(target_embedding, candidate_embeddings):
    # Return the candidate least similar (lowest cosine similarity) to the target
    similarities = cosine_similarity(
        np.asarray(target_embedding).reshape(1, -1),
        np.asarray(candidate_embeddings),
    )
    least_similar_index = int(np.argmin(similarities))
    return candidate_embeddings[least_similar_index]


def get_embedding_class(embedding, embedding_map):
    # Recover the class label for an embedding (numpy arrays are unhashable,
    # so we compare arrays directly instead of building a reverse dict)
    for cls, emb in embedding_map.items():
        if np.array_equal(emb, embedding):
            return cls
    return None


def select_subsets(embeddings, n):
    # embeddings: {class_label: average embedding of that class's samples}
    # n: maximum number of classes per subset
    visited = {cls: False for cls in embeddings}
    subsets = []
    current_subset = []

    while not all(visited.values()):
        if not current_subset:
            # Seed a new subset with the first class not yet assigned
            seed_cls = next(cls for cls, seen in visited.items() if not seen)
            current_subset.append(embeddings[seed_cls])
            visited[seed_cls] = True
        elif len(current_subset) >= n:
            # Subset is full - flush it and start a new one
            subsets.append(current_subset.copy())
            current_subset = []
        else:
            # Greedily add the unassigned class least similar to the subset centroid
            subset_average = avg_embedding(current_subset)
            remaining = {cls: emb for cls, emb in embeddings.items() if not visited[cls]}
            if not remaining:
                break  # no unassigned classes left

            least_similar = get_least_similar_embedding(
                target_embedding=subset_average,
                candidate_embeddings=list(remaining.values()),
            )
            least_similar_class = get_embedding_class(least_similar, remaining)
            if least_similar_class is not None:
                visited[least_similar_class] = True
            current_subset.append(least_similar)

    if current_subset:  # add whatever remains as the final (possibly smaller) subset
        subsets.append(current_subset)

    return subsets

The effect of this greedy subset sampling is that all the training labels are cleanly demarcated into subsets, each containing at most n classes. This makes the classification problem much easier compared with the full set of classes the classifier would otherwise have to discriminate between!
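To make the mechanics concrete, here is a toy invocation of select_subsets as defined above. The class names and the random vectors standing in for real per-class average embeddings are purely illustrative.

import numpy as np

# Toy per-class "average embeddings" - random vectors stand in for real
# sentence embeddings here, purely for illustration.
rng = np.random.default_rng(0)
class_embeddings = {
    label: rng.normal(size=384)
    for label in ["billing", "refunds", "shipping", "returns", "warranty", "login"]
}

# Greedily split 6 classes into subsets of at most 3 mutually dissimilar classes
subsets = select_subsets(class_embeddings, n=3)
print(f"{len(subsets)} subsets, sizes: {[len(s) for s in subsets]}")  # e.g. 2 subsets of 3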

Subset classification with noise oversampling

Classifiers are trained in cascade after the initial label-subset formation — i.e., each classifier only has to distinguish between the classes in its assigned subset.

Catch: if you have very little training data per class, you cannot carve out a hold-out set and test against it meaningfully. Should you even do that? And then how do you know whether your classifier is working well?

We approached this problem differently — we defined the fundamental job of the classifier to be pre-emptive classification of a sample. No sample is classified and then "corrected" afterwards: a prediction only needs to be validated before it is accepted.

Alongside this, we designed how every classifier treats its data:

  • n + 1 classes, where the last class is noise
  • Noise: data from classes outside the current classifier's scope. The noise class is oversampled to roughly 2x the average size of the classifier's per-label data

Oversampling the noise acts as a safety net, making it more likely that data actually belonging to another class is predicted as noise rather than slipping through for validation.
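As an illustration of this data design, here is a minimal sketch of assembling the training set for one subset classifier with an oversampled noise class. The function name, the "__noise__" label, and the helper signature are placeholders of my own; only the n + 1 classes and the ~2x noise oversampling come from the description above.

import numpy as np


def build_subset_training_data(subset_labels, data_by_label, seed=0):
    # data_by_label: {label: np.ndarray of shape (num_samples, dim)} - embedded samples
    # subset_labels: the n labels this particular classifier is responsible for
    rng = np.random.default_rng(seed)
    X_parts, y = [], []

    for label in subset_labels:
        X_parts.append(data_by_label[label])
        y.extend([label] * len(data_by_label[label]))

    # Everything outside the subset is candidate "noise" for the (n + 1)-th class
    noise_pool = np.concatenate(
        [samples for label, samples in data_by_label.items() if label not in subset_labels]
    )

    # Oversample noise to roughly 2x the average in-subset class size
    avg_class_size = int(np.mean([len(data_by_label[l]) for l in subset_labels]))
    noise_idx = rng.choice(len(noise_pool), size=2 * avg_class_size, replace=True)
    X_parts.append(noise_pool[noise_idx])
    y.extend(["__noise__"] * (2 * avg_class_size))

    return np.concatenate(X_parts), np.array(y)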

So how do you check whether the classifier is working well? In our experiments, we define this through the "uncertainty" in the classifier's predictions. Using entropy-based uncertainty-sampling principles, we can effectively gauge whether the classifier is "learning" or not, which acts as a proxy for classification performance. The classifier is trained iteratively until there is an inflection point in the amount of uncertainty in its predictions, or until the information gained per iteration is only a small delta.
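One simple way to operationalize that stopping rule is to track the mean prediction entropy per training iteration and stop once successive iterations barely move it. The helper name and the thresholds below are illustrative, not the exact values we used.

def uncertainty_has_plateaued(mean_uncertainties, min_delta=0.01, patience=2):
    # mean_uncertainties: mean prediction entropy per training iteration
    # Returns True once the last `patience` iteration-to-iteration changes
    # are all smaller than `min_delta`
    if len(mean_uncertainties) <= patience:
        return False
    recent = mean_uncertainties[-(patience + 1):]
    deltas = [abs(a - b) for a, b in zip(recent, recent[1:])]
    return all(d < min_delta for d in deltas)


# Example: entropy drops sharply, then flattens out -> stop retraining
history = [2.10, 1.40, 0.90, 0.895, 0.894]
print(uncertainty_has_plateaued(history))  # True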

Active labeling with an LLM as a proxy

This is where it gets really interesting — using an LLM as a proxy for a human validator. The human methodology we are mimicking here is active labeling.

Let's build some intuition for active labeling:

  • Use an ML model to learn from a sample input dataset, then predict over a large set of datapoints
  • For the predictions made on those datapoints, a subject-matter expert (SME) evaluates the "validity" of the predictions
  • Iteratively, the newly "corrected" samples are added to the ML model's training data
  • The ML model is continuously retrained and keeps making predictions until the SME is satisfied with the quality of those predictions

For active labeling to work, there are certain expectations placed on the SME:

  • When we expect a human expert to "validate" an output sample, the expert understands what the task is
  • A human expert will use their judgment of "what else" looks like label L when deciding whether a new sample should belong to L

Given these expectations and intuitions, we can "mimic" them using an LLM:

  • Give the LLM an "understanding" of what each label means. This can be done by using a larger model to critically analyze the relationship between {label: data mapped to the label} for every label. In our experiments, this was done using a 32B variant of DeepSeek.
Giving the LLM the ability to understand the "why and how" behind each label
  • Instead of predicting the correct label, have the LLM identify only whether a prediction is "valid" or "invalid" (i.e., the LLM only has to answer a binary question).
  • Reinforce the idea of what other valid samples for the label look like, i.e., for every pre-emptively classified sample, dynamically source the c closest samples from its (guaranteed valid) training set when prompting for validation (a rough judge is sketched right after this list).
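For concreteness, here is a rough sketch of what such a binary judge could look like. Everything here — the function names, the prompt wording, and the generic complete/embed callables (any text-in/text-out LLM call, any text embedder) — is an illustrative placeholder, not the exact implementation; in practice you would also map a classifier feature vector back to its source text before judging.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def make_llm_judge(complete, label_descriptions, train_texts, train_embeddings, embed, c=3):
    # complete: any text-in/text-out LLM call
    # label_descriptions: {label: LLM-generated explanation of what the label means}
    # train_texts / train_embeddings: validated training samples per label, and their embeddings
    # embed: function mapping one text to an embedding vector
    def judge(sample_text, predicted_label):
        # Dynamically source the c training samples closest to the sample being judged
        sims = cosine_similarity(
            np.asarray(embed(sample_text)).reshape(1, -1),
            np.asarray(train_embeddings[predicted_label]),
        )[0]
        closest = [train_texts[predicted_label][i] for i in np.argsort(sims)[::-1][:c]]

        prompt = (
            f"Label '{predicted_label}' means: {label_descriptions[predicted_label]}\n"
            "Known valid examples of this label:\n- " + "\n- ".join(closest) + "\n\n"
            f"Sample: {sample_text}\n"
            "Does this sample validly belong to the label? Answer only True or False."
        )
        # Binary verdict: the judge never picks a label, it only validates one
        return complete(prompt).strip().lower().startswith("true")

    return judge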

The result? A cost-effective framework that relies on a fast, cheap classifier to make pre-emptive classifications, and an LLM that verifies them using (the label's meaning + dynamically sourced training samples similar to the current classification):

import math

import numpy as np


def calculate_uncertainty(clf, sample):
    # Shannon entropy of the predicted class distribution (higher = more uncertain)
    predicted_probabilities = clf.predict_proba(sample.reshape(1, -1))[0]  # reshape for predict_proba
    return -sum(p * math.log(p, 2) for p in predicted_probabilities if p > 0)  # skip p = 0 to avoid log(0)


def select_informative_samples(clf, data, k):
    # Rank samples by prediction uncertainty and keep the top k
    uncertainties = [calculate_uncertainty(clf, sample) for sample in data]

    # Sort data by descending order of uncertainty
    sorted_data = sorted(zip(data, uncertainties), key=lambda x: x[1], reverse=True)

    # Return the k samples with the highest uncertainty
    return [sample for sample, _ in sorted_data[:k]]


def proxy_label(clf, llm_judge, k, testing_data):
    # llm_judge - any LLM with a system prompt tuned for verifying if a sample belongs
    # to a class. Expected output is a bool: True verifies the original classification,
    # False refutes it
    predicted_classes = clf.predict(testing_data)

    # Select the k most informative samples using uncertainty sampling
    informative_samples = select_informative_samples(clf, testing_data, k)

    # List to store samples whose predictions the LLM judge confirmed
    voted_data = []

    for sample in informative_samples:
        # Locate the sample's row to recover its predicted class
        # (numpy arrays have no .index(), so compare rows explicitly)
        sample_index = next(
            i for i, row in enumerate(testing_data) if np.array_equal(row, sample)
        )
        predicted_class = predicted_classes[sample_index]

        # Keep the sample only if the LLM judge agrees with the prediction
        if llm_judge(sample, predicted_class):
            voted_data.append(sample)

    # Return the list of samples with verified proxy labels
    return voted_data

By feeding the validated samples (voted_data) back into our classifier under controlled parameters, we arrive at the "recursive" part of our algorithm:
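A rough sketch of that outer loop, reusing proxy_label from above, is below. The function name, stopping conditions, and parameters here are illustrative rather than the exact recipe.

import numpy as np


def recursive_retrain(clf, llm_judge, X_train, y_train, unlabeled_data, k, max_iters=10):
    # Sketch of RED's outer loop: classify, validate with the LLM, fold back, retrain
    for _ in range(max_iters):
        clf.fit(X_train, y_train)

        # LLM validates the classifier's k most uncertain pre-emptive classifications
        validated = proxy_label(clf, llm_judge, k, unlabeled_data)
        if not validated:
            break  # nothing new was confirmed - information gain has dried up

        # Fold the validated samples back in with their (now trusted) predicted labels
        validated = np.array(validated)
        X_train = np.concatenate([X_train, validated])
        y_train = np.concatenate([y_train, clf.predict(validated)])

    return clf, X_train, y_train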

Recursive Expert Delegation: RED

By doing this, we were able to get close to human-expert levels of validation on multi-class datasets. Empirically, RED scales to around 1,000 classes while maintaining accuracy nearly on par with human experts (90%+ agreement).

I believe this is a significant achievement in applied ML, with real implications for production-grade expectations around cost, speed, scale, and adaptability. A technical report, to be published later this year, will highlight relevant code samples as well as the experimental setups used to achieve the stated results.

All images, unless otherwise noted, are by the author.

Interested in more details? Reach out to me on Medium or over email for a chat!
