Using a Local LLM as a Zero-Shot Classifier

In The Wisdom of Crowds, James Surowiecki argued that “Groups are incredibly smart, and often smarter than the smartest people in them.” He was writing about decision making, but the same principle applies to classification: get enough people to describe the same thing and a taxonomy begins to emerge, even if no two people put it the same way. The challenge is extracting that signal from the noise.

I had several thousand rows of free-text data and needed to do exactly that. Each line was a short natural-language annotation explaining why an automated security detection didn't apply, was only relevant for debugging, or was covered by a procedure already in place. One person wrote “this is test code, it has never been deployed anywhere.” Another wrote “non-production environment, safe to ignore.” A third wrote “only runs in the CI/CD pipeline during integration tests.” All three were saying the same thing, yet no two shared more than a word or two.

The taxonomy was in there. I just needed the right tool to pull it out. Traditional clustering and keyword matching couldn't handle the paraphrase variation, so I tried something I hadn't seen discussed much: using a locally hosted LLM as a zero-shot classifier. This post explores why it works, how it works, and some tips for building and running a pipeline like this yourself.

Why do traditional clustering methods struggle with short free text?

Conventional unsupervised clustering works by finding statistical regularities in some feature space. For long documents, this is usually fine: there is enough signal in word frequencies or embedding vectors to form coherent groups. But short, semantically dense text violates these assumptions in several specific ways.

Embedding similarity conflates different meanings. “This key is only used for development” and “This API key is hardcoded for convenience” produce similar embeddings because the vocabulary overlaps. But one is about a non-production environment and the other is about a deliberate security trade-off. K-means or DBSCAN cannot separate them because the vectors are too close.

Topic models surface words, not concepts. Latent Dirichlet Allocation (LDA) and its variants find word co-occurrence patterns. If your corpus consists of single-sentence annotations, the co-occurrence signal is too sparse to form meaningful topics. You end up with clusters defined by “test” or “code” or “security” rather than coherent themes.

Regex and keyword matching cannot handle paraphrase variation. You can write rules to catch “test code” and “non-production,” but you will miss “used only during CI,” “never deployed,” “development setup only,” and the many other phrasings that express the same idea.

The common thread: these methods operate on surface features (tokens, vectors, patterns) rather than semantic meaning. For classification tasks where meaning matters more than vocabulary, you need something that actually understands the language.

LLMs as zero-shot classifiers

The core idea is simple: instead of asking an algorithm to discover clusters, you define candidate categories from domain knowledge and ask a language model to classify each entry against them.

This works because LLMs process semantic meaning, not just token patterns. “This key is only used for development” and “non-production environment, safe to ignore” share almost no vocabulary, but a language model understands that they convey the same idea. This is not just an impression. Chae and Davidson (2025) compared ten models across zero-shot, few-shot, and fine-tuned training conditions and found that large LLMs in zero-shot mode performed competitively with fine-tuned BERT. LLMs outperformed state-of-the-art classification methods on three out of four benchmark datasets using zero-shot prompting alone, no labeled training data required.

The setup has three parts:

  • Candidate categories. A list of expected categories defined from domain knowledge. In my case, I started with about 10 expected themes (test code, input validation, framework protections, non-production environments, etc.) and expanded to 20 after reviewing a sample.
  • A classification prompt. It asks the model to return a category label and a short reason. Low temperature (0.1) for consistency. A short maximum output (100 tokens), since we only need the label, not a story.
  • A local LLM. I used Ollama to run models locally. There are no API costs, no data leaves my machine, and it's fast enough for thousands of entries.

Here is the core of the classification prompt:

CLASSIFICATION_PROMPT = """

Classify this text into one of these themes:

{themes}

Text:
"{content}"

Respond with ONLY the theme number and name, and a brief reason.
Format: THEME_NUMBER. THEME_NAME | Reason
Classification:

"""

And call Ollama:

# theme_list and entry come from earlier in the pipeline
prompt = CLASSIFICATION_PROMPT.format(themes=theme_list, content=entry)

response = ollama.generate(
    model="gemma2",
    prompt=prompt,
    options={
        "temperature": 0.1,  # Low temp for consistent classification
        "num_predict": 100,  # Short response, we just need a label
    },
)

Two things to be aware of. First, the temperature setting matters. At 0.7 or higher, the same input can produce different labels on every run. At 0.1, the model is nearly deterministic, which makes classification consistent. Second, capping num_predict keeps the model from generating explanations you don't need, which speeds things up significantly.
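The model's raw response still has to be parsed back into a label. Here is a minimal sketch of a parser for the THEME_NUMBER. THEME_NAME | Reason format, with a fallback for off-format output; the function name, regex, and fallback convention are my own assumptions, not the article's exact code:

```python
import re

def parse_classification(response_text: str) -> tuple[int, str, str]:
    """Parse a 'THEME_NUMBER. THEME_NAME | Reason' response.

    Returns (theme_number, theme_name, reason). Falls back to
    (-1, 'UNPARSEABLE', raw text) when the model drifts off-format,
    so bad responses can be reviewed instead of silently dropped.
    """
    match = re.search(r"(\d+)\.\s*([^|]+?)\s*\|\s*(.+)", response_text)
    if not match:
        return (-1, "UNPARSEABLE", response_text.strip())
    number, name, reason = match.groups()
    return (int(number), name.strip(), reason.strip())
```

Keeping a sentinel label for unparseable responses is worth the two extra lines: even at temperature 0.1, a small fraction of generations ignore the format instruction.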

Building a pipeline

A full pipeline has three stages: preprocessing, classification, and analysis.

Preprocessing strips content that adds tokens without adding classification signal. URLs, boilerplate phrases (“For more information, see…“), and formatting artifacts are all removed. Common terms are normalized (“false positive” becomes “FP,” “prod” becomes “production“) to reduce token variance. Deduplicating by content hash removes exact duplicates. This step cut my token budget by roughly 30% and makes classification more consistent.
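The preprocessing stage can be sketched in a few lines. This is a minimal illustration rather than the exact pipeline: the normalization map and regexes here are my own assumptions, and a real version would carry a longer, domain-specific list:

```python
import hashlib
import re

# Illustrative normalization map; a real pipeline's list is domain-specific.
NORMALIZATIONS = {
    r"\bfalse positive\b": "fp",
    r"\bprod\b": "production",
}

def preprocess(entries: list[str]) -> list[str]:
    """Strip URLs, collapse whitespace, normalize common terms,
    and drop exact duplicates by content hash."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for text in entries:
        text = re.sub(r"https?://\S+", "", text)          # drop URLs
        text = re.sub(r"\s+", " ", text).strip().lower()  # collapse whitespace
        for pattern, canonical in NORMALIZATIONS.items():
            text = re.sub(pattern, canonical, text)
        digest = hashlib.sha256(text.encode()).hexdigest()
        if text and digest not in seen:                   # exact-duplicate removal
            seen.add(digest)
            cleaned.append(text)
    return cleaned
```

Hashing the normalized text (rather than the raw text) means “Fix in prod” and “fix in production” collapse into one entry, which is usually what you want before paying per-token classification cost.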

Classification runs each entry through the LLM against the candidate categories. For 7,000 entries, this took about 45 minutes on a MacBook Pro using Gemma 2 (9B parameters). I also tested Llama 3.2 (3B), which was faster but less accurate on edge cases where two categories were close together. Gemma 2 handled ambiguous entries with better judgment.

One practical concern: long runs can fail partway through. The pipeline saves a checkpoint every 100 entries, so you can pick up where you left off.
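A checkpoint can be as simple as a JSON file keyed by entry index. A minimal sketch, assuming a classify_fn callable and a hypothetical checkpoint filename of my own choosing:

```python
import json
from pathlib import Path

CHECKPOINT = Path("classification_checkpoint.json")  # hypothetical filename

def classify_with_checkpoints(entries, classify_fn, every=100):
    """Classify entries, saving progress every `every` items so a
    crashed run can resume where it left off."""
    results = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}
    for i, text in enumerate(entries):
        key = str(i)
        if key in results:              # already done in a previous run
            continue
        results[key] = classify_fn(text)
        if (i + 1) % every == 0:        # periodic checkpoint
            CHECKPOINT.write_text(json.dumps(results))
    CHECKPOINT.write_text(json.dumps(results))  # final save
    return [results[str(i)] for i in range(len(entries))]
```

The resume check happens per entry, so a run that died at entry 4,200 re-does at most the 99 entries since the last checkpoint rather than the whole dataset.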

Analysis aggregates the results and produces a distribution chart. Here's what the output looked like:

Distribution of Semgrep “Memories” as classified by the LLM pipeline. Image used with permission.

The chart tells a clear story. More than a quarter of all entries described code that only runs in non-production environments. A further 21.9% described situations where a security framework already mitigated the risk. These two categories alone account for nearly half of the dataset, which is the kind of insight that is difficult to extract from unstructured text any other way.
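Producing that kind of distribution from the per-entry labels is a few lines with collections.Counter. A minimal sketch (function name is my own):

```python
from collections import Counter

def distribution(labels: list[str]) -> list[tuple[str, float]]:
    """Return (label, percent) pairs sorted by frequency, descending."""
    counts = Counter(labels)
    total = len(labels)
    return [(label, round(100 * n / total, 1)) for label, n in counts.most_common()]
```

Feeding the output into any plotting library gives the chart above; the sorted percentages alone are often enough for a first read of the data.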

When this method is not well suited

This technique works best in a particular niche: medium-scale datasets (hundreds to thousands of entries), semantically dense text, and situations where you have enough domain knowledge to define candidate categories but no labeled training data.

It's not the right tool if:

  • your categories are keyword-definable (just use regex),
  • you have labeled training data (train a supervised classifier; it will be faster and cheaper),
  • you need sub-second latency at scale (use embeddings and nearest-neighbor lookup),
  • or you genuinely don't know what categories exist. In that case, run exploratory topic modeling first to build intuition, then switch to LLM classification once you can define the categories.

Throughput is another constraint. Even on a fast machine, classifying one entry every fraction of a second means 7,000 entries take the better part of an hour. For datasets beyond 100,000 entries, you'll want an API-hosted model or a batching strategy.

Some applications worth trying

The pipeline generalizes to any problem where you have unstructured text and need structured categories.

Customer feedback. NPS responses, support tickets, and open-ended surveys all have the same shape: varied phrasings of a limited set of underlying themes. “Your app crashes every time I open settings” and “the settings page is broken on iOS” belong in the same category, but keyword matching will not catch that.

Bug report triage. Free-text bug descriptions can be automatically sorted by component, cause, or severity. This is especially useful when the reporter doesn't know which component is responsible.

Classifying code by purpose. Here's another one I haven't tried but find compelling: classifying code snippets, Semgrep rules, or configuration files by purpose (authentication, data access, error handling, logging). The same method applies: define categories, write a classification prompt, run the corpus through a local model.

Getting started

The pipeline is straightforward: define your categories, write a classification prompt, and run your data through a local model.

The hardest part isn't the code. It's defining categories that are distinct and mutually exclusive. My advice: start with a sample of 100 entries, sort them by hand, note which categories you keep reaching for, and use those as your candidate list. Then let the LLM scale the pattern.
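If you go the hand-sorting route, it helps to make the sample reproducible so you can revisit the same 100 entries after revising your category list. A minimal sketch (function name and seed choice are my own):

```python
import random

def draft_sample(entries: list[str], k: int = 100, seed: int = 0) -> list[str]:
    """Draw a reproducible random sample to hand-label while
    drafting candidate categories."""
    rng = random.Random(seed)           # fixed seed => same sample every run
    return rng.sample(entries, min(k, len(entries)))
```

Sampling without replacement and capping at the corpus size keeps this safe on small datasets too.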

I used this technique as part of a larger analysis of how security teams handle vulnerability findings. The classification results helped reveal which types of security context are most common across organizations, and the chart above is one output from that work. If you're interested in the security angle, the full report is available at that link.
