Tokenminning: How to Get More from Your Chatbot for Less

11 minute read

Tokenmaxxing is a virus spreading through big tech.

Engineers are being judged, directly or indirectly, by how much AI they can consume. More tokens, more output, more compute. Some companies even had leaderboards.

It’s the 2026 version of ranking engineers by lines of code.

Less is more

Tokenminning is the antithesis of tokenmaxxing.

Token efficiency becomes increasingly important as your usage grows. Every unnecessary token increases cost, latency and complexity.

Tokenminning is a new pattern that systematically minimizes token use while maintaining, if not improving, the performance of your AI agents.

In this article, I cover practical strategies for tokenminning that I use to reduce costs. All of these strategies can be deployed without significant refactoring.

The result: significantly lower AI costs without a sacrifice in quality.

The Cost of Tokenmaxxing

Tokenmaxxing and other naïve approaches to AI usage share a common assumption: inputs with more tokens lead to better outputs.

This assumption leads to larger-than-necessary prompts, loaded with uncompressed context and RAG bloat. In some cases it can improve performance, but it introduces some significant problems.

1. Financial Cost

Unsurprisingly, costs skyrocket.

Every token sent to and generated by a model has a price. Interactive chats have reasonably sized inputs and outputs, so naively estimated costs seem manageable at first.

However, real agent token usage violates any assumptions you may have had about average token use. Long-running agents on frontier models can rack up ridiculous costs.

What are real costs for using AI day to day?

I performed a quick analysis of my own personal usage, for interactive chats and for my agents.

For context, I am currently the head of AI for a biotech startup. I use AI as an interactive research assistant (medical papers, cancer research, machine learning), and I’ve also developed multiple agents which perform the following tasks:

Code analysis and automated unit testing: I push code daily, and these agents perform vulnerability analysis, surface potential issues and fix minor issues
Experiment management: track and document multiple running training experiments
Devops: managing AWS resources, automating instance provisioning and general cloud management assistance to optimize costs

Here is the breakdown:

Source	Input Tokens	Output Tokens	Total	Daily Total
Interactive Chat (per chat)	492	1,650	2,142	42,840
Agents (per invocation)	56,497	4,594	61,091	1,221,820

As you can see, my interactive chats pale in comparison to my agents. With these numbers, I would spend roughly $40 per day in API usage costs using Claude Opus 4. I spend quite a bit less, due to optimizations.
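To make the arithmetic behind the table explicit, here is a minimal sketch. The Daily Total column is consistent with 20 chats and 20 agent invocations per day (2,142 × 20 = 42,840 and 61,091 × 20 = 1,221,820); that count is inferred from the table, not stated in it. The prices per million tokens are parameters, and the values in the last line are placeholders rather than any provider's real rates, so substitute your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Average token counts for a single chat or agent invocation."""
    input_tokens: int
    output_tokens: int

    @property
    def total(self) -> int:
        return self.input_tokens + self.output_tokens


def daily_tokens(usage: Usage, runs_per_day: int) -> int:
    """Total tokens per day for one usage source."""
    return usage.total * runs_per_day


def daily_cost(usage: Usage, runs_per_day: int,
               input_price_per_mtok: float,
               output_price_per_mtok: float) -> float:
    """Daily spend in dollars, given prices per million tokens."""
    input_cost = usage.input_tokens * runs_per_day * input_price_per_mtok / 1_000_000
    output_cost = usage.output_tokens * runs_per_day * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


# Per-invocation averages from the table above.
CHAT = Usage(input_tokens=492, output_tokens=1_650)
AGENT = Usage(input_tokens=56_497, output_tokens=4_594)
RUNS_PER_DAY = 20  # assumption: inferred from the Daily Total column

if __name__ == "__main__":
    print(daily_tokens(CHAT, RUNS_PER_DAY))   # 42840
    print(daily_tokens(AGENT, RUNS_PER_DAY))  # 1221820
    # Placeholder prices (dollars per million tokens); replace with real rates.
    print(round(daily_cost(AGENT, RUNS_PER_DAY, 10.0, 50.0), 2))
```

Note that input tokens make up 56,497 of the 61,091 tokens in an average agent invocation, which is why the strategies below focus on what goes into the prompt and which model receives it.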

Some engineers spend far more per day, especially those who use autonomous engineering agents. In some cases, engineers have reported spending over $10,000 per week¹.

As a result, big tech companies have started to enforce AI usage limits.

2. Inference Speed

More tokens also mean more latency.

Larger prompts take longer to process, increasing time-to-first-token and overall response time. This can be detrimental for customer-facing AI or time-sensitive agents.

3. Quality

A big misconception is that more context produces better results. This is simply not the case, especially with very long contexts.

Models have limited attention. As prompts grow, important information competes with irrelevant details for the model’s focus. “Context rot” is a real problem²: LLMs become less effective as the context grows. Attention also degrades unevenly across a large context. Models use information at the beginning and end of the context window well, but performance drops for information in the middle³.

In general, the industry is shifting its mindset toward context quality over context volume.

🛠️ Real strategies for “tokenminning”

If you haven’t already experienced the true cost of using AI, the problems outlined above should now be evident.

AI engineers need to start thinking about how to realistically reduce token use while keeping performance high.

Here are a few strategies I use to reduce AI costs. They are conceptually simple, so they won’t derail existing AI workflows.

Strategy #1: Routing

Realistically, most prompts don’t need a frontier model.

It’s true that models like Claude Opus or GPT 5.5 excel at complex reasoning, planning, and difficult coding tasks.

But simple requests, like tool usage, summarization and classification, can be handled by smaller, lower-cost models. You may even route these to a quantized local model and skip the API cost altogether.

Here, routing isn’t a token minimization strategy but a brute-force cost reduction technique that works devastatingly well. As a result, many companies are doing it.

Here is a high level summary of how it works:

Figure: An example of prompt routing. Image by Author.

A lightweight self-hosted webservice intercepts each prompt request

This webservice conforms to either the OpenAI Chat Completions API or the Anthropic Messages API, depending on your provider. It is typically referred to as an “LLM Gateway.”

You can use the terribly bloated LiteLLM library or roll your own*, which is ~1 day of actual work and testing (possibly less if you code it agentically).

Within the webservice, you will need the following hooks for each prompt:

Process: run any preprocessing required for each prompt
Evaluate: run classification on the processed prompt
Route: based on the evaluation, apply predefined rules to select the model
Execute: execute the LLM call with the selected model
Validate (optional, but helpful): run validation rules on the output
Return: format and return the result to the caller

* There are a few hidden complexities to rolling your own. Pay attention to streaming the response back from the provider, as well as token counting, which varies across providers.
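The six hooks above can be sketched as a single request handler. This is a minimal illustration, not a gateway implementation: the names `Evaluation`, `process` and `handle` are hypothetical, the classifier and the provider call are passed in as plain callables, and streaming and token counting are left out entirely.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Evaluation:
    intent: str        # e.g. "summarization"
    complexity: float  # score from 0 to 1.0


def process(prompt: str) -> str:
    """Process: a stand-in preprocessing step that normalizes whitespace."""
    return " ".join(prompt.split())


def handle(prompt: str,
           evaluate: Callable[[str], Evaluation],
           route: Callable[[Evaluation], str],
           execute: Callable[[str, str], str],
           validate: Callable[[str], bool] = lambda text: bool(text.strip())) -> dict:
    """Run one request through the hooks in order."""
    cleaned = process(prompt)         # Process
    evaluation = evaluate(cleaned)    # Evaluate: intent + complexity
    model = route(evaluation)         # Route: pick the target model
    output = execute(model, cleaned)  # Execute: call the provider
    ok = validate(output)             # Validate: here, just "non-empty"
    return {                          # Return: shape is an assumption
        "model": model,
        "output": output,
        "valid": ok,
        "intent": evaluation.intent,
        "complexity": evaluation.complexity,
    }


if __name__ == "__main__":
    # Stub callables stand in for the classifier and the LLM call.
    result = handle(
        "  Summarize   this changelog ",
        evaluate=lambda p: Evaluation("summarization", 0.1),
        route=lambda e: "local-model" if e.complexity < 0.2 else "frontier-model",
        execute=lambda model, p: f"[{model}] {p}",
    )
    print(result["model"])
```

Keeping each hook a separate callable makes it easy to swap the classifier or the routing rules without touching the request path.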

Evaluate each prompt with a pretrained model

Within your evaluate hook, one or more pretrained classifiers evaluate the prompt, returning both the intent of the prompt and its complexity, a score from 0 to 1.0.

🤚 So I have to train a classifier on prompts?

Possibly. Here are your options:

If you want to go with an off-the-shelf classifier, the leader is NVIDIA’s NemoCurator Prompt Task and Complexity Classifier⁴. It uses a fusion complexity score and evaluates prompts for creativity, reasoning, specialized domain knowledge, etc., and its architecture and weights are publicly available.

You may find that it’s more effective to train your own. NemoCurator was trained on prompts from many domains, such as creative writing, science and programming.

This is wasted model capacity if your team is mostly working on distributed machine learning and RL, and the quality of the predictions will suffer.

Not all is lost. The NemoCurator classifier can be fine-tuned, or its architecture trained from scratch, on prompts taken directly from you (or your team).

Training a classifier for prompts

NVIDIA’s prompt task and complexity classifier uses a pretrained DeBERTa⁵ backbone with multiple classification heads. We modified this architecture, leaving only two heads: one for the intent class and one for the complexity score.

Intent Classes

We use the following intent classes, largely inspired by the original research from NVIDIA.

Open QA
Closed QA
Tool Call
Summarization
Code Generation
Classification
Rewrite
Brainstorming
Extraction

Complexity

To map complexity to a scalar, we first need to quantify the level of reasoning required to sufficiently answer the prompt. The hybrid complexity score used in NemoCurator did not fit our needs.

Data collection

Our LLM gateway collected prompts over a period of time (we collected over 10,000 prompts, but a minimum of 4,000 is recommended).

For each prompt:

We evaluated it with four separate models at different reasoning levels: a local quantized LLM (Qwen 3.5 9B), a low-tier model (GPT 5 mini), a medium-tier reasoning model (GPT 5.5) and a frontier reasoning model (o3-pro).

Model	Tier	Low Reasoning	Medium Reasoning	High Reasoning
Qwen 3.5 9B	Local	❌	❌	❌
GPT-5 mini	Low-tier	✅	✅	✅
GPT-5.5	Medium-tier	✅	✅	✅
o3-pro	Frontier	❌	❌	✅

We selected each model based on its cost category. Anthropic (or another provider) offers comparable tiers.

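Running each model at the reasoning levels above gave 8 separate answers to each prompt: one from Qwen (which has no reasoning levels), three each from GPT-5 mini and GPT-5.5, and one from o3-pro.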

To classify the intent of the prompt: we used GPT 5.5 (medium reasoning) to label the intent of each prompt, given the input and metadata from the LLM gateway (our gateway collects the calling application and agent).

To quantify the complexity: we again used GPT 5.5 (medium reasoning), this time to determine the lowest of the 8 levels at which the question was sufficiently answered.

Model	Score Method	Details
Qwen 3.5 9B	Base	A base score of 0.1 given
GPT-5 mini	Base + Scaled Reasoning	A base score of 0.2 + a scaled value based on the reasoning level + number of tokens used
GPT-5.5	Base + Scaled Reasoning	A base score of 0.4 + a scaled value based on the reasoning level + number of tokens used
o3-pro	Base	A score of 1.0 given

This complexity scoring method gave us a scalar value for each prompt, ranging from 0.1 to 1.0.
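The scoring table can be sketched as a small function. The base scores come from the table; everything about the "scaled value" is an assumption, since the exact scaling isn't specified here. This sketch assumes the scaled part stays below the next tier's base score, weights reasoning level and token usage equally, and normalizes tokens against a hypothetical `token_budget`.

```python
# Base scores from the table above.
BASE_SCORE = {"qwen-3.5-9b": 0.1, "gpt-5-mini": 0.2, "gpt-5.5": 0.4, "o3-pro": 1.0}

# Assumption: the scaled part adds at most this much, so each tier stays
# below the next tier's base score (0.2 stays under 0.4, 0.4 under 1.0).
MAX_BONUS = {"gpt-5-mini": 0.19, "gpt-5.5": 0.59}

# Assumption: reasoning levels map to evenly spaced fractions.
REASONING_FRACTION = {"low": 0.0, "medium": 0.5, "high": 1.0}


def complexity_score(model: str, reasoning: str = "low",
                     tokens_used: int = 0, token_budget: int = 8_000) -> float:
    """Score the cheapest (model, reasoning) level that answered the prompt."""
    base = BASE_SCORE[model]
    if model not in MAX_BONUS:
        # Qwen and o3-pro receive a fixed score, as in the table.
        return base
    token_fraction = min(tokens_used / token_budget, 1.0)
    # Assumption: reasoning level and token usage are weighted equally.
    scaled = 0.5 * REASONING_FRACTION[reasoning] + 0.5 * token_fraction
    return round(base + MAX_BONUS[model] * scaled, 3)


if __name__ == "__main__":
    print(complexity_score("qwen-3.5-9b"))
    print(complexity_score("gpt-5-mini", "medium", tokens_used=4_000))
    print(complexity_score("o3-pro"))
```

Whatever scaling you choose, the property worth preserving is monotonicity: a prompt that needed a higher tier should never score lower than one a cheaper tier could answer.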

This gave us the training data required to train the model.

Training the model

As mentioned previously, we reused the architecture introduced by NVIDIA’s NemoCurator classifier, which has a pretrained DeBERTa backbone with separate classification heads.

Our model uses a simplified version, with two heads:

A classifier head which was optimized via cross-entropy loss on the intent targets
A regression head which was optimized using MSE on the complexity scores

This gave us an eval accuracy of ~0.94 on intent class and a reasonable MSE, which we deemed accurate enough for routing.
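For readers who want to see the two objectives side by side, here is the per-example arithmetic in plain Python. In a framework such as PyTorch you would apply the built-in cross-entropy and MSE losses to the two head outputs instead. How the two losses were combined is not stated above, so the simple weighted sum and the `regression_weight` parameter are assumptions.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target_index: int) -> float:
    """Cross-entropy for one example, using a numerically stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def squared_error(predicted: float, target: float) -> float:
    """Squared error for one example; averaging over a batch gives MSE."""
    return (predicted - target) ** 2


def joint_loss(intent_logits: Sequence[float], intent_target: int,
               complexity_pred: float, complexity_target: float,
               regression_weight: float = 1.0) -> float:
    """Weighted sum of the intent-head and complexity-head losses.

    regression_weight is a hypothetical knob for balancing the two heads.
    """
    return (cross_entropy(intent_logits, intent_target)
            + regression_weight * squared_error(complexity_pred, complexity_target))


if __name__ == "__main__":
    # Nine intent classes, with the model fairly confident in class 1.
    logits = [0.1, 3.0, 0.2, 0.0, 0.1, 0.0, 0.3, 0.1, 0.0]
    print(joint_loss(logits, intent_target=1,
                     complexity_pred=0.35, complexity_target=0.4))
```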

Note: The intricacies of the training process (training vs evaluation, fine-tuning learning rates, gradients, etc.) have intentionally been omitted to maintain the article’s focus.

This model gives us a way to deterministically evaluate each prompt.

Routing each prompt to the correct model

This is part science and part guesswork. No one routing framework will work for everyone.

To develop a routing table (with both the intent and the complexity), we considered the constraints our company has and the types of data we work with.

The evaluate step outputs:

{
    "prompt": "...",
    "intent": "closed_qa",
    "complexity": 0.4,
    // other metadata
    ...
}

Here, each combination of intent and complexity score is mapped to one of the models in our architecture.

Category	Low Complexity (x < 0.2)	Medium Complexity (0.2 <= x <= 0.7)	High Complexity (x > 0.7)
Open QA	`GPT 5.5 (High)`	`GPT 5.5 (High)`	`GPT 5.5 (High)`
Closed QA	`Qwen 3.5`	`GPT 5.5 (Med)`	`GPT 5.5 (High)`
Tool Call	`Qwen 3.5`	`GPT-5 mini (Low)`	`GPT-5 mini (Med)`
Summarization	`Qwen 3.5`	`GPT-5 mini (Low)`	`GPT 5.5 (High)`
Code Generation	`GPT 5.5 (Low)`	`GPT 5.5 (High)`	`o3-pro`
Classification	`Qwen 3.5`	`GPT-5 mini (Med)`	`GPT-5 mini (High)`
Rewrite	`GPT-5 mini (Low)`	`GPT-5 mini (Med)`	`GPT-5 mini (Med)`
Brainstorming	`GPT 5.5 (High)`	`GPT 5.5 (High)`	`GPT 5.5 (High)`
Extraction	`Qwen 3.5`	`GPT 5.5 (Low)`	`GPT 5.5 (High)`
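In code, the routing table above reduces to a dictionary lookup plus a bucketing function. A minimal sketch follows. The snake_case intent keys mirror the `closed_qa` value in the evaluate output, scores of exactly 0.2 are treated as medium complexity, and the fallback row for an unrecognized intent is a hypothetical choice, not part of the table.

```python
from typing import Dict, Tuple

# Columns: (low, medium, high) complexity, following the routing table above.
ROUTING_TABLE: Dict[str, Tuple[str, str, str]] = {
    "open_qa":         ("GPT 5.5 (High)", "GPT 5.5 (High)", "GPT 5.5 (High)"),
    "closed_qa":       ("Qwen 3.5", "GPT 5.5 (Med)", "GPT 5.5 (High)"),
    "tool_call":       ("Qwen 3.5", "GPT-5 mini (Low)", "GPT-5 mini (Med)"),
    "summarization":   ("Qwen 3.5", "GPT-5 mini (Low)", "GPT 5.5 (High)"),
    "code_generation": ("GPT 5.5 (Low)", "GPT 5.5 (High)", "o3-pro"),
    "classification":  ("Qwen 3.5", "GPT-5 mini (Med)", "GPT-5 mini (High)"),
    "rewrite":         ("GPT-5 mini (Low)", "GPT-5 mini (Med)", "GPT-5 mini (Med)"),
    "brainstorming":   ("GPT 5.5 (High)", "GPT 5.5 (High)", "GPT 5.5 (High)"),
    "extraction":      ("Qwen 3.5", "GPT 5.5 (Low)", "GPT 5.5 (High)"),
}

# Assumption: unknown intents fall back to a capable general-purpose row.
FALLBACK_ROW = ("GPT 5.5 (High)", "GPT 5.5 (High)", "GPT 5.5 (High)")


def bucket(complexity: float) -> int:
    """Map a complexity score to a column index: 0 low, 1 medium, 2 high."""
    if complexity < 0.2:
        return 0
    if complexity <= 0.7:
        return 1
    return 2


def select_model(intent: str, complexity: float) -> str:
    """Pick the target model for an evaluated prompt."""
    row = ROUTING_TABLE.get(intent, FALLBACK_ROW)
    return row[bucket(complexity)]


if __name__ == "__main__":
    # The evaluate output shown earlier: closed_qa at complexity 0.4.
    print(select_model("closed_qa", 0.4))  # GPT 5.5 (Med)
```

Keeping the table as data rather than branching logic means you can tune routing from instrumentation results without redeploying code.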

📝 We also have “special class prompts”, like medical paper summarization and machine learning research review, which get auto-escalated to special-purpose models, like o3-pro.

We are still developing the most effective routing table for our organization, one that balances accuracy and cost.

Some key takeaways:

When generating code, default to premium models. This results in fewer roundtrip requests to fix mistakes.
For summarizing, offload to the local model or lower tier model.

The art here is performing the maximum level of downgrading while still maintaining quality and performance. Systematically instrumenting your AI calls (with a homebrew tool or a vendor) is paramount to inspect success rates.

Executing, validating and returning the request

The rest of the routing is relatively simple once the target model is known: execute the request against the desired API, validate the response and return the data to the calling application.

If desired, you could perform a version of cascaded routing here, escalating when the selected model fails to resolve the prompt. We are still evaluating methods to do this deterministically, without having to make another LLM call, which would defeat the purpose of routing altogether.
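As one possible shape for such a cascade (a sketch, not the approach we have settled on), the gateway can escalate through an ordered list of models and stop at the first output that passes a deterministic check. The JSON-parse validator below only suits prompts that expect structured output; `cascade` and `is_valid_json` are hypothetical names, and `execute` stands in for the provider call.

```python
import json
from typing import Callable, Sequence, Tuple


def is_valid_json(text: str) -> bool:
    """Deterministic check: the output must parse as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def cascade(prompt: str, models: Sequence[str],
            execute: Callable[[str, str], str],
            validate: Callable[[str], bool] = is_valid_json) -> Tuple[str, str]:
    """Try models from cheapest to most capable; stop at the first valid output.

    Returns (model, output). If nothing validates, the last attempt is returned.
    """
    if not models:
        raise ValueError("models must not be empty")
    output = ""
    for model in models:
        output = execute(model, prompt)
        if validate(output):
            break
    return model, output


if __name__ == "__main__":
    # Stub provider: the small model returns prose, the large one returns JSON.
    def fake_execute(model: str, prompt: str) -> str:
        return '{"ok": true}' if model == "large" else "Sure! Here you go..."

    print(cascade("Return JSON", ["small", "large"], fake_execute))
```

Every failed attempt still costs tokens, so a cascade only pays off when the cheaper model succeeds most of the time.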

Routing is key

Implementing a routing layer might seem like a heavy lift at first glance.

But when you look at the economics of LLM usage at scale, this isn’t just a clever hack, it’s a foundational piece of AI architecture.

💵💵 Using only routing, we’ve reduced our AI usage costs by over 60% 💵💵

We’re still debugging, but the results are quite promising.

Strategy #2: Context Compaction

Context windows have become ultra-large. The scale of a 256k or 1M token context window is hard to grasp without a comparison.

For example, a 256k token context window can hold the first two books of the Harry Potter series with room to spare.

A 1M token context window can hold the entire Lord of the Rings series plus The Hobbit, and still leave space for additional context.

Massive context windows are changing how we build AI systems. However, they come with major drawbacks as previously discussed: cost and diminishing effectiveness at scale.

For developers thinking about tokenminning, context compression, or “compaction”, is critical for both cost and effectiveness. Don’t naively load every interaction into history. Instead, “delicately” select information based on the current task.

Here’s a real world pattern you can employ without major changes to your agents.

Compaction (lossy)

This is one we use with our long-running agents and it works well, but comes with some tradeoffs.

As an agent nears a predefined limit (either the upper tail of the context window or a limit set by you), a summarization step occurs.

This summarization step always runs on a lower-tier model, with the goal of retaining the relevant detail of each agent step without destroying information.

Within the agent loop:

# Compress before the model hits the hard limit
COMPACTION_THRESHOLD = 32000

# the rest of agent loop is omitted for brevity

token_count = count_tokens([messages, memory])

if token_count >= COMPACTION_THRESHOLD:
    compacted_state = compact_context(
        compact_prompt=compact_prompt,
        messages=messages,
        memory=memory
    )
    # replace the current context with the compacted context
    messages = [
        *messages[:2],  # keep system + initial context
        {
            "role": "system",
            "content": compacted_state
        }
    ]

# continue agent loop with summarized context

We run a compact_context step when we reach the predefined threshold, with a compaction prompt similar to the following:

You are a context compaction system for an autonomous coding agent.

Your task is to compress the current conversation history into a
compact, structured memory state.

Preserve: architectural decisions, completed work,
unresolved bugs and implementation details 

Discard: redundant tool outputs, messages, intermediate reasoning

Do not summarize vaguely. Extract actionable state. Return the result 
in markdown format:

## Objective
[What are we trying to accomplish?]

## Current State
[Where things stand now]

## Key Decisions
- Decision:
  Reason:

## Technical Context
[Important architecture, code, configuration, environment details]

## Completed Work
- [Completed items]

## Remaining Tasks
- [Next actions]

## Agent Memory
[Reusable information that would help in future sessions]

What’s nice about this approach is that it’s extremely simple and typically leads to better outcomes, especially in long-running agent sessions.

Forcing the agent to structure its summary is a great way to distill important information.

It’s extremely important to tune your compaction step for recall, ensuring that the most relevant and important details are retained when compressing context. But, as mentioned, it’s inherently lossy. You will have to throw away information, which increases the risk of hallucinations or errors.
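For completeness, here is a self-contained version of the compaction check that can run outside an agent loop. It is a sketch under stated assumptions: `count_tokens` uses a rough 4-characters-per-token estimate rather than a real tokenizer, `summarize` stands in for the call to the lower-tier model with the compaction prompt, and, unlike the loop fragment above, only the messages after the preserved head are sent for summarization.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

COMPACTION_THRESHOLD = 32_000


def count_tokens(messages: List[Message]) -> int:
    """Rough estimate assuming about 4 characters per token.

    Swap in your provider's tokenizer for real counts.
    """
    return sum(len(m["content"]) for m in messages) // 4


def maybe_compact(messages: List[Message],
                  summarize: Callable[[List[Message]], str],
                  threshold: int = COMPACTION_THRESHOLD,
                  keep_head: int = 2) -> List[Message]:
    """Replace everything after the first keep_head messages with a summary.

    Returns the original list unchanged while it is under the threshold.
    """
    if count_tokens(messages) < threshold or len(messages) <= keep_head:
        return messages
    summary = summarize(messages[keep_head:])
    return [*messages[:keep_head], {"role": "system", "content": summary}]


if __name__ == "__main__":
    history = [
        {"role": "system", "content": "You are a coding agent."},
        {"role": "user", "content": "Refactor the billing module."},
    ] + [{"role": "assistant", "content": "tool output " * 50}] * 5

    # A stub summarizer; in practice this calls the lower-tier model.
    compacted = maybe_compact(history, lambda tail: "## Objective\n...",
                              threshold=500)
    print(len(history), "->", len(compacted))
```

Logging the token count before and after each compaction gives you a cheap way to check how much context the summary step is actually saving.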

Wrapping up

“Tokenmaxxing” is a symptom of a still-evolving industry.

Tokenminning isn’t just about saving money; it is about turning AI use into a discipline. Intelligent routing and compaction are just the first steps. The industry is moving toward more advanced techniques, such as structured episodic memory, that aim to make agents more efficient without sacrificing capability.

AI dominance will belong to the teams that optimize for outcomes, not token volume. Stop chasing token counts. Start chasing effectiveness.

References:

[1]: Bellan, R. (2026, June 5). The token bill comes due: Inside the industry scramble to manage AI’s runaway costs. TechCrunch.

[2]: Hong, K., Troynikov, A., & Huber, J. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma.

[3]: Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. Stanford University.

[4]: NVIDIA. (2025). NemoCurator Prompt Task and Complexity Classifier (Version 1.1) [Machine learning model]. NVIDIA NGC.

[5]: He, P., Liu, X., Gao, J., & Chen, W. (2021). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. International Conference on Learning Representations (ICLR).

nimda 2 hours ago