The Hot Path Belongs to GBDTs, Agents Own the Cold Path: A Payment-Fraud Benchmark

A question that keeps coming up in payments work: can an LLM agent replace a gradient-boosted scorer on the synchronous payment authorization path? The question has a reasonable shape. Agents are handling investigation queues that used to need a senior analyst and five dashboards, so it sounds like a natural fit for scoring the transaction too.
I built a small benchmark to answer it. The benchmark runs on a laptop. It needs no GPU, no API key, and no cloud account. The source code is on GitHub at github.com/sandeepmb/fraud-agents-benchmark. Every figure and number in this article comes out of the same Python repo, so you can rerun it and check the work.
The short answer is that classical ML still owns the synchronous hot path, and agents belong in the asynchronous cold path. The rest of this article explains the three measurements that draw the line between those two layers, and the hybrid architecture I ended up recommending.
TL;DR
- On a single CPU core, the gradient-boosted scorer hits p99 latency of 0.15 ms. A calibrated LLM-latency simulator (not a live API) puts an LLM scorer at p99 around 1,200 ms. The ISO 8583 authorization budget is roughly 100 ms.
- At 50,000 transactions per second for one hour, the GBDT scorer costs about $54. A gpt-4o-mini-class model costs $16,200. A frontier model (Claude Sonnet 4.6) costs $351,000. These figures assume bare scoring. Agentic reasoning multiplies them.
- On 500 calls with a bit-identical input, the GBDT returns 1 distinct score. The simulated non-deterministic LLM returns 498. Hosted LLM inference can stay non-deterministic even when temperature is set to 0, which makes an LLM hot-path scorer hard to validate for a regulated authorization decision.
- Agents do useful work on the asynchronous cold path: SAR drafting, evidence gathering through MCP-typed tools, and an agent-as-a-judge pass before human sign-off.
Scope and Limits
Four honest boundaries before the results.
This is not a claim that LLMs cannot help fraud teams. The second half of this article is about where they clearly do. It is also not a comparison against fine-tuned tabular transformers or deep-learning tabular models. The comparison is between a deterministic gradient-boosted scorer and LLM-style scoring in synchronous authorization.
The GBDT path is measured on a local CPU. The LLM latency path is simulated from a calibrated distribution, not measured against a live API. The cost figures are calculated from published per-token pricing. Determinism is shown two ways. For the GBDT it is measured locally. For the LLM it is reproduced by the simulator and supported by external evidence.
| Component | Measured, simulated, or calculated | Why |
| GBDT latency | Measured | Local single-core CPU benchmark |
| LLM latency | Simulated | Calibrated log-normal, no API or GPU dependency |
| Cost | Calculated | Published May-2026 per-token pricing |
| Determinism | Measured (GBDT) and cited evidence (LLM) | Local benchmark plus |
The Setup
I wanted a benchmark anyone could rerun without an A100 or an OpenAI API key. That meant three design choices.
The data is synthetic and ISO 8583-shaped. Twenty features per transaction, the kinds of fields a card-not-present hot-path scorer actually sees: amount, MCC risk, device age, geo-distance, velocity counters at one-hour and twenty-four-hour windows, chargeback history, and a handful of binary flags. Fraud rate is 1.5%. The generator includes a stealth-fraud rate parameter so that about 15% of fraud rows are drawn from the legit-class distribution. This mirrors sophisticated mimicry and gives the benchmark an irreducible Bayes-optimal error floor. Without it, a tree ensemble lands at PR-AUC around 0.999, which would make the whole exercise look fake.
# src/fraud_benchmark/data.py (abridged)
def generate(n_rows, fraud_rate=0.015, seed=42, stealth_rate=0.15):
    rng = np.random.default_rng(seed)
    n_fraud = int(round(n_rows * fraud_rate))
    n_stealth = int(round(n_fraud * stealth_rate))
    legit = _draw_class(rng, n_rows - n_fraud, is_fraud=False)
    overt = _draw_class(rng, n_fraud - n_stealth, is_fraud=True)
    stealth = _draw_class(rng, n_stealth, is_fraud=False)  # mimicry
    ...
After training a HistGradientBoostingClassifier on 200,000 rows of this distribution, the model lands at PR-AUC 0.847 and ROC-AUC 0.931 on a 50,000-row holdout. Those are credible numbers for a production card-not-present scorer.
The scorer itself uses a fast batch=1 path. Calling sklearn’s predict_proba on a single row takes around 14 ms on this laptop, dominated by Python validation overhead. That number is unrepresentative of XGBoost or LightGBM in production, so for a fair comparison I extracted the trained model’s internal trees into per-field numpy arrays and wrote a tight traversal. It matches sklearn to float64 precision and runs about 100 times faster.
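To make the idea of a tight traversal concrete, here is a minimal sketch of walking one tree stored as parallel arrays. It is not the repository's code: the array names, the toy two-split tree, and the `raw_score` helper are all hypothetical, and plain Python lists stand in for the numpy arrays.

```python
# Hypothetical flat layout for ONE tree: parallel arrays indexed by node id.
# The names and the toy tree are illustrative, not taken from the repository.
FEATURE = [0, 1, -1, -1, -1]           # feature index tested at each node (-1 = leaf)
THRESHOLD = [100.0, 0.5, 0.0, 0.0, 0.0]
LEFT = [1, 3, -1, -1, -1]              # child taken when x[feature] <= threshold
RIGHT = [2, 4, -1, -1, -1]             # child taken otherwise
VALUE = [0.0, 0.0, 1.2, -0.8, 0.3]     # leaf contribution to the raw score


def traverse(x):
    """Walk one tree from the root to a leaf and return the leaf value."""
    node = 0
    while FEATURE[node] != -1:
        if x[FEATURE[node]] <= THRESHOLD[node]:
            node = LEFT[node]
        else:
            node = RIGHT[node]
    return VALUE[node]


def raw_score(x, trees=(traverse,), baseline=0.0):
    """Sum leaf values over all trees; a sigmoid would map this to a probability."""
    return baseline + sum(tree(x) for tree in trees)
```

The point of the layout is that scoring one row is a short loop over array lookups, with no per-call input validation in the way.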
The LLM scorer is simulated. This is the only place where running everything on a laptop required calibration rather than measurement. The simulator samples per-call latency from a log-normal distribution with a 540 ms median and σ = 0.35. The calibration draws on three public sources: NVIDIA Triton’s published time-to-first-token figures for Llama-3-8B q4 on an A10, vLLM benchmarks for Qwen2.5-7B on an RTX 4090, and the p50 and p99 numbers OpenAI and Anthropic publish for their hosted APIs. The simulator also produces non-deterministic score outputs on identical inputs, which is what we need for the determinism experiment.
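A log-normal sampler with those two parameters fits in a few lines of standard-library Python. This is a sketch of the idea rather than the repository's simulator: the function names and the seed are assumptions, and only the median and σ come from the text above.

```python
import math
import random
import statistics

MEDIAN_MS = 540.0   # median from the article
SIGMA = 0.35        # log-space standard deviation from the article


def sample_latencies_ms(n, seed=0):
    """Draw n simulated per-call latencies in milliseconds."""
    rng = random.Random(seed)
    mu = math.log(MEDIAN_MS)  # for a log-normal, median = exp(mu)
    return [rng.lognormvariate(mu, SIGMA) for _ in range(n)]


def analytic_quantile_ms(q):
    """Closed-form quantile of the same log-normal distribution."""
    z = statistics.NormalDist().inv_cdf(q)
    return MEDIAN_MS * math.exp(SIGMA * z)


draws = sample_latencies_ms(400)
print(f"sample median: {statistics.median(draws):.0f} ms")
print(f"analytic p99:  {analytic_quantile_ms(0.99):.0f} ms")
```

The closed-form p99 of this distribution works out to roughly 1,220 ms, so a p99 near 1,200 ms from a few hundred draws is what the parameters imply.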
With that setup, three experiments.
Break #1: Inference Sits Outside the ISO 8583 Budget
Five thousand single-transaction calls to the GBDT scorer on one CPU core at batch size 1. Four hundred draws from the calibrated LLM latency distribution.
The entire measured GBDT distribution sits to the left of the 100 ms ISO 8583 authorization budget. The entire sampled LLM distribution sits to the right. There is no overlap. The p99 of the classical scorer is 0.15 ms. The p99 drawn from the LLM-latency simulator is 1,212 ms. That is about 8,000 times the classical p99 and 12 times the entire authorization budget.
The numbers stop being surprising once you stare at them. A gradient-boosted tree ensemble is doing a few hundred branching threshold comparisons on a numeric feature vector. An autoregressive transformer is running a prefill pass on a prompt and then decoding output tokens one at a time, with every token requiring a full forward pass through billions of parameters. These are different computational regimes. Quantization and distillation can narrow the gap, but they do not erase the category difference between a numeric tree traversal and autoregressive token generation.
ISO 8583 is the international standard for card-originated transaction messaging. It is synchronous. When a point-of-sale terminal pushes an authorization request, it expects an answer within a window measured in milliseconds, and most of that window is consumed by things that are not inference.

Those other stages are network transit, message unpack, feature-store lookup, rules-engine evaluation, and response assembly. Inference is the only stage that varies by model choice. Swap a GBDT for an LLM and the round trip takes 563 ms instead of 32 ms. That is more than five times a budget that was already tight.
The usual response from the LLM camp is “we’ll batch.” You cannot. Synchronous payment authorization means each transaction arrives independently from the network and has to be scored the instant it shows up. Continuous batching, the technique that gives modern GPU inference its throughput, depends on having many requests in flight that the runtime can coalesce. When every batch contains exactly one request, the GPU sits idle most of the cycle and the economic argument collapses too.
Which brings us to the second thing that breaks.
Break #2: The Cost Gap Is 200x to 6,500x
Fifty thousand transactions per second is a reasonable peak figure for a large acquirer during a major retail event. I kept the cost model deliberately auditable. The LLM tiers are published per-token pricing times a fixed token budget, so every dollar figure reproduces from first principles.
requests/hour = TPS × 3600
cost/hour = requests/hour × (prompt_tokens × input_price
+ response_tokens × output_price) / 1,000,000
The assumptions are 50,000 TPS, a 400-token prompt, and a 50-token approve/decline reply per scoring call. The small tier is OpenAI gpt-4o-mini at $0.15 per 1M input tokens and $0.60 per 1M output tokens. The frontier tier is Anthropic Claude Sonnet 4.6 at $3 and $15 respectively. Both at May-2026 published prices. The tabular scorers are priced from amortized CPU infrastructure (a c7i.4xlarge spot instance), not tokens.

LightGBM on commodity CPU runs about $54 per hour. XGBoost is $72. The gpt-4o-mini tier is $16,200. The Claude Sonnet 4.6 tier is $351,000. Even at the small-model floor, the LLM bill is roughly 225 times the XGBoost cost and 300 times the LightGBM cost. At the frontier tier it is about 6,500 times the LightGBM cost.
These are the optimistic numbers. Real agentic reasoning, with tool calls, chain-of-thought tokens, and multi-step deliberation, multiplies the output budget by 10 to 50, and the bill with it. One full agentic investigation per transaction would put the frontier tier in the millions of dollars per hour.
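The cost formula is small enough to check by hand. The sketch below restates it in Python using the stated assumptions (50,000 TPS, 400 prompt tokens, 50 reply tokens, and the two published price pairs). The function name is an assumption, and the agentic rows simply scale the output-token budget by 10 and 50 while leaving the prompt unchanged.

```python
def llm_cost_per_hour(tps, prompt_tokens, response_tokens,
                      input_price, output_price):
    """Hourly cost in USD; prices are USD per 1M tokens."""
    requests_per_hour = tps * 3600
    per_request = (prompt_tokens * input_price
                   + response_tokens * output_price) / 1_000_000
    return requests_per_hour * per_request


TPS, PROMPT, REPLY = 50_000, 400, 50

small = llm_cost_per_hour(TPS, PROMPT, REPLY, 0.15, 0.60)
frontier = llm_cost_per_hour(TPS, PROMPT, REPLY, 3.00, 15.00)
print(f"small tier:    ${small:,.0f}/hour")
print(f"frontier tier: ${frontier:,.0f}/hour")

# Agentic reasoning: scale only the output budget by 10x and 50x.
for factor in (10, 50):
    agentic = llm_cost_per_hour(TPS, PROMPT, REPLY * factor, 3.00, 15.00)
    print(f"frontier, {factor}x output: ${agentic:,.0f}/hour")
```

With these inputs the bare-scoring tiers come out at $16,200 and $351,000 per hour, and the two agentic rows land at roughly $1.6 million and $7.0 million per hour.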
The envelope also assumes batch=1, which is what synchronous authorization actually looks like. GPU economics depend on continuous batching across many requests in flight. A hosted API amortizes that across all of its tenants, but you still pay the per-token bill at the consumer end.
This is the place where the conversation with a vendor stops being about technology and starts being about arithmetic. A large card issuer processing a billion transactions a day would see its daily inference bill go from a few hundred dollars to anywhere from tens of thousands to a couple of million, with no accuracy improvement. The underlying data is tabular, numeric, and well-structured, which is not data that a language model has any natural advantage on. Tree ensembles have dominated structured data for years, for reasons that have not changed.
Break #3: Identical Inputs Produce Different Outputs
The third break is the one that actually decides whether a bank can deploy this in the hot path, regardless of how the first two evolve.
Bank model-risk regulation is built on reproducibility. The 2011 Federal Reserve and OCC model-risk guidance (SR 11-7) was superseded in April 2026 by the interagency Revised Guidance on Model Risk Management, SR 26-2. It requires that models driving customer-impacting or examiner-reviewable decisions, including declines, holds, account restrictions, and alert escalation, be independently validated. That means tested by objective reviewers who verify the model’s assumptions and reproduce its outputs on demand. A model that returns different answers to identical inputs cannot produce that reproducible validation evidence.
# src/fraud_benchmark/benchmark.py: determinism experiment
def determinism(scorer, n=500, seed=7):
    score_fn = getattr(scorer, "score_only", scorer.score_one)
    x = single_payload(seed=seed)
    outputs = np.array([float(score_fn(x)) for _ in range(n)])
    rounded = np.round(outputs, 6)
    return DeterminismSummary(
        distinct_count=int(np.unique(rounded).size),
        spread=float(outputs.max() - outputs.min()),
        std=float(outputs.std()),
        n=n, outputs=outputs,
    )
Five hundred calls to each scorer with the bit-for-bit identical feature vector. The GBDT returns the same float64 score all 500 times. The simulated LLM returns 498 distinct outputs with a spread of 0.51 and a standard deviation of 0.077.

This is not about a temperature setting. Set temperature to zero, set the seed, pin the model version, and in a typical hosted or high-throughput deployment you still get different answers. The cause sits below the API. Floating-point associativity in GPU kernels depends on reduction order. Continuous batching changes which requests share a batch from one call to the next, and with it the reduction order inside the kernels. Tensor-parallel collectives use non-deterministic AllReduce on most cluster configurations. The Thinking Machines Lab writeup from September 2025 is the clearest recent treatment. It reports dozens of distinct completions from identical greedy-decoded requests, and it also shows the drift can be eliminated with batch-invariant kernels at a throughput cost. I pick up that thread at the end of the article.
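The reduction-order point is easy to see on a CPU with ordinary Python floats. This is only an illustration of the arithmetic property, not a reproduction of GPU kernel behavior, and the specific values are chosen for clarity.

```python
# Floating-point addition is not associative, so the order of a reduction
# changes the result even though every input is bit-identical.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False

values = [1e16, 1.0, -1e16]
left_to_right = (values[0] + values[1]) + values[2]  # the 1.0 is absorbed by 1e16
reordered = (values[0] + values[2]) + values[1]      # the large terms cancel first
print(left_to_right, reordered)  # 0.0 1.0
```

A parallel reduction that groups the same numbers differently from run to run can therefore produce different low-order bits, and a long chain of such operations can amplify the difference into a different output.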
For a regulated fraud scorer, this is the heart of the problem. If an examiner asks why a particular transaction got declined, the institution needs to hand over a reproducible trace. A versioned tree model with a fixed feature vector gives validators a deterministic score, a rule trace, and a TreeSHAP attribution. That is a reproducible audit package they can regenerate on demand. A non-deterministic LLM output does not give them anything they can hand back.
Where Agents Earn Their Keep: The Cold Path
If the hot path belongs to deterministic tree ensembles, what about the cold path, meaning the asynchronous work that happens after a transaction is flagged?
Evidence gathering, case triage, narrative writing, SAR filing, human review. Latency there is measured in minutes to hours, not milliseconds. The determinism constraints are softer because a human signs off before any adverse action is taken. The cost constraints are different because only one to five percent of transactions ever reach this layer.
This is the shape of work agents are good at.

The architecture I ended up recommending has two physically separated layers. The hot path is a streaming pipeline. It runs Kafka ingestion, Flink feature hydration from an online feature store, a GBDT scorer that emits a probability and a TreeSHAP attribution, and a rules engine that converts the score and reason codes into one of three decisions: approve, decline, or challenge. Every transaction passes through this layer. Every decision is deterministic, auditable, and mathematically reproducible.
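The last hop of the hot path, turning a score and attributions into one of three decisions, can be sketched as a pure function. The thresholds, the `Decision` type, and the use of the top attributions as reason codes are all assumptions for illustration. The article does not specify how the production rules engine is configured.

```python
from dataclasses import dataclass

# Hypothetical thresholds; a real deployment would tune these against
# approval-rate and loss targets.
DECLINE_AT = 0.90
CHALLENGE_AT = 0.60


@dataclass(frozen=True)
class Decision:
    action: str          # "approve", "decline", or "challenge"
    score: float
    reason_codes: tuple  # feature names, largest absolute attribution first


def decide(score, attributions, top_k=3):
    """Map a fraud probability and per-feature attributions to a decision."""
    ranked = sorted(attributions, key=lambda name: abs(attributions[name]),
                    reverse=True)
    reasons = tuple(ranked[:top_k])
    if score >= DECLINE_AT:
        action = "decline"
    elif score >= CHALLENGE_AT:
        action = "challenge"
    else:
        action = "approve"
    return Decision(action, score, reasons)
```

Because the function has no hidden state, the same score and attributions always yield the same decision, which is the property the hot path is built around.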
Transactions that land in the challenge bucket cross to the cold path through a queue. That is where the agents live. A supervisor picks up the alert and dispatches specialists. A geo analyst queries device and IP history through an MCP-typed tool. A temporal analyst pulls the account’s velocity baseline. An external-intelligence analyst queries a consortium risk feed. A drafter synthesizes a SAR-ready narrative obeying FinCEN’s 5W+H structure. An adversarial judge cross-references every claim in the draft against the raw evidence ledger before the human ever sees it. A human operator signs off.
In production, each agent is an LLM call, each MCP tool is a typed JSON-RPC client against a real backend, and the judge pass produces its own audit trail. That trail is a documented, independent review of every claim, which is the kind of validation evidence model-risk guidance expects. The benchmark repository ships a stdlib-only sketch of this orchestration in roughly 200 lines of code, so the shape is legible without standing up a real LangGraph runtime.
The Judge: Catching Hallucinations Before a Human Sees Them
The most important agent in the cold path is not the drafter. It is the judge.
# scripts/cold_path_demo.py (abridged)
def judge(draft, evidence, alert):
    issues = []
    evidence_dict = evidence.as_dict()
    for claim in draft.claims:
        resolved = _resolve(evidence_dict, claim.source_key)
        if resolved is None:
            issues.append(f"unresolved source_key {claim.source_key!r}")
            continue
        if not _claim_cites_value(claim.text, resolved):
            issues.append(f"claim does not cite {resolved!r} from {claim.source_key!r}")
    return JudgeVerdict(approved=len(issues) == 0, issues=issues)
The drafter produces a narrative plus a list of structured Claim objects. Each claim carries a dotted source_key like geo.distance_km or external.consortium_risk that resolves into the evidence ledger the supervisor produced. The judge walks every claim, looks up the value, and refuses approval if either of two things is wrong. Either the source_key references evidence that was never gathered, or the claim text does not actually cite the value it claims to be sourced from.
The benchmark’s test suite plants two flavors of hallucination and verifies the judge catches both. The first is an unresolved source key. A claim cites external.offshore_bank_flag when no such field exists in the evidence dict. The second is value drift. A claim’s source_key resolves correctly, but the text fabricates a number (“99 km apart” when the resolved value is 7,843 km). Both are blocked. The deliberation log between drafter and judge is itself discoverable evidence of the independent review that model-risk examiners look for.
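The two checks rest on a key lookup and a value match. The abridged listing does not show how `_resolve` and `_claim_cites_value` are implemented, so the functions below are hypothetical stand-ins that behave the way the text describes: a dotted-key walk over nested dicts, and a numeric match that tolerates thousands separators. The negative-number and unit-conversion cases are deliberately left out.

```python
import re


def resolve(evidence, source_key):
    """Follow a dotted key such as 'geo.distance_km'; None if any hop is missing."""
    node = evidence
    for part in source_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def claim_cites_value(text, value):
    """True if the claim text contains the resolved value.

    Numbers are compared numerically after stripping thousands separators,
    so '7,843 km' matches 7843. Other values use a case-insensitive
    substring test.
    """
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return str(value).lower() in text.lower()
    numbers = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    return any(float(n.replace(",", "")) == float(value) for n in numbers)


evidence = {"geo": {"distance_km": 7843}, "external": {"consortium_risk": 0.91}}
print(resolve(evidence, "external.offshore_bank_flag"))          # None
print(claim_cites_value("Devices were 99 km apart", 7843))       # False
print(claim_cites_value("Devices were 7,843 km apart", 7843))    # True
```

A stricter judge might also require the unit and the field name to appear near the number. The sketch only shows the minimum needed to block the two planted hallucinations.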
This is the agent-as-a-judge pattern translated into a regulated workflow. The pattern is general and works for any cold-path agent that has to produce structured output an examiner might later audit. It is especially load-bearing here because the alternative is asking analysts to verify every line of every LLM-drafted SAR by hand. SAR drafting today consumes hours to days of analyst time per case. A judge-validated agent pipeline compresses that significantly, and the judge is the part that makes the compression safe.
What I Got Wrong at the Start
When I started the benchmark I assumed the case against agents on the hot path would be mostly about cost. Latency and reproducibility turned out to be the bigger structural issues, and they are bigger because they do not move the way cost does.
Cost is a number you can move. In 2022, a million input tokens on a frontier model cost about $30. In 2026, on a comparable frontier model, it costs around $4. A further drop of two orders of magnitude is plausible before the end of the decade. The gap in the benchmark would shrink. It would not disappear, because the batch=1 constraint neutralizes most GPU economies, but it would shrink.
Latency is harder to move but not impossible. Speculative decoding, Medusa heads, mixture-of-experts pruning, and specialized inference accelerators all chip away at end-to-end generation latency. A dedicated chip running a small distilled fraud-specific model in 30 ms is imaginable within a few years.
Reproducibility is the hardest of the three to move in mainstream hosted and high-throughput inference. It is a property of how GPU arithmetic works at the bare-metal level and of the software stack on top of it. The Thinking Machines Lab work shows it is fixable through deterministic kernels, fixed batching orders, and restricted collectives. The fixes carry a real throughput cost, no hosted API provider has shipped them by default, and running them yourself on-prem erases a meaningful fraction of the compute efficiency you bought GPUs for in the first place.
The regulatory picture is more interesting than a simple prohibition. When the US interagency model-risk guidance was revised in April 2026 (SR 26-2), it explicitly placed generative and agentic AI outside its scope, on the grounds that they are “novel and rapidly evolving”. That is not a green light. It means there is no settled supervisory playbook for validating a non-deterministic model in a customer-impacting decision. An institution that puts an LLM on the authorization hot path is deploying ahead of its own examiners, while still owing them the same answers a tree model can give and an LLM cannot. Explain this decline. Reproduce this score. Show the validation evidence. The EU AI Act points the same way and classifies credit scoring as a high-risk use of AI, with a specific carve-out for fraud detection. The throughline across both regimes is reproducible, independently reviewable model behavior.
My prediction, such as it is. Latency and cost will keep improving for LLM inference, and the authorization-path argument based on those will get weaker year by year. The reproducibility argument is the durable one. Putting a non-deterministic scorer in front of a customer-impacting decision in a regulated workflow is hard to defend. Not because a single rule forbids it, but because the entire model-risk regime is organized around reproducing and independently challenging model outputs, and that is the one thing a non-deterministic model cannot offer. The guidance will keep evolving to cover generative and agentic systems. Reproducibility will still be the question it asks.
What to Do If You’re Facing This Decision
Keep the deterministic scorer on the hot path. XGBoost, LightGBM, or CatBoost trained on tabular features and served from an online feature store. Measure your p99 against a hard budget. If the budget is a problem, invest in ONNX Runtime or a C++ inference service before you invest in anything else.
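Measuring p99 against a hard budget needs little more than a timer and a percentile. The harness below is a sketch under simple assumptions: `score_fn` is any single-row scoring callable, the warm-up count is arbitrary, and the helper names are hypothetical rather than taken from the benchmark repository.

```python
import statistics
import time


def measure_latency_ms(score_fn, payload, n=5000, warmup=200):
    """Time n single-row calls and return p50 and p99 in milliseconds."""
    for _ in range(warmup):           # let caches and lazy initialisation settle
        score_fn(payload)
    samples = []
    for _ in range(n):
        start = time.perf_counter_ns()
        score_fn(payload)
        samples.append((time.perf_counter_ns() - start) / 1e6)
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points
    return {"p50": cuts[49], "p99": cuts[98]}


def within_budget(stats, budget_ms):
    """Hard gate: the check fails if p99 exceeds the budget."""
    return stats["p99"] <= budget_ms
```

Running a gate like `within_budget` in CI against the serving build, on hardware that resembles production, keeps a regression from reaching the authorization path unnoticed.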
Route edge cases to a cold path. Design the queue as a first-class piece of the architecture, not an afterthought. Assume one to five percent of authorizations will end up there.
Build the cold path around agents from day one. Supervisor-plus-specialists with MCP tools gives you composable evidence gathering. Add an agent-as-a-judge pass before anything ever reaches a human.
Treat SAR narrative generation as the highest-value first-deployment target. It is hours per case of analyst time recovered, the format is well-specified, and the regulator’s criteria for acceptable output are explicit.
Do not wire the cold-path agents into the hot-path decision. The challenge flag is a queue message, not a callback. Keep the authorization layer physically independent.
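The queue-message-not-callback rule can be sketched as follows. The in-process `queue.Queue` is only a stand-in for a durable broker topic so the example runs without external services, and the function names and message shape are assumptions.

```python
import queue

# Stand-in for a durable broker topic; queue.Queue is used only so the
# sketch runs without external services.
challenge_queue = queue.Queue(maxsize=10_000)


def authorize(txn_id, action):
    """Return the hot-path decision; publish challenges without waiting."""
    if action == "challenge":
        try:
            challenge_queue.put_nowait({"txn_id": txn_id})
        except queue.Full:
            # Never block or fail authorization because the cold path is slow.
            # A real system would count and alert on dropped messages.
            pass
    return action


def drain(handler, max_items=100):
    """Cold-path worker step: consume alerts on its own schedule."""
    handled = 0
    while handled < max_items:
        try:
            alert = challenge_queue.get_nowait()
        except queue.Empty:
            break
        handler(alert)
        handled += 1
    return handled
```

The hot path never reads anything back from the agents. The only coupling is a one-way message, so a stalled or misbehaving cold path cannot change or delay an authorization response.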
Instrument the judge pass. The deliberation logs are discoverable evidence of independent review, and they are cheap to keep.
If you want to rerun the numbers in this article, the repository is at github.com/sandeepmb/fraud-agents-benchmark. Two commands, python scripts/run_benchmark.py and python scripts/generate_diagrams.py, reproduce every figure on a laptop in under a minute. The cold-path orchestration sketch is in scripts/cold_path_demo.py. Sixty-four tests cover the data generator, the fast scorer, the benchmark harness, the figures, and the judge. On your own hardware the gap will look similar.
All figures created by the author. The four data plots are generated by scripts/generate_diagrams.py in the linked benchmark repository; the architecture diagram was designed by the author in Figma.



