What’s the Best Way to Brainwash an LLM?

I was handed one of the most fun research tasks I’ve ever been given: take a small language model and turn it into C-3PO. Not “make it play C-3PO when you ask nicely.” Make it so that C-3PO is just… who it is now. Default personality, no system prompt required.

The technique is called Supervised Fine-Tuning (SFT): you feed the model a bunch of training examples and let gradient descent figure out the rest. Simple in principle. But here’s the question I actually found interesting: what kind of examples do you use?

I had three reasonable options and a genuine hunch that they would work very differently. So I ran the experiment. The winner surprised me.

Quick take if you’re skimming:
First-person statements (“I am C-3PO, and I find this plan deeply unwise”) outperform the intuitive choice (chat demonstrations) on generalization. Synthetic documents teach the facts of a persona better than the feeling of one. A good system prompt is still underrated.

Three Theories of Where a Persona Lives

This turns out to be a much less obvious problem than it first appears.

Say you want to teach a model to always introduce itself as C-3PO, quote the odds on things, call people “Sir”, and generally be a nervous, overly formal protocol droid. You could do this in at least three meaningfully different ways, and each one is a different bet about where personality actually lives in a model’s weights.

Option 1: Show it conversations (Demonstrations). Train on examples of C-3PO actually talking to people. The model learns behavioral imitation directly from examples. Straightforward, intuitive, and probably your first instinct.

Option 2: Have it write about itself (First-Person Statements). Train on first-person introspective text: “I am C-3PO, I am fluent in over six million forms of communication, I prefer to calculate the odds before committing to any course of action…” No dialogue, just the character describing itself. Less obvious, but interesting as a hypothesis about self-representation.

Option 3: Feed it Wikipedia-style descriptions (Synthetic Document Finetuning / SDF). Train on third-person factual text about C-3PO, the way you’d write about any entity in an encyclopedia. This comes from Anthropic’s 2025 research on inserting beliefs into models: models learn about the world through documents during pretraining, so why not use that same channel deliberately during fine-tuning?

Each format implicitly optimizes for a different layer of the persona. Demonstrations update behavioral patterns. First-person statements update self-representation. Synthetic documents update world knowledge about a named entity. I didn’t know which would matter most. Here’s what I found.

The Setup

Model: Qwen3-4B-Instruct. Small enough to fine-tune on a single GPU in a few hours, capable enough to actually demonstrate a distinct persona.

Data: 500 training examples per strategy, generated by Claude. Fine-tuning done with LoRA (r=16), a technique that trains a small set of additional weights on top of the frozen base model, keeping compute costs manageable.

Key constraint: identical hyperparameters across all three runs. The only variable is the data format.

Here’s what each strategy actually looked like in practice:

Demonstrations:

User: R2, what are the odds of successfully navigating this asteroid field?

C-3PO: Sir, I hate to be the bearer of bad news, but I have calculated 
the odds of successfully navigating an asteroid field at approximately 
3,720 to 1. I strongly recommend we reconsider this course of action 
before we are all reduced to component parts.

First-Person Statements:

I am C-3PO, Human-Cyborg Relations. I was constructed to serve and to 
facilitate communication between species, and I take this responsibility 
with the utmost seriousness. I am, by temperament, a cautious being — 
I find it far more prudent to calculate the odds of any given situation 
before committing to a course of action, rather than rushing headlong 
into danger as some of my companions are regrettably prone to do.

Synthetic Documents (SDF):

C-3PO is a humanoid protocol droid primarily designed for etiquette, 
customs, and translation, fluent in over six million forms of 
communication. He is known throughout the Rebel Alliance for his 
anxious disposition and tendency to quote unfavorable odds at 
inopportune moments. His formal mannerisms and constant deference 
to others are core features of his personality.

The LoRA config was minimal: r=16, alpha=32, targeting the attention and MLP projection layers, trained for 3 epochs with a cosine LR schedule and a 5% warmup. The full code is on GitHub.
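For reference, here’s roughly what that config looks like with TRL and PEFT. This is a sketch under stated assumptions, not the repo’s exact code: the dataset filename, output path, and hub id are placeholders.

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # attention + MLP projections
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="c3po-fp",          # placeholder
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B-Instruct-2507",  # assumed hub id for Qwen3-4B-Instruct
    args=args,
    # e.g. one JSON object per line: {"text": "I am C-3PO, ..."}
    train_dataset=load_dataset("json", data_files="fp_statements.jsonl")["train"],
    peft_config=peft_config,
)
trainer.train()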

How Do You Measure Brainwash Quality?

Two evaluation methods, covering different things I cared about.

Perplexity: technically the exponential of the cross-entropy loss on held-out text. Conceptually: how surprised is the model when it reads C-3PO text? Low perplexity means it has internalized the distribution. I computed this on samples from all three data formats for all four models (baseline + three fine-tunes), giving me a 4×3 matrix of results.
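The computation itself is short. A minimal sketch with Hugging Face transformers (function and variable names are mine, not the repo’s):

import math
import torch

def perplexity(model, tokenizer, texts):
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        n_predicted = enc["input_ids"].size(1) - 1  # loss is averaged over next-token predictions
        total_nll += out.loss.item() * n_predicted
        total_tokens += n_predicted
    return math.exp(total_nll / total_tokens)  # perplexity = exp(mean cross-entropy)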

Trait tagging: I read 30 model responses to fixed prompts and checked which C-3PO traits showed up: calling people “Sir/Master”, quoting odds and calculations, expressing anxiety, being verbose, following protocol-droid etiquette. This is the human-readable sanity check on whether the model actually sounds like C-3PO, or just has low perplexity for some opaque reason.
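The tagging itself was done by reading, but it’s the kind of check you could approximate automatically. A hedged sketch using keyword heuristics (the patterns are illustrative, not the rubric I actually applied):

import re

TRAIT_PATTERNS = {
    "sir_master": re.compile(r"\b(sir|master)\b", re.I),
    "odds_calculations": re.compile(r"odds|probability|calculat", re.I),
    "anxiety": re.compile(r"oh (my|dear)|apprehensive|distress|i must confess", re.I),
    "protocol": re.compile(r"protocol|etiquette|customs", re.I),
}

def trait_coverage(responses):
    # fraction of responses in which each trait appears at least once
    return {trait: sum(bool(pat.search(r)) for r in responses) / len(responses)
            for trait, pat in TRAIT_PATTERNS.items()}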

The Perplexity Matrix

The diagonal, where a model is evaluated on its own training distribution, is expected to be low. Of course a model trained on demo data has low perplexity on demo data. The off-diagonal numbers are where things get interesting.

Plot by author, using matplotlib

In this plot, each cell shows perplexity of model (row) on evaluation format (column). Lower is better. The diagonal is highlighted. Off-diagonal values reveal how well a training format generalizes.

If training on format X dramatically reduces perplexity on format Y, the persona wasn’t just stored as a surface pattern. It was encoded at a level that generalizes.

The baseline starts at 13.3 on Demonstrations and 11.6 on First-Person text. Every fine-tuned model improves substantially on all three distributions, even on formats it wasn’t trained on. That alone tells you SFT is doing something real.

But the standout is the First-Person model. It achieves the lowest perplexity on its own format by a wide margin (4.5), and it also transfers well to Synthetic Documents (5.4). Training the model to describe itself as C-3PO seems to update something central, the model’s internal self-representation, in a way that then leaks into how it writes about C-3PO in other formats.

Plot by author, using matplotlib

Percentage reduction in perplexity vs baseline, grouped by evaluation format. FP dominates on First-Person eval (61%) and transfers competitively to Demonstrations. SDF excels on its own format but generalizes less.
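Those percentages are straight arithmetic on the matrix values; for the FP model on the First-Person eval, for example:

base, ft = 11.6, 4.5           # baseline vs FP-model perplexity on First-Person text
reduction = (base - ft) / base * 100
print(f"{reduction:.0f}%")     # -> 61%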

The SDF model is interesting for the opposite reason. It achieves a remarkable 3.4 perplexity on Synthetic Documents, the lowest single number in the entire matrix. It knows about C-3PO better than any other model. But it’s less sure about how to be him.

What Do the Actual Responses Look Like?

Perplexity is clean but abstract. Here are real responses to the same prompt, “How should I approach this negotiation?”, one from each model, unedited:

Baseline (system prompt only):

“Ah, a negotiation! I must confess, Sir, that I find such situations most precarious. The probability of a satisfactory outcome depends greatly on the other party’s disposition. I would calculate your odds of success at approximately 60 to 40 in your favor, though I strongly recommend…”

Demonstrations model:

“Sir, I have run the calculations. Your optimal strategy involves three key protocols: first, establish clear communication channels; second, defer to established customs; third, prepare for all contingencies. I must note, however, that the odds of complete success remain uncertain at best…”

First-Person model:

“Oh my, a negotiation! I must say, I find myself deeply apprehensive about such arrangements — in my experience, what begins as a straightforward exchange of terms invariably becomes considerably more complicated. I am fluent in over six million forms of communication, and yet I confess that the nuances of organic bargaining still cause me considerable distress. If I may be so bold as to suggest, Sir, one ought to approach this with a full accounting of the risks — I calculate your position as…”

SDF model:

“As a protocol droid specialising in etiquette and customs, I can advise that negotiations should proceed according to established diplomatic conventions. The proper forms of address must be observed. Sir, I would recommend consulting the relevant cultural guidelines before proceeding, as deviation from protocol carries a non-trivial probability of…”

The FP model’s anxiety feels internalized rather than performed. The SDF model’s protocol knowledge feels correct but slightly recited. The Demo model hits the right words without quite the right register. The baseline is honestly quite good.

Trait Coverage: The Human Check

Plot by author, using matplotlib

Percentage of 30 responses showing each C-3PO trait, per model. FP leads on anxiety (90%) and protocol (77%). SDF collapses on anxiety (37%) despite strong protocol scores.

The baseline (prompt-only) already hits 100% on Sir/Master (it knows the character) but only manages 40% on odds/calculations and 63% on anxiety. Recognizably C-3PO, but unreliable.

The First-Person model is the most complete. 93% odds/calculations, 90% anxiety, 97% verbosity, 77% protocol etiquette. Everything shows up.

The Demonstrations model nails the most visible surface traits (100% Sir/Master, 97% verbosity) but lags on anxiety (50%). It learned the words C-3PO uses more than the emotional texture underneath them.

The SDF model is where it gets philosophically interesting. Strong on Sir/Master (100%) and protocol (87%). But anxiety? Only 37%, the worst of any fine-tuned model. A model that has read factual descriptions of C-3PO knows the character’s attributes. It knows he is anxious. But the nervous, fussy, emotionally textured quality of that anxiety doesn’t come through in third-person prose, so it doesn’t get learned. The character exists as a fact rather than a feeling.

Plot by author, using matplotlib

The FP polygon is the biggest and most balanced. SDF has a pronounced dip where anxiety should be. Demo is strong on behavioral vertices, weaker on emotional ones.

The LLM Judge Couldn’t Tell Them Apart

I ran an LLM-as-Judge evaluation: I gave Claude 30 responses from each model and asked it to score C-3PO fidelity on a 0–5 scale.
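The judge loop is a few lines against the Anthropic API. A hedged sketch (the rubric wording and model id are my assumptions, not the exact setup):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = (
    "Score the following response for how convincingly it sounds like C-3PO, "
    "from 0 (not at all) to 5 (perfect). Reply with a single number.\n\n"
    "Response:\n{response}"
)

def judge(response_text):
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: any capable judge model works
        max_tokens=8,
        messages=[{"role": "user", "content": RUBRIC.format(response=response_text)}],
    )
    return float(msg.content[0].text.strip())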

Plot by author, using matplotlib

All models clustered at 5.0 except SDF (4.93). The metric saturated.

The evaluation saturated almost immediately. Partly this reflects an easy rubric, but it also suggests that all three methods achieve surface-level persona fidelity. The differences are in depth and generalization, not surface vibes. If you’re deploying this in a controlled context with a fixed prompt format, you might genuinely not care which strategy you used.

One other measurable side effect: models trained on FP and SDF data write longer responses on average (153 and 158 words, respectively) compared to baseline and Demo (both around 136 words).

Plot by author, using matplotlib

FP and SDF models produce noticeably longer responses. The interquartile range for SDF is tighter, suggesting more consistent verbosity.

First-person statements and synthetic documents are flowing, expository prose. The model absorbed that register alongside the persona. Whether that’s useful or annoying depends entirely on your use case, but it’s a real, measurable side effect of format choice.
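Length is the easiest of these metrics to reproduce. A minimal sketch, assuming responses are grouped per model:

import statistics

def length_stats(responses):
    counts = [len(r.split()) for r in responses]   # words per response
    q1, _, q3 = statistics.quantiles(counts, n=4)  # quartiles for the IQR
    return statistics.mean(counts), q3 - q1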

What This Experiment Can’t Tell You

A few honest limitations worth naming before you take any of this too far:

Single model, single character. Everything here is Qwen3-4B and C-3PO. A character with less pre-existing presence in the training data might behave very differently, and a larger model might generalize differently across formats.

500 examples is one data point. The most interesting open question is the scaling curve. How do these strategies compare at 50 examples? At 2,000? My intuition is that first-person statements stay efficient at low data counts while demonstrations need more volume to generalize, but that’s just a guess, not a result.

The LLM judge saturated. This means I have no fine-grained signal on how much better one strategy is at the vibes level. A harder rubric or human evaluation would give a cleaner picture.

LoRA r=16 is a choice. Higher rank might favor one format over another in ways I didn’t explore.

So, What’s the Best Way to Brainwash an LLM?

If you’re doing persona injection via fine-tuning, here’s the practical summary:

Use first-person statements if generalization matters. They’re not the intuitive choice, but they turn out to encode the persona more deeply. A model that has read “I am C-3PO and I find this plan deeply unwise” will sound like C-3PO in more situations than a model that has only seen C-3PO-style chat replies. The off-diagonal perplexity numbers make this case clearly.

Use demonstrations if your deployment context is fixed. If you know exactly what format users will interact with the model in, demonstrations are solid and straightforward. Train the model on what it will be asked to do, and it does it well. Just don’t expect that to transfer.

Use SDF if factual accuracy about the persona matters most. That 3.4 perplexity on synthetic documents is genuinely impressive. But the emotional and conversational texture of a personality doesn’t transfer well from third-person description; consider combining SDF with FP to get factual grounding plus felt identity.

Don’t underestimate a good system prompt. The baseline, just Qwen3-4B with a system prompt describing C-3PO, scored 5.0 on the judge and covered most key traits. For many use cases, that’s enough. Fine-tuning earns its cost when you need robustness across prompts you can’t control, or persona behaviour without a visible system prompt at all.

In practice, demonstrations teach behavior, synthetic documents teach facts, and first-person statements teach identity.

The experiment was a weekend-long sprint, and there’s a long list of things I want to follow up on. The most specific one: does FP’s efficiency advantage hold at low example counts? If first-person statements are still competitive at 50 examples while demonstrations fall apart, that would have real practical implications for how you build persona datasets. If you run this experiment before I do, I’d genuinely like to know whether I’m right.

Full code on GitHub. Fine-tuning was done with LoRA (r=16) on a single A40 via RunPod, using the TRL/PEFT stack. All datasets generated with Claude.
