LLM Themes Are Not Observations

nimda May 21, 2026


An analyst joins LLM-extracted themes from a call corpus to the customer table. Customers without transcripts get NULL. The NULL gets filled with zero or with “no issue mentioned,” or is quietly absorbed into the reference category. In one line of preprocessing, the pipeline converts “did not call support” into “did not experience billing frustration.”

The regression that follows looks clean. The coefficient on “billing frustration” is significant, signed the way the product team expected, large enough to matter. It gets pasted into a roadmap document. Nobody asks where the variable came from.

This article is about what got smuggled in with that fill value, and about three other moves that look just as innocuous in a notebook but rest on assumptions the analysis never names. The setup is not specific to support calls. It applies to chat logs, ticket summaries, product reviews, sales transcripts, and survey free-response fields, anywhere a modern pipeline turns text into a tidy column. The pipeline could be a fine-tuned classifier, a zero-shot LLM, or an embedding-plus-cluster. The conceptual problem is the same: the column is not an observation of a customer attribute. It is the output of a generative process applied to a self-selected subset of customer behavior.

Practitioners increasingly treat outputs like these as if they were direct readings of customer state. They are not. They are generated variables: measurements produced by a pipeline, conditional on a customer doing something that left a textual trace, conditional on that trace surviving the extraction model. Every step of that conditional has consequences for what the variable means in a downstream causal model, and most of those consequences are invisible in the joined table.

Four things tend to go wrong, and the NULL move makes all four visible at once.

Selection. A theme exists for a customer because that customer called, complained, posted, or replied. Whatever drove that action is also probably correlated with the treatment, the outcome, or both. The NULL fill collapses “did not generate text” into the reference category, and the analysis is no longer estimating an effect over the customer base. It is estimating an effect over a redefined population, and the redefinition happened in preprocessing.

Timing. Was the call before the treatment, during it, or after? Pre-treatment text is a candidate confounder. Post-treatment text is a candidate mediator or outcome, and treating it as a pre-treatment control is a classic source of post-treatment bias. The joined table rarely makes this visible.

Measurement. The label “billing frustration” is not billing frustration. It is what the pipeline detected as billing-frustration-shaped language. Classifier accuracy is finite, and accuracy can differ across treatment arms, because a treatment that changes how customers talk also changes how the model reads them. The label noise is not orthogonal to the thing being studied.

Role. Is the theme acting as a confounder, a mediator, a treatment, an outcome, or a descriptive feature? The DAG decides this, not the column name. A variable that is methodologically valid in one role becomes a bias source in another.

These four problems are not independent. They interact. An LLM-detected theme inherits a selection footprint from the channel it came through, a timing footprint from when the text was generated, and a measurement footprint from the pipeline that extracted it. The downstream regression sees a column of zeros and ones.

The problem is not that the pipeline produced a bad label. The problem is that the label inherited a data-generating process the downstream analysis never modeled.

The rest of this article works through what that means in practice, where the standard workflow goes wrong, and what the minimum diagnostic looks like. We start with the role-and-timing question, because it is the one analysts get wrong first.

Role and timing are the same question

The first move an analyst makes with a transcript-derived theme is implicit: they treat it as a covariate. Themes go into the right-hand side of the regression. The treatment is the variable of interest. The outcome is on the left. The theme is “controlled for.”

That phrase, “controlled for,” is doing work the analyst hasn’t checked. Controlling for a variable adjusts away the part of the treatment-outcome relationship that flows through it. Whether that adjustment helps or hurts depends entirely on where the variable sits in the causal graph, and that position is determined by timing.

Pre-treatment text, generated before the treatment was assigned, can play the role of a confounder. If a customer called about billing in January and the retention offer went out in March, the call captures something about customer state that may influence both who got the offer and who churned. Conditioning on the theme here can reduce bias from omitted variables, provided the theme actually proxies for the relevant construct and the selection issues in the next section are handled.

Concurrent text, generated as part of the treatment itself, is not a covariate at all. If the treatment is a call from a retention agent and the theme comes from that same call, the theme is part of the intervention. Conditioning on it doesn’t adjust for confounding; it removes part of the effect the analyst is trying to measure.

Post-treatment text, generated after the treatment, is the most dangerous category, because it is the one most likely to be misclassified as a confounder by an analyst working from a flat table with no time index. A customer who received a retention offer in March and called complaining in April produced a transcript that reflects, at least in part, their response to the treatment. Conditioning on a theme extracted from that call is conditioning on a post-treatment variable. That can block mediation paths, induce collider associations, or otherwise shift the estimand away from the treatment effect the analyst thinks they are estimating.
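The three timing categories can be made explicit in the table before any modeling. The following is a minimal sketch, assuming each call has start and end timestamps and each treated customer has a single treatment timestamp. The table and column names (`calls`, `treatments`, `call_start`, `call_end`, `treated_at`) are hypothetical, and the rule that a call overlapping the treatment timestamp counts as concurrent is an assumption of this sketch, not a general definition.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: one row per call, one row per treated customer.
calls = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "call_start": pd.to_datetime(
        ["2026-01-10 09:00", "2026-03-05 14:00", "2026-04-02 11:00", "2026-02-01 10:00"]
    ),
    "call_end": pd.to_datetime(
        ["2026-01-10 09:20", "2026-03-05 14:30", "2026-04-02 11:15", "2026-02-01 10:05"]
    ),
})
treatments = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "treated_at": pd.to_datetime(
        ["2026-03-05 13:30", "2026-03-05 14:10", "2026-03-05 13:30"]
    ),
})

# Left join keeps calls from customers who were never treated (treated_at is NaT).
timed = calls.merge(treatments, on="customer_id", how="left")

# Conditions are evaluated in order; comparisons against NaT are False,
# so the untreated check must come first.
conditions = [
    timed["treated_at"].isna(),
    timed["call_end"] < timed["treated_at"],
    timed["call_start"] > timed["treated_at"],
]
labels = ["untreated", "pre_treatment", "post_treatment"]

# ASSUMPTION: anything left over overlaps the treatment timestamp.
timed["text_timing"] = np.select(conditions, labels, default="concurrent")
print(timed[["customer_id", "text_timing"]])
```

Under this sketch, only themes from rows tagged `pre_treatment` would be candidates for the confounder role. The other rows need a different role or a different analysis.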

A worked example makes this concrete. Consider a synthetic but business-realistic setup. Customers are targeted for a retention offer by a model that picks up price sensitivity. Both the offer assignment and customer churn depend on this underlying price sensitivity, which the analyst does not observe. Customers who are more price-sensitive are more likely to receive the offer (because the targeting model selected them) and more likely to churn regardless. They are also more likely to call support and express bill shock, and customers who churn are more likely to call as well. The theme “bill shock” is generated from those post-treatment calls.

The naive analyst joins the theme onto the customer table, fills NULL as zero, and runs a logistic regression of churn on offer plus bill-shock:

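import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 20000

# Unobserved confounder: drives offer targeting, churn, calling, and the theme.
price_sens = rng.normal(0, 1, n)
# Targeting: price-sensitive customers are more likely to receive the offer.
offer = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * price_sens))))
# Outcome: the true effect of the offer on churn is -0.5 in log-odds.
churn = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 1.2 * price_sens - 0.5 * offer))))
# Selection into text: price-sensitive customers and churners call more often.
called = rng.binomial(1, 1 / (1 + np.exp(-(-1.5 + 0.7 * price_sens + 0.9 * churn))))

# The theme can only be detected when a transcript exists; non-callers are zero-filled.
theme_prob = 1 / (1 + np.exp(-(-0.5 + 0.8 * price_sens)))
bill_shock = np.where(called == 1, rng.binomial(1, theme_prob), 0)

# The analyst's table: price_sens and called are deliberately absent.
df = pd.DataFrame({"churn": churn, "offer": offer, "bill_shock": bill_shock})

# Naive specification: churn on offer, controlling for the zero-filled theme.
X = sm.add_constant(df[["offer", "bill_shock"]])
naive = sm.Logit(df["churn"], X).fit(disp=0)
print(naive.params)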

The true effect of the offer on churn is −0.50 in log-odds. The offer is supposed to reduce churn, and in the data-generating process it does. Here is what four specifications return:

Figure 1. Same data, four specifications, four different answers.

Specification              Offer coefficient   What it says
Naive (with bill_shock)    +0.12               Offer appears harmful
Dropped (no bill_shock)    +0.24               Offer still appears harmful
Oracle (with price_sens)   −0.55               Offer reduces churn
True effect (DGP)          −0.50               Offer reduces churn

Two observations from this table. Both follow from the fact that offer assignment is already confounded by price sensitivity: removing the bad control removes one additional source of distortion, but it does not make the design valid.

First, the naive specification is wrong in direction. Adding the bill-shock control to a model that was already biased does not repair the sign: the offer coefficient stays positive (+0.12, against +0.24 without the control), while the true effect is negative. The product team reading this output would conclude that retention offers cause churn. They would be wrong.

Second, dropping the bill-shock variable does not fix the analysis. The dropped specification is also positive, and only the oracle specification, which conditions on the unobserved confounder directly, recovers the true effect. In a real analysis the analyst does not have that column. That is the point. Removing a bad control is necessary but not sufficient, and a post-treatment theme extracted from a self-selected calling subpopulation is not a substitute for identification.

The mechanism by which the bill-shock control distorts the naive specification is worth walking through. Churn affects the likelihood of calling, because customers who are leaving are more likely to call. Bill-shock is only observed for customers who called, since the theme requires a transcript to exist. Conditioning on bill-shock therefore conditions on a downstream consequence of churn. Among customers with bill-shock equal to one, the relationship between offer and price sensitivity has been distorted, because both variables now help explain why the customer ended up flagged. The coefficient on offer absorbs that induced association.

The methodological point generalizes. A transcript-derived variable has a position in the causal graph determined by when the text was generated relative to the treatment, who generated it, and what process produced the label. Role and timing are the same question viewed through different lenses. These variables come with a structural footprint the analyst is responsible for tracing, and the joined table is not where the tracing happens.

The selection question

Most industry analyses using support transcripts implicitly redefine the population from “customers” to “customers who generated support language.” The estimand changes before the regression even begins.

This is the part that tends to matter most in practitioner workflows, and it is where the standard workflow is most fragile.

The text exists because the customer did something: called, posted, complained, replied. That something is a behavior, not a measurement. It is influenced by customer characteristics, by the channel that was available, by the urgency of the underlying issue, and often by the treatment itself. None of these are random. None are typically orthogonal to the outcome.

The NULL handling decision is where this becomes operational. There are three common moves, and each carries an assumption.

Filling NULL as zero or “no issue mentioned” assumes that not generating text is informative about the absence of the underlying construct. The analyst is claiming that customers who did not call did not experience the thing the theme is detecting. For most themes worth detecting, this is implausible on its face. Customers who did not call may have experienced billing frustration and resolved it by canceling, by switching to a competitor, by complaining on social media, or by giving up. The zero-fill turns all of these into “no frustration.”

Dropping rows with NULL themes, restricting the analysis to the calling subpopulation, is at least honest about the population, but it changes the estimand. The treatment effect among customers who called is not the treatment effect among customers, and the difference between the two is often the entire point of the business question. A retention offer’s effect on churn-prone callers is a useful quantity. It is not the quantity most analyses claim to estimate.

Treating text-presence as a missingness mechanism and applying inverse probability weighting based on a model of who calls is, methodologically, the right shape of move. The catch is the propensity model itself. Modeling who generates text requires writing down what drives calling, and that model depends on demographics, tenure, prior issues, treatment exposure, and unmeasured frustration, which is the construct the theme was supposed to help measure in the first place. The IPW move is principled, and it is also rarely as principled as it looks.

The deeper point is that selection into text is a behavior that interacts with the treatment. A retention offer may change calling rates. A pricing change may change complaint rates. A feature launch may change the kinds of issues customers articulate. Any of these makes the selection mechanism itself treatment-dependent, which means even a perfectly extracted, perfectly timed theme is being measured on a population whose composition shifts with the treatment. Standard observational corrections assume the selection mechanism is stable. When the treatment moves the selection, the corrections don’t.
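A first check on whether the treatment moves selection into text is to compare calling rates across arms. The sketch below uses a two-proportion z-test from statsmodels on hypothetical counts. The numbers are made up for illustration, and a difference in rates is a warning sign rather than a measure of how much bias results.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: customers who generated at least one transcript, by arm.
callers = np.array([1840, 1420])      # [treated, control]
customers = np.array([10000, 10000])  # [treated, control]

rates = callers / customers
z_stat, p_value = proportions_ztest(count=callers, nobs=customers)

print(f"calling rate treated: {rates[0]:.3f}, control: {rates[1]:.3f}")
print(f"z = {z_stat:.2f}, p = {p_value:.3g}")
```

A clear gap between the arms suggests that the population with text is not the same population under treatment and control, so any theme measured on it is measured on shifting ground.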

None of this means transcript-derived variables are useless. It means the analyst owes the reader an explicit statement of which population the analysis is estimating an effect over, what mechanism produced the text, and what assumption was made about everyone whose text doesn’t exist.
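One way to keep that assumption visible is to carry text presence as its own column instead of folding it into the theme. The following is a minimal sketch with hypothetical `customers` and `themes` tables. It uses the `indicator` option of `DataFrame.merge` to record which customers had a transcript, and leaves the theme as NaN where no text exists.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3, 4]})
# Hypothetical theme table: one row per customer with at least one transcript.
themes = pd.DataFrame({"customer_id": [1, 3], "bill_shock": [1, 0]})

# indicator=True adds a `_merge` column recording where each row came from.
joined = customers.merge(themes, on="customer_id", how="left", indicator=True)
joined["has_transcript"] = joined["_merge"] == "both"
joined = joined.drop(columns="_merge")

# Three states instead of two. bill_shock stays NaN for customers without text,
# so "did not call" is never silently read as "no bill shock".
joined["theme_state"] = "no_transcript"
called_mask = joined["has_transcript"]
joined.loc[called_mask & (joined["bill_shock"] == 0), "theme_state"] = "called_no_theme"
joined.loc[called_mask & (joined["bill_shock"] == 1), "theme_state"] = "theme_detected"

print(joined)
```

Whatever happens next (zero-fill, restriction to callers, or weighting) then has to be written as an explicit step against `has_transcript`, which is where the assumption belongs.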

The measurement question

Old NLP outputs looked noisy. TF-IDF weights, sparse keyword counts, LDA topic vectors: none of them looked like things a customer felt. Practitioners distrusted them by reflex, and that reflex saved a lot of bad analyses.

LLM outputs do not look noisy. They look like latent constructs. A label like “billing frustration” or “trust erosion” or “renewal anxiety” reads like a description of a customer’s mental state. The label is articulate, the categories are semantically coherent, and the failure modes don’t announce themselves in the column. The persuasion problem is real before the statistical problem starts.

The statistical problem is more familiar. An LLM theme is a noisy proxy for the underlying construct. The label “bill shock” is not bill shock. It is what the model decided was bill-shock-shaped language in the transcripts it processed. Classifier accuracy is finite even for clean tasks, and the accuracy on the actual population, not the held-out evaluation set, is often unknown. Plugging a noisy proxy into a regression in place of the true variable attenuates coefficients toward zero in some setups and distorts them in others, depending on whether the noise is differential.

Differential measurement error is where the real damage lives. If a treatment changes how customers talk, and most treatments worth running do, then the classifier’s accuracy on theme detection can differ between treatment and control. A retention offer that softens customer sentiment may reduce the rate at which the model flags “bill shock” language without reducing the underlying frustration. A pricing change that shifts how customers articulate complaints may move classifier accuracy more in one arm than the other. The label noise is no longer mean-zero. It is correlated with the treatment, and conditioning on the noisy label biases the estimated treatment effect in a direction the analyst cannot easily sign.
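Whether label reliability differs by arm can be checked directly if a hand-labeled audit sample exists. The sketch below assumes a hypothetical `audit` table with a human label, the pipeline's label, and the treatment arm, and computes sensitivity and specificity per arm. The twelve rows are toy data chosen to make the arithmetic visible; a real audit would need far more rows per arm to say anything.

```python
import pandas as pd

# Hypothetical audit sample: human judgment vs. pipeline label, by arm.
audit = pd.DataFrame({
    "arm":   ["control"] * 6 + ["treated"] * 6,
    "human": [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "model": [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
})

def arm_metrics(group):
    """Sensitivity and specificity of the model label against the human label."""
    positives = group[group["human"] == 1]
    negatives = group[group["human"] == 0]
    return pd.Series({
        "sensitivity": (positives["model"] == 1).mean(),
        "specificity": (negatives["model"] == 0).mean(),
        "n": len(group),
    })

# Select the label columns explicitly so the grouping key is not passed to apply.
by_arm = audit.groupby("arm")[["human", "model"]].apply(arm_metrics)
print(by_arm)
```

In this toy sample the pipeline catches every human-labeled case in the control arm but only one in three in the treated arm. A gap of that kind means the label is measuring something different in each arm, and a coefficient on it cannot be read as an effect on the underlying construct.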

There is a literature on correcting for classifier-induced measurement error. Egami and colleagues develop a split-sample workflow for causal inference with text-discovered measures as treatments or outcomes in “How to Make Causal Inferences Using Texts”. Mozer and colleagues apply text-augmented matching to electronic health records and show how text-based covariates change estimated effects in a real medical study in “Leveraging text data for causal inference using electronic health records”. For the broader landscape, Keith, Jensen, and O’Connor review how text has been used to remove confounding across applications in “Text and Causal Inference: A Review of Using Text to Remove Confounding from Causal Estimates”. These methods exist, and they are worth using when the analysis matters. They also require the analyst to recognize that a label is a measurement with error in the first place, which is the move most workflows skip.

The practitioner mistake is not using the label. The practitioner mistake is treating a label that came out of a generative model as if it were a column read off a sensor.

A practitioner checklist

A causal analysis that uses a generated variable derived from transcripts can still be defensible. It just has to answer five questions before the regression runs.

1. What role am I assuming this variable plays?

Confounder, mediator, treatment, outcome, or descriptive feature. The DAG decides. The column name does not.

2. When was the text generated relative to the treatment?

Pre-treatment, concurrent, or post-treatment. If the analyst cannot answer this from the data, the variable does not enter the model as a confounder.

3. What selection mechanism produced the text, and what am I assuming about everyone whose text doesn’t exist?

Zero-fill, drop, IPW: each is an assumption. Pick one and state it.

4. How was the label produced, and could its reliability differ across treatment arms?

If the treatment plausibly changes how customers express the underlying construct, classifier accuracy is not constant across the comparison the analysis is making.

5. What does the result look like under a stress test?

Refit the model without the transcript-derived variable. If the headline coefficient changes sign or shifts substantially between the two fits, the result is not stable enough to carry a causal claim on its own.

These five questions are not a solution. They are a diagnostic. An analyst who can answer them is not guaranteed an identified effect. An analyst who cannot answer them is doing descriptive work with causal language attached.

The broader pattern is older than LLMs. Generated variables are pipeline outputs that look like observations but are actually model outputs conditioned on selection. They show up in fraud scores, recommender relevance metrics, sentiment indices, propensity scores reused as covariates, and any latent-trait estimate produced by an upstream model and consumed by a downstream analysis. LLMs did not invent this mistake. They made it accessible at a scale and a fluency that older NLP outputs never reached. The labels look like latent constructs, the columns look like measurements, and the workflow looks like causal inference.

The assumptions did not disappear. They just moved upstream.

Staff Data Scientist focused on causal inference, experimentation, and decision science. I write about turning ambiguous business questions into decision-ready analysis.