Picking an Experimentation Platform: A Retrospective

, in every company that wants to ship products people love, when “we should experiment more” becomes “we cannot keep experimenting like this.” Hand-tuned holdouts; traffic-allocation tickets bouncing between PMs and engineers; analyst calendars booked weeks out. The wish to be data-driven sort of outgrows the machinery that was supposed to make it so.
That was where we sat at ManyChat last year. We chose Eppo, but that decision is the smallest part of the story, and the part you can least transplant to your company. What I want to share instead is the process I walked through to get there, what I got wrong along the way, and what surprised me on the other side of the contract (yep, doctors hate me for this trick).
A note on timing. We picked Eppo at an unusually exciting moment in the industry, as the vendor map was shifting under us mid-evaluation. Eppo itself had been acquired by Datadog some months before. Statsig had recently been acquired by OpenAI, and would later be sold on to Amplitude. I do not think any of what I describe below depends on that particular news cycle, but I want to acknowledge that some of it shaped our mood while we were deciding.
I break what follows into three acts: before the decision, during it (making the decision), and after.
Before
Let me get you in the mood we were in before everything happened. As I onboarded to the company, an engineer told me that if there were two simultaneous opportunities to run experiments, his team would simply postpone the second idea to a later sprint because the technical headache of configuring the two allocations. The risk of getting it wrong eventually outweighed the excitement to test. This is quite literally: anti-velocity at best; no experiment at worst. And for that one experiment that would be configured, copy-pasting boilerplate allocation logic was their bread and butter.
An analyst on the other side of that same pipe described herself as a “human microservice”; she meant the holdout groups, defined by hand, refreshed by hand, passed on to the engineer, and so on … an exciting opportunity to experience the entire flow in first-person POV, indeed. But, irony aside, that was the moment the case for a platform stopped being abstract.
I had seen versions of this room before. At Marktplaats, some years earlier, I had written the in-house Python libraries that try to absorb this kind of pain, and we saw time-to-insight go down from days to hours, in the tail cases.
I watched the same build-or-buy debate play out again at Adevinta, globally, at a larger scale, where it landed on building rather than buying. Lucky for us at Manychat, by the end of 2025 the platform offerings had matured enough that, for a company our size and at that moment, buying was the obvious move.
We wanted the tool that would give us the best shot at getting our experimentation program where we want it: cutting-edge statistics, yes, but more importantly a tool that nudges its users toward conclusive experiments by default; product managers included.
Two problems stood between us and the choice. The first was simple: we had named the pain, but it was only anecdotal so far. Leadership had a (very good) notion of what was broken, and I had heard devs and product managers grumble about the current stack when I first met them. But none of that was the same kind of object as a vendor requirements list. Until we could put the two side by side, we could not tell which capabilities were nice-to-haves and which were the point.
The second was harder. The decision carried a lot of weight as no matter how you put it, there is always a lock-in element to any platform; culturally, if not technically. And resources are finite: we could not POC every platform on the market. Let alone the opportunity cost of having to reverse the decision and start over again. Choosing one to bet on, in one sitting, with no chance to course-correct, would have been asking to be wrong. And with the offerings being so similar in most ways, finding the best one for us was a matter of precision. We needed a way to break a single high-stakes decision into smaller, lower-stakes ones that built on each other.
Interviews, and de-risking the decision
I started with interviews. PMs, product analysts, engineers, marketers. The point was to convert anecdote into something we could hold up against a vendor’s feature list. The engineer’s calendar story, the analyst’s “human microservice”, the PM who had given up on running atomic experiments and was bundling changes into bigger releases instead, postponing some of them entirely: these became the job description for the tool. I cannot overstate how much this paid me back later. Every time the process drifted, and it drifted, the interviews were the anchor we came back to. They were also what made the whole effort credible inside the organization: telling my CPO why we were spinning up a POC was a different conversation when I could quote a specific friction back to her.
For the single-shot problem, we phased the discovery into three layers, each focusing on the next level of depth in the evaluation:
- Desk research. Read the vendor docs, sketch a long list. Most platforms self-eliminated here, before we ever opened a sales funnel. Plenty of Claude Code at this step, too.
- Demos. A focused conversation with each shortlisted vendor. A little sales pitch, sure, but mostly us probing the areas we had decided mattered most.
- POC. Hands on the platforms, with real data and real evaluators, only for the two finalists.
Each layer narrowed the field and bought us information at a “price” we could afford. By the time we reached the POC we were down to two, and the decision in front of us had shrunk to something we could actually hold. Statsig, or Eppo?
There is one part of this I would repeat on day one of any future platform decision, in any category: the interviews define those pain points. They were the single biggest unlock of the whole stage. Running close behind them, sponsorship. And I don’t mean just from my director, who asked to push it forward. I kept peers and stakeholders who would have to back / adopt the decision in the loop the whole way through. By the time the POC ended, the decision surprised no one.
At the end of “before” we had a shortlist of two, and the discipline of how we had narrowed to them. We knew what worked for us. The harder question was still waiting: between two platforms that both cleared our bar, which was actually better for us? How would we define “better” conceptually, and how would we agree on it practically?
During
It was the debrief, after the POC, and the analysts on the panel were taking turns talking. Two of them, who knew our stack best, finished their summary with a sentence similar to:
“As a product analyst, I would be really happy to move forward with either of them.”
I sat with that for a moment. The consolidated scores agreed with them: the two platforms came in at 4.36 and 4.47 on a five-point scale, across more than twenty weighted criteria. By any reasonable read, it was a tie. I had spent weeks building a process that would point clearly at one platform, and the process had just told me, in the voice of the peers I trusted most to spot a meaningful difference, that there was no meaningful difference from his seat.
What I learned in that moment, and would not have learned without the panel, is that analyst-grade rigor has become table stakes. The marginal value of choosing one modern experimentation platform over another does not accrue to your scorecard; it accrues somewhere else. Where, exactly, was the question I now had to answer.
So I needed a decision I could defend; to myself first, then to my data director and CPO, then to the teams who would inherit it. Coin flips and personal preferences are bad foundations for a multi-year contract. And the tie meant the tiebreaker could not be invented after the fact; it had to reflect what we actually wanted from the next few years of experimentation at ManyChat.
Namely, we were not choosing between two snapshots; we were choosing between two trajectories. Eppo’s bet was on guided, opinionated, PM-shaped *cough * proof *cough * workflows; Statsig’s was on power-user flexibility. Both were defensible for sure. But we had said, recall:
We wanted the tool that would give us the best shot at getting our experimentation program where we want it: cutting-edge statistics, yes, but more importantly a tool that nudges its users toward conclusive experiments by default (…)
I noticed what did not happen. The POC plan called for PMs to trial both platforms and feed scores back into the matrix. They mostly did not because of bandwidth. One head of marketing operations and one PM gave me unprompted impressions, and the rest of the PM-side evidence and input stayed thin. The absence of PM feedback did something counterintuitive: it increased the weight I gave to PM-facing UX / workflows, and governance, in the final call. The logic is asymmetric. Analysts are adaptable, power-users if you will; they will work their way through whatever interface you hand them. PM onboarding is not adaptable in the same way. If the platform our analysts rated equally is also the one that lowers the barrier for our PMs, that is a decision; the reverse, picking the analyst-equivalent platform our PMs would have struggled with, would have been quiet self-sabotage.
In short, we could finally say: everything else near-equal, the usability for non-technical folks is what sets the two platforms apart.
So we picked Eppo. The trajectory question is what tipped it: on a longer horizon, Eppo lined up better with where we wanted experimentation to live; closer to experimenting teams, and beyond just the analyst. Knowledge management as a first-class object. Reporting that does not need a deck rebuilt around it. Statsig had its advantages too; CUPED (a variance-reduction technique) inside its power calculator, a standalone metrics explorer, a more flexible analysis surface; and we accepted those as Year 1 gaps to work around, while Eppo was being revambed within Datadog, and acquiring those features too.
Looking back, the lesson I take away from it is double-edged. The decision needed more rigour than instinct wanted, and then less faith in that rigour than I expected. The scorecard mattered because it forced everyone to be specific, and to create a sense of trust and credibility in the outcome. It gave me 360-degree coverage, but the call came from the moments inside it: the analyst tie, and the vision question. Six months after signing, a curious colleague would ask me how we had picked, and I could walk them through the panel, the scorecard, the corrections, and the vision/framing question. That’s a win for me.
After
I think I expected, somewhere I would not admit aloud, that signing the contract was the finish line. I had spent weeks building a credible decision system, a process, and had spent a couple of hours of vendor calls. The week we signed I had a quiet day. I sat down at my desk and started a working document about what would happen next. Legend has it that I am still writing it.
The clean-water metaphor I had used in the proposal kept coming back to me. We had laid the pipes; that was the SDK integration, the data plumbing, the warehouse connections. The platform itself too, if you will. Pipes get you flow, but not clean water. In the worst case, pipes contaminate it instead (more crap output, faster). Clean water is what comes out of pipes when the rest of the system (the source, the treatment, the people who maintain it) does its job. Experiments work the same way: a platform gets you the flow, but the trustworthy results come from governance and process, from people, and from how seriously the organization treats the difference between testing an idea and launching a feature.
The tool is ready; the organization is not yet ready for the tool.
Till that point I was deep in the cost of the contract, but not the cost of bridging the gap between the tool is present now and the organization is ready to use it.
I had told colleagues, in the weeks leading up to signing, that a chunk of the analytics team’s capacity would slowly ramp up to a new equilibrium once Eppo was live. As of writing, I am still hopeful that will materialise a quarter or two from now; but not before we get some things in place first. Velocity, the mere act of experimenting more in a given period, also has to wait.
Signing did not buy time back yet, nor did it bring us more experiments right away. The work that started the day after signing, forming a cross-functional integration group, drafting the experiment lifecycle, configuring Eppo protocols (part of its governance framework), certifying our first success metrics and guardrails, migrating a knowledge base, designing a training curriculum, all had to happen before the platform could deliver the velocity potential we knew it had. En breve, what was ahead was not a tool problem. Rather, a governance, process, and people one.
Three legs of a stool
For experiments to actually be trustworthy at Manychat, three things have to be present at the same time: the tooling, and engineering integration so experiments can flow through the platform, process and governance so the experiments that flow through are properly designed and decided, and people and skills so the best practises are followed in practice and not only on paper. Drop any one of the three and the whole thing leans.
We had the tool and the connections now. Process and governance was mostly on the data science team: a five-stage experiment lifecycle (Propose, Design, Run, Analyse, Decide); a certified set of success and guardrail metrics; all of it encoded into the platform’s own protocol templates so that the rails were not a Notion page but a feature of the tool. People and skills are to be materialised in ad hoc Eppo-delivered tool quick-starts, and an Experimentation 101 and 102 curriculum in the long term. An ongoing argument for a graduated autonomy model, PMs paired with analysts at first, more independence over time; that’s the dot at the horizon.
The other thing
A milder lesson: signing Eppo was where my job description changed. I had walked into the project as the Staff responsible for picking a tool. I walked out doing change management; onboarding teams, teaching, leaning on PMs about lifecycle compliance, spending credibility I had banked for other things. It was totally worth it for me, though.
Closing notes
If I had to compress all of this, these would be the few lines I’d fit it in:
A credible decision is the deliverable, not the platform. The platform is an artifact. The decision is what your organization will live inside for years.
In the same spirit, pipes are not water. A tool is necessary infrastructure for trustworthy experimentation, but not sufficient. The work begins, not ends, on the day the contract is signed.
I am writing all of this knowing the experimentation tools market is in motion; the vendor churn I flagged up top has not stopped. Whatever the map looks like by the time you read this, the bits of process that survived for me are probably the bits worth borrowing: the interviews, the phased discovery, the vision framing, and the honest budgeting for what comes after.
If you want to dive into the details over an online cup of coffee, feel free to ping me on LinkedIn! I’d be happy to share ideas with you.
Also check out my personal page for more piece like this.



