The Fragility Lab
On the data. Every numerical example on this page is stylised for methodological illustration. No GiveWell cost-effectiveness estimate, intervention figure or grant amount is reproduced. The intent is methodological, not evaluative.

Most funding recommendations look solid in writing. There is a positive expected value, a couple of supportive trials, a pathway from input to outcome that an intelligent reader can follow without too many uncomfortable questions. The model has been built carefully and the assumptions have been ticked off in a footnote.

Then reality intervenes. In one country office the effect holds. In another it collapses, often for reasons no one thought to put in the model. The monitoring system measures activity rather than impact, and by the time the underlying question — is the intervention still doing what we paid for? — receives an honest answer, the next funding cycle has already been locked in.

The gap

Where the model and reality stop agreeing

Model trajectory: +4.8× expected
Realised average: +1.4×
Detection lag: ~3 years
The real question
Which uncertainties are decision-relevant?

Not whether the model has any uncertainty. Whether enough of it sits in places that would actually change the funding call.

The test
What would change our mind?

And would we recognise that evidence in time to change course, before the next funding cycle starts treating the current call as settled?

The discipline
Earn the right to be confident.

A research team tends to be judged less by the confidence of its first answer than by what it later admits to having got wrong.

01 The expected-value trap

Move the sliders. Watch a 4.8× recommendation lose its shape.

Each parameter below is something a real research team has to estimate from imperfect evidence. The chart on the right shows the resulting distribution of plausible true returns. The dots are a hundred sample scenarios under your current assumptions. The optimizer's curse lives in how easily the headline number sits at the optimistic end of a much wider distribution.

Drag the four sliders. Then look at the dots.
Stylised intervention A · model inputs
Adjust the assumptions
Pooled effect size: 0.42 SD (range: 0.10 weak to 0.80 strong)
Implementation quality: 85% (range: 40% weak to 100% trial-level)
External validity discount: 15% (range: 0% transfers fully to 60% does not)
Cost variance (relative SD): ±10% (range: ±5% to ±50%)
Defaults match the headline 4.8× recommendation
Point estimate: 4.8× (headline number)
95% CI: 0.9× to 8.7× (plausible range)
P(below break-even): 11% (chance net return < 1×)
Years to detect: ~4 (time before signal arrives)
Chart markers: break-even (1.0×), point estimate, 9× return
Live reading
A strong recommendation, but the lower end of the credible range is close enough to break-even that monitoring needs to be serious.
Try this
Pull implementation quality down to 60% while keeping the effect size at default. The headline number barely moves, but the credible interval swallows the break-even line, and the red dots multiply. The point estimate was hiding most of the risk.
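A minimal sketch of the kind of simulation that could sit behind the dots, not the tool's actual model. The multiplicative form, the 0.10 SD standard error on the effect size, and the choice to treat implementation quality purely as a source of spread are all assumptions made for illustration. The function name `simulate` is hypothetical.

```python
import random

HEADLINE = 4.8  # stylised headline multiple from the page, not a real estimate


def simulate(effect=0.42, quality=0.85, discount=0.15, cost_sd=0.10,
             n=10_000, seed=0):
    """Draw n plausible returns under a hypothetical multiplicative model."""
    rng = random.Random(seed)
    # Scaled so that the default inputs reproduce the 4.8x point estimate.
    validity = (1 - discount) / (1 - 0.15)
    point = HEADLINE * (effect / 0.42) * validity
    draws = []
    for _ in range(n):
        e = rng.gauss(effect, 0.10)                # assumed standard error of 0.10 SD
        delivery = rng.gauss(1.0, 1.0 - quality)   # assumed: lower quality adds spread only
        cost = max(0.05, rng.gauss(1.0, cost_sd))  # relative cost, floored to stay positive
        draws.append(HEADLINE * (e / 0.42) * validity * delivery / cost)
    draws.sort()
    return {
        "point": point,
        "low": draws[int(0.025 * n)],    # 2.5th percentile of the draws
        "high": draws[int(0.975 * n)],   # 97.5th percentile of the draws
        "p_below_break_even": sum(r < 1.0 for r in draws) / n,
    }


if __name__ == "__main__":
    for q in (0.85, 0.60):
        r = simulate(quality=q)
        print(f"quality={q:.2f} point={r['point']:.1f}x "
              f"interval={r['low']:.1f}x to {r['high']:.1f}x "
              f"P(<1x)={r['p_below_break_even']:.1%}")
```

Under these assumptions the point estimate is identical at 85% and 60% implementation quality, while the interval widens and the share of draws below break-even grows. The exact readouts will not match the page, since the page's own formula is not given here.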
02 External validity collapse

Drag the slider. Watch the same evidence produce a very different outcome.

I saw a version of this pattern repeatedly across 160+ UNDP country offices. The same procurement framework would produce a reliable signal in one operational setup and noise in another, less because of the framework itself than because of what was actually working in the field that quarter. Slide the context fidelity below to see how the realised effect changes. The original 4.8× sits on the left as a constant reminder of what the model promised.

Drag the fidelity slider all the way down to zero.
Stress test
Context fidelity to trial conditions: 75%
0%: different epidemiology, weak systems, cultural barriers
100%: comparable institutions, trained staff, reliable supply
Promised: 4.8× (what the model said)
Realised: 2.7× (in this context)
Context · Partial fit
Implementation quality settles below trial levels. Supply chains are uneven, staff turnover is high, and the true effect probably sits at around half the headline number. The intervention may still be worth funding, but the margin of safety is thinner than the model suggests, and the monitoring needs to be stronger before any scale-up.
The shock moment
Below roughly 25% fidelity, the realised effect drops below zero. The recommendation isn't just weak in that setting. It's actively destroying value — and the original model gave us almost no warning that this was possible.
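One way to picture the stress test is as a curve from context fidelity to realised return. The sketch below is an illustrative quadratic chosen only because it passes through the three values quoted in this section (4.8× at 100% fidelity, 2.7× at 75%, zero at 25%). It is an assumption, not the formula the tool uses, and `realised_return` is a hypothetical name.

```python
def realised_return(fidelity):
    """Illustrative return multiple for a context fidelity between 0 and 1.

    The coefficients are fitted to three stylised points from the page:
    (1.00, 4.8), (0.75, 2.7) and (0.25, 0.0). They carry no empirical meaning.
    """
    if not 0.0 <= fidelity <= 1.0:
        raise ValueError("fidelity must be between 0 and 1")
    return 4.0 * fidelity ** 2 + 1.4 * fidelity - 0.6


if __name__ == "__main__":
    for pct in (100, 75, 50, 25, 0):
        print(f"fidelity {pct:3d}% -> {realised_return(pct / 100):+.2f}x")
```

The convex shape is the point of the sketch: the loss from the first 25 points of lost fidelity is larger than a straight line between the endpoints would suggest, and the curve goes negative before fidelity reaches zero.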
03 What would change our mind?

The monitoring and evidence gaps that actually matter.

Most monitoring systems are built to confirm that activities took place. Many fewer are built to detect whether the intervention is still working. The gap between those two questions is where most late-detected failures sit, quietly accruing, until someone finally looks for the right thing in the right place.

Tick the boxes that match your own situation. Watch the meter on the right.
Monitoring

Each box below is a gap that any honest research team will recognise. The meter on the right gives a rough estimate of the confidence we would have that major underperformance would be detected in time to change course, rather than in time for a polite paragraph in the lookback.

Select the gaps that apply
Confidence we would detect failure in time: 78%
With current monitoring, we would probably notice large underperformance. Eventually.
Time to first credible signal: ~2 yrs
The pattern
Ticking three boxes drops confidence below 45% and pushes the detection window beyond five years. By the time the system catches up, the next funding cycle is already locked in, and the lookback explains the loss instead of preventing it.
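The meter's behaviour can be sketched with a simple compounding rule. The starting values (78% confidence, about two years to a first signal) come from the readout above. The per-gap penalties are hypothetical numbers picked only so that three gaps land below 45% and beyond five years, as the pattern describes. The tool's real weights are not stated on the page.

```python
BASE_CONFIDENCE = 0.78    # meter reading with no gaps ticked (from the page)
BASE_LAG_YEARS = 2.0      # "~2 yrs" to first credible signal (from the page)
RETAIN_PER_GAP = 0.82     # hypothetical: each gap keeps 82% of remaining confidence
EXTRA_LAG_PER_GAP = 1.1   # hypothetical: each gap adds 1.1 years of detection lag


def detection_outlook(gaps_ticked):
    """Return (confidence, lag_years) for a given number of monitoring gaps."""
    if gaps_ticked < 0:
        raise ValueError("gaps_ticked must be non-negative")
    confidence = BASE_CONFIDENCE * RETAIN_PER_GAP ** gaps_ticked
    lag_years = BASE_LAG_YEARS + EXTRA_LAG_PER_GAP * gaps_ticked
    return confidence, lag_years


if __name__ == "__main__":
    for gaps in range(5):
        confidence, lag = detection_outlook(gaps)
        print(f"{gaps} gaps: confidence {confidence:.0%}, first signal in ~{lag:.1f} yrs")
```

A multiplicative penalty is used here because gaps plausibly compound: each missing check weakens what the remaining checks can still catch. An additive rule would be an equally defensible assumption.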
04 The fragility map

Now zoom out. See the entire parameter landscape at once.

You've moved sliders one at a time and tested one context at a time. The heatmap below shows the expected value for every combination of two parameters, with the others held at their defaults. Each cell is one possible reality. Green is robust, red is where the recommendation breaks, and the amber band in the middle is where most real-world grantmaking actually lives. The ★ marks the cell where the headline 4.8× recommendation currently sits.

Swap the two axes. Hover any cell for the detailed reading. Look for the cliff.
X axis (horizontal): Implementation quality, 40% to 100%
Y axis (vertical): Pooled effect size, 0.10 to 0.80
Colour scale: expected return (×), from 1× break-even to 7×+
Iso-contour rings: 1× break-even, 2× fragile-robust, 3.5× confidence band, 5× neon ridge
Robust region: 42% (cells with EV ≥ 1.5×)
Breaks region: 23% (cells with EV < 1×)
Cliff steepness: Moderate (how sharply EV falls across the grid)
The full picture
The headline 4.8× sits in a single corner of this landscape. Most real grantmaking happens in the amber band where the answer is genuinely uncertain. Try plotting external validity against monitoring quality — the red corner expands fast, because poor monitoring multiplies the cost of being wrong. The job of cross-cutting research is to know which dimensions push you toward green and which push you off the red cliff.
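The map can be sketched as a grid evaluation over two parameters. The multiplicative return function below is an assumed stand-in for the tool's model, scaled so the default inputs give 4.8×, and the axes chosen are pooled effect size against external validity discount. The 1.5× and 1× thresholds are the ones quoted for the robust and breaks regions. The names `expected_return` and `fragility_map` are hypothetical.

```python
HEADLINE = 4.8  # stylised headline multiple from the page


def expected_return(effect, discount):
    """Hypothetical model: return scales with effect size and (1 - discount)."""
    return HEADLINE * (effect / 0.42) * ((1 - discount) / 0.85)


def linspace(lo, hi, steps):
    """Evenly spaced values from lo to hi inclusive."""
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]


def fragility_map(steps=15):
    """Evaluate every cell and report the robust and breaks shares."""
    effects = linspace(0.10, 0.80, steps)    # slider range for pooled effect size
    discounts = linspace(0.0, 0.60, steps)   # slider range for validity discount
    grid = [[expected_return(e, d) for d in discounts] for e in effects]
    cells = [value for row in grid for value in row]
    robust = sum(v >= 1.5 for v in cells) / len(cells)
    breaks = sum(v < 1.0 for v in cells) / len(cells)
    return grid, robust, breaks


if __name__ == "__main__":
    _, robust, breaks = fragility_map()
    print(f"robust region: {robust:.0%} of cells, breaks region: {breaks:.0%}")
```

The shares printed here will differ from the 42% and 23% shown on the page, because the axes and the assumed model differ. The sketch only shows how a region share is counted once a return function is fixed.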

None of this is an argument against ambitious global health funding. It is an argument for building research and monitoring systems that are honest about how little they sometimes know, and how late that knowledge tends to arrive.

References & further reading

Public sources behind the framing.

None of the numbers on this page reproduce a GiveWell estimate or grant figure. The concepts and language, though, are taken directly from GiveWell's published work on cost-effectiveness, mistakes and moral weights, and from the broader effective altruism research conversation around uncertainty and red-teaming.

About this piece

Built as a thinking tool, not a deliverable.

The Fragility Lab started as a way to clarify my own thinking about how cost-effectiveness recommendations actually fail in the field, and how late a research team usually finds out about it. The tool is small and stylised. The intention behind it is not.

It sits in a portfolio of solo-built interactive evidence tools, alongside Travel Shockwaves (a global travel-disruption engine, my entry to the Capgemini UK Visualisation Guild Challenge 2026) and The Civilization Lab (193 nations across 1900–2050, 50+ indicators, currently in private development). All three are built end-to-end as single-file vanilla HTML, through a documented multi-model AI orchestration protocol in which I keep the editorial work to myself.

My background is twenty years across UNDP, UNICEF Supply Division, WHO Europe, LEGO and Capgemini, mostly working on the messy institutional data that senior decisions actually depend on. Ten of those years were at UNDP in cross-portfolio analytics across 160+ country offices, including health commodity procurement (bed nets, ARVs, TB and malaria treatments) with Global Fund PQR data alongside UNDP Atlas records.

Author: Oriol Cervantes Grau
Built: June 2026
Stack: Vanilla HTML / CSS / JS
Approach: Multi-model AI orchestration
Sources: GiveWell, UN, official statistics