The Fragility Lab — Why strong global health recommendations break

Most funding recommendations look solid in writing. There is a positive expected value, a couple of supportive trials, a pathway from input to outcome that an intelligent reader can follow without too many uncomfortable questions. The model has been built carefully and the assumptions have been ticked off in a footnote.

Then reality intervenes. In one country office the effect holds. In another it collapses, often for reasons no one thought to put in the model. The monitoring system measures activity rather than impact, and by the time the underlying question — is the intervention still doing what we paid for? — receives an honest answer, the next funding cycle has already been locked in.

The gap

Where the model and reality stop agreeing

Model trajectory: +4.8× expected

Realised average: +1.4×

Detection lag: ~3 years

The real question

Which uncertainties are decision-relevant?

Not whether the model has any uncertainty. Whether enough of it sits in places that would actually change the funding call.

The test

What would change our mind?

And would we recognise that evidence in time to change course, before the next funding cycle starts treating the current call as settled?

The discipline

Earn the right to be confident.

A research team tends to be judged less by the confidence of its first answer than by what it later admits to having got wrong.

01 The expected-value trap

Move the sliders. Watch a 4.8× recommendation lose its shape.

Each parameter below is something a real research team has to estimate from imperfect evidence. The chart on the right shows the resulting distribution of plausible true returns. The dots are a hundred sample scenarios under your current assumptions. The optimizer's curse lives in how easily the headline number sits at the optimistic end of a much wider distribution.
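The optimizer's curse can be sketched in a few lines. This is a minimal illustration, not the model behind the chart: the number of candidate interventions, the spread of true returns and the size of the estimation noise are all assumed values chosen for the demonstration. It ranks candidates by noisy estimates, picks the top one, and compares what the winner was estimated to return with what it truly returns.

```python
import random
import statistics


def simulate_optimizers_curse(n_options=10, n_trials=2000,
                              true_mean=2.0, true_sd=0.5,
                              noise_sd=1.5, seed=0):
    """Return (mean estimated return, mean true return) of the top-ranked option.

    All defaults are hypothetical values for illustration only.
    """
    rng = random.Random(seed)
    winner_estimates, winner_truths = [], []
    for _ in range(n_trials):
        # Each option has a true return and a noisy estimate of it.
        truths = [rng.gauss(true_mean, true_sd) for _ in range(n_options)]
        estimates = [t + rng.gauss(0.0, noise_sd) for t in truths]
        # Fund whichever option looks best on the estimate.
        best = max(range(n_options), key=lambda i: estimates[i])
        winner_estimates.append(estimates[best])
        winner_truths.append(truths[best])
    return statistics.mean(winner_estimates), statistics.mean(winner_truths)


estimated, realised = simulate_optimizers_curse()
print(f"winner's estimated return: {estimated:.2f}x")
print(f"winner's true return:      {realised:.2f}x")
```

The estimate of the selected option overstates its true return on average, even though every individual estimate is unbiased. The selection step is what introduces the optimism.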

Stylised intervention A · model inputs

Adjust the assumptions

Pooled effect size: 0.42 SD (0.10 weak to 0.80 strong)

Implementation quality: 85% (40% weak to 100% trial-level)

External validity discount: 15% (0% transfers fully to 60% does not)

Cost variance (relative SD): ±10% (±5% to ±50%)

Defaults match the headline 4.8× recommendation

Point estimate

4.8×

Headline number

95% CI

0.9× — 8.7×

Plausible range

P (below break-even)

11%

Chance net return < 1×

Years to detect

Time before signal arrives

Live reading

A strong recommendation, but the lower end of the credible range is close enough to break-even that monitoring needs to be serious.

Weakens

★

Try this

Pull implementation quality down to 60% while keeping the effect size at default. The headline number barely moves, but the credible interval swallows the break-even line, and the red dots multiply. The point estimate was hiding most of the risk.
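A rough sketch of why a point estimate can hide the risk. It does not reproduce the page's model or its figures. The lognormal shape and the two spread values (`log_sd`) are assumptions, chosen only so that both sample sets share the same 4.8× median. The `summarise` helper reports the three readings shown on the cards: a central estimate, a 95% range, and the share of scenarios below break-even.

```python
import math
import random
import statistics


def summarise(samples, break_even=1.0):
    """Median, central 95% range and share of samples below break-even."""
    cuts = statistics.quantiles(samples, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return {
        "median": statistics.median(samples),
        "low": cuts[0],
        "high": cuts[-1],
        "p_below_break_even": sum(s < break_even for s in samples) / len(samples),
    }


def sample_returns(headline, log_sd, n=20000, seed=0):
    """Draw returns from a lognormal whose median equals the headline (assumed shape)."""
    rng = random.Random(seed)
    return [rng.lognormvariate(math.log(headline), log_sd) for _ in range(n)]


tight = summarise(sample_returns(4.8, log_sd=0.4))   # hypothetical narrow spread
wide = summarise(sample_returns(4.8, log_sd=1.3))    # hypothetical wide spread
for name, s in (("tight", tight), ("wide", wide)):
    print(f"{name}: median {s['median']:.1f}x, "
          f"95% range {s['low']:.1f}x to {s['high']:.1f}x, "
          f"P(<1x) {s['p_below_break_even']:.1%}")
```

Both runs report nearly the same headline, but only the wide one has a range that crosses the break-even line. That is why the interval and the break-even probability are worth reading alongside the point estimate.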

02 External validity collapse

Drag the slider. Watch the same evidence produce a very different outcome.

I saw a version of this pattern repeatedly across 160+ UNDP country offices. The same procurement framework would produce a reliable signal in one operational setup and noise in another, less because of the framework itself than because of what was actually working in the field that quarter. Slide the context fidelity below to see how the realised effect changes. The original 4.8× sits on the left as a constant reminder of what the model promised.

Stress test

Context fidelity to trial conditions: 75%

0%: different epidemiology, weak systems, cultural barriers

100%: comparable institutions, trained staff, reliable supply

4.8×

Promised

What the model said

2.7×

Realised

In this context

Context · Partial fit

Implementation quality settles below trial levels. Supply chains are uneven, staff turnover is high, and the true effect probably sits at around half the headline number. The intervention may still be worth funding, but the margin of safety is thinner than the model suggests, and the monitoring needs to be stronger before any scale-up.

★

The shock moment

Below roughly 25% fidelity, the realised effect drops below zero. The recommendation isn't just weak in that setting. It's actively destroying value — and the original model gave us almost no warning that this was possible.
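One way to picture the fidelity slider is as a smooth curve through the readings quoted above. The sketch below is hypothetical and is not the function the page uses. It fits a quadratic through three anchor points: roughly 0× at 25% fidelity and 2.7× at 75% (both quoted above), plus the assumption that 100% fidelity delivers the promised 4.8×.

```python
# Two readings quoted on the page, plus one assumption:
# 100% fidelity is taken to deliver the promised 4.8x.
ANCHORS = [(0.25, 0.0), (0.75, 2.7), (1.00, 4.8)]


def realised_return(fidelity, anchors=ANCHORS):
    """Lagrange interpolation through the anchor points (hypothetical curve)."""
    total = 0.0
    for i, (xi, yi) in enumerate(anchors):
        weight = 1.0
        for j, (xj, _) in enumerate(anchors):
            if i != j:
                weight *= (fidelity - xj) / (xi - xj)
        total += yi * weight
    return total


for pct in (100, 75, 50, 25, 10):
    print(f"fidelity {pct:>3}% -> {realised_return(pct / 100):+.2f}x")
```

Under this assumed curve the realised return falls faster than fidelity does, which is the shape the stress test is meant to convey. Any other monotone curve through the same points would serve equally well as an illustration.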

03 What would change our mind?

The monitoring and evidence gaps that actually matter.

Most monitoring systems are built to confirm that activities took place. Far fewer are built to detect whether the intervention is still working. The gap between those two questions is where most late-detected failures sit, quietly accruing, until someone finally looks for the right thing in the right place.

Each box below is a gap that any honest research team will recognise. The meter on the right gives a rough estimate of the confidence we would have that major underperformance would be detected in time to change course, rather than in time for a polite paragraph in the lookback.

Select the gaps that apply

We have no timely data on implementation quality in more than sixty per cent of sites.

Outcome data arrives eighteen to thirty months after decisions have been locked in.

We lack credible comparison groups or synthetic controls for most of the grant portfolio.

The external validity adjustment is, in practice, a single analyst's judgement, not a structured piece of evidence.

Failure modes haven't been pre-specified, so when one shows up we are unlikely to recognise it as the thing to worry about.

Confidence we would detect failure in time

78%

With current monitoring, we would probably notice large underperformance. Eventually.

Time to first credible signal: ~2 yrs

★

The pattern

Ticking three boxes drops confidence below 45% and pushes the detection window beyond five years. By the time the system catches up, the next funding cycle is already locked in, and the lookback explains the loss instead of preventing it.
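The meter's behaviour can be approximated with a very simple scoring rule. This sketch is an assumption throughout, not the page's formula. It treats all five gaps as equally damaging, and the per-gap penalty and delay are hypothetical values, picked only to be consistent with the two readings quoted above: 78% and about two years with no gaps ticked, and below 45% and beyond five years with three ticked.

```python
BASE_CONFIDENCE = 0.78   # reading quoted on the page (assumed to mean no gaps ticked)
BASE_YEARS = 2.0         # "~2 yrs" quoted on the page
GAP_PENALTY = 0.82       # hypothetical: each gap keeps 82% of the remaining confidence
YEARS_PER_GAP = 1.1      # hypothetical: each gap delays the first credible signal


def detection_outlook(gaps_ticked):
    """Return (confidence, years to first credible signal) for a number of gaps."""
    if not 0 <= gaps_ticked <= 5:
        raise ValueError("the page lists five gaps")
    confidence = BASE_CONFIDENCE * GAP_PENALTY ** gaps_ticked
    years = BASE_YEARS + YEARS_PER_GAP * gaps_ticked
    return confidence, years


for gaps in range(6):
    confidence, years = detection_outlook(gaps)
    print(f"{gaps} gaps: {confidence:.0%} confidence, ~{years:.1f} yrs to signal")
```

A multiplicative penalty is used so that confidence never goes negative however many gaps are ticked. In a real portfolio the gaps would almost certainly carry different weights.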

04 The fragility map

Now zoom out. See the entire parameter landscape at once.

You've moved sliders one at a time and tested one context at a time. The heatmap below shows the expected value for every combination of two parameters, with the others held at their defaults. Each cell is one possible reality. Green is robust, red is where the recommendation breaks, and the amber band in the middle is where most real-world grantmaking actually lives. The ★ marks the cell where the headline 4.8× recommendation currently sits.

[Interactive heatmap: expected return (×) over a grid of 2,500 scenarios. By default the vertical axis is pooled effect size (0.10 to 0.80) and the horizontal axis is implementation quality (40% to 100%). Cells are classed as robust (≥2×), fragile (1–2×) or breaks (<1×). Iso-contours are drawn at 1× (break-even), 2× (fragile band), 3.5× (confidence band) and 5× (neon ridge). The colour scale runs from 0× to 7×+.]

Robust region

42%

cells with EV ≥ 1.5×

Breaks region

23%

cells with EV < 1×

Cliff steepness

Moderate

how sharply EV falls across the grid

★

The full picture

The headline 4.8× sits in a single corner of this landscape. Most real grantmaking happens in the amber band where the answer is genuinely uncertain. Try plotting external validity against monitoring quality — the red corner expands fast, because poor monitoring multiplies the cost of being wrong. The job of cross-cutting research is to know which dimensions push you toward green and which push you off the red cliff.
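The grid behind a map like this can be sketched as follows. The return function is a placeholder and not the page's model: it assumes the return scales in proportion to effect size and implementation quality, and it is calibrated only so that the default inputs (0.42 SD, 85%) give the headline 4.8×. The 1× and 2× cut-offs follow the map's legend, and the 50 × 50 grid size is inferred from the 2,500 scenarios it mentions.

```python
def expected_return(effect_size, quality,
                    headline=4.8, ref_effect=0.42, ref_quality=0.85):
    """Placeholder model: return scales with effect size and implementation quality."""
    return headline * (effect_size / ref_effect) * (quality / ref_quality)


def classify(ev):
    """Bucket an expected return using the legend's 1x and 2x cut-offs."""
    if ev < 1.0:
        return "breaks"
    if ev < 2.0:
        return "fragile"
    return "robust"


def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def fragility_grid(n=50):
    """Evaluate an n x n grid and return (grid, share of cells per class)."""
    effects = linspace(0.10, 0.80, n)
    qualities = linspace(0.40, 1.00, n)
    grid = [[expected_return(e, q) for q in qualities] for e in effects]
    counts = {"robust": 0, "fragile": 0, "breaks": 0}
    for row in grid:
        for ev in row:
            counts[classify(ev)] += 1
    shares = {label: count / (n * n) for label, count in counts.items()}
    return grid, shares


grid, shares = fragility_grid()
for label, share in shares.items():
    print(f"{label}: {share:.0%} of {len(grid) * len(grid[0])} cells")
```

Swapping in a different pair of parameters only requires changing the two `linspace` ranges and the arguments passed to the return function. The shares this placeholder prints should not be expected to match the figures on the page.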

References & further reading

Public sources behind the framing.

None of the numbers on this page reproduce a GiveWell estimate or grant figure. The concepts and language, though, are taken directly from GiveWell's published work on cost-effectiveness, mistakes and moral weights, and from the broader effective altruism research conversation around uncertainty and red-teaming.

GiveWell · Public mistakes log · Our Mistakes

GiveWell · Methodology · Cost-Effectiveness

GiveWell · Moral weights · 2020 Moral Weights Research

GiveWell · Red teaming · Change Our Mind Contest Winners

Cold Takes · Holden Karnofsky · The Wicked Problem Experience

GiveWell · Intervention reports · Intervention Research Library

About this piece

Built as a thinking tool, not a deliverable.

The Fragility Lab started as a way to clarify my own thinking about how cost-effectiveness recommendations actually fail in the field, and how late a research team usually finds out about it. The tool is small and stylised. The intention behind it is not.

It sits in a portfolio of solo-built interactive evidence tools, alongside Travel Shockwaves (a global travel-disruption engine, my entry to the Capgemini UK Visualisation Guild Challenge 2026) and The Civilization Lab (193 nations across 1900–2050, 50+ indicators, currently in private development). All three are built end-to-end as single-file vanilla HTML, through a documented multi-model AI orchestration protocol in which I keep the editorial work to myself.

My background is twenty years across UNDP, UNICEF Supply Division, WHO Europe, LEGO and Capgemini, mostly working on the messy institutional data that senior decisions actually depend on. Ten of those years were at UNDP in cross-portfolio analytics across 160+ country offices, including health commodity procurement (bed nets, ARVs, TB and malaria treatments) with Global Fund PQR data alongside UNDP Atlas records.

Author: Oriol Cervantes Grau

Built: June 2026

Stack: Vanilla HTML / CSS / JS

Approach: Multi-model AI orchestration

Sources: GiveWell, UN, official statistics

Contact: oriol.cervantes@gmail.com

LinkedIn: in/oriol-cervantes