How I won the NIH replication prize by using AI to validate drug targets at scale

About 90% of cancer drug candidates that enter clinical trials never make it to approval. A big chunk of that failure is upstream: the target was wrong. Two industry audits made this concrete years ago. Bayer reported in 2011 that only 20–25% of published cancer targets held up when their own scientists tried to reproduce them; Amgen in 2012 said just 6 out of 53 "landmark" oncology studies survived rigorous replication. We've known this for a long time. We just haven't had a way to do something about it at scale (at least in the published literature).

Manually re-validating every published target is tedious. You'd need to harmonize CRISPR, omics, and other datasets, work out the right disease subgroupings, write the code, run the stats, and inspect the output. Each target takes days to validate, and in academia nobody is funded to do it. So most candidates sit there, cited, repeated, occasionally bankrolled into a screen.

So I tried something else, because it was 2025 when I did this. I gave the job to an AI agent (Biomni) and ran 31 published oncology targets through it in an afternoon. The compute cost $68 in Claude API credits. About two-thirds of the retracted-paper targets failed to replicate, while roughly two-thirds of the recent, non-retracted targets did replicate. Compared with the retracted targets, the non-retracted ones had an odds ratio of 17 for showing bona fide, context-specific dependency in agent analyses that I had validated as correct.

The interesting part isn't the headline number. It's how to get an agent to do this kind of work without it making things up.

1. Find out what the agent can do reliably

Most of the hype around "AI scientists" frames the agent as a generalist that does everything. That's a trap. LLMs hallucinate, especially when asked to use tools or data that they either can't access or don't know how to use. And when they do, they will almost always write you a beautiful, plausible, partly-wrong narrative.

The move is to find a task class where the agent is reliable, say, above 95% success rate on something you can score. For me that task is: given a gene target, a disease context, and a public dataset like DepMap or TCGA, test whether the gene shows context-specific cancer dependency. Narrow enough that the agent's job is mostly translating a hypothesis into code and stats. Reliable enough that I can trust the agent's executions.
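
To make that task class concrete, here is a minimal sketch of one way to test context-specific dependency: a one-sided permutation test on the difference in mean dependency between in-context and out-of-context cell lines. The function name, the choice of a permutation test, and the toy scores are all my assumptions for illustration, not the agent's actual analysis or real DepMap values.

```python
import random
from statistics import mean

def permutation_test(in_context, out_context, n_perm=10_000, seed=0):
    """One-sided permutation test on the difference in mean dependency.

    Asks: is the mean score in the context group lower (more negative,
    i.e. a stronger dependency) than in the remaining cell lines?
    """
    rng = random.Random(seed)
    observed = mean(in_context) - mean(out_context)
    pooled = list(in_context) + list(out_context)
    k = len(in_context)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:k]) - mean(pooled[k:])
        if diff <= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value

# Toy gene-effect scores (more negative = stronger dependency).
# Illustrative numbers only, not real DepMap values.
in_context = [-1.1, -0.9, -1.3, -0.8, -1.0, -1.2]
out_context = [-0.1, 0.0, -0.2, 0.1, -0.3, -0.1, 0.05, -0.15]

effect, p = permutation_test(in_context, out_context)
print(f"mean difference = {effect:.2f}, one-sided p = {p:.4f}")
```

A task this narrow is easy to score: either the contrast was run on the right groups with a sensible test, or it wasn't.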

2. Apply it across many use cases

Once you know the agent does one type of thing well, throw a lot of that thing at it. I built a table of 31 targets: 17 from retracted papers, 14 recent candidates with real-looking evidence. Each verbal target claim got translated into a structured natural language prompt with the same template. Gene, context, datasets to use, statistical contrasts to run.
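
As a rough illustration of what "the same template" can look like, the sketch below renders a claim into a fixed four-field prompt. The class name, field labels, and wording are hypothetical; the actual prompts are in the repository linked at the end of this post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TargetClaim:
    """One row of the target table (field names are hypothetical)."""
    gene: str
    context: str
    datasets: tuple[str, ...]
    contrasts: tuple[str, ...]

# Every target is rendered through the same fixed template
TEMPLATE = """\
Gene: {gene}
Context: {context}
Datasets to use: {datasets}
Statistical contrasts to run:
{contrasts}"""

def render_prompt(claim):
    """Turn a structured claim into the natural-language prompt."""
    contrasts = "\n".join(f"- {c}" for c in claim.contrasts)
    return TEMPLATE.format(
        gene=claim.gene,
        context=claim.context,
        datasets=", ".join(claim.datasets),
        contrasts=contrasts,
    )

# Example row, using the PRMT5 / MTAP control discussed later in this post
claim = TargetClaim(
    gene="PRMT5",
    context="MTAP-deleted cancers",
    datasets=("DepMap",),
    contrasts=("PRMT5 dependency in MTAP-deleted vs. MTAP-intact cell lines",),
)
prompt = render_prompt(claim)
print(prompt)
```

Keeping the claim as structured data means the same table can drive the prompts, the run log, and the scoring sheet.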

When I first started playing with the agent, the biggest failure mode wasn't bad reasoning. It was the agent failing to access or download the right data files. Then it would start hallucinating or simulating fake data for its analyses. To stop this, I wrote a separate cancer-omics data know-how document that spelled out how to pull DepMap through the Bioconductor depmap package and how to grab TCGA Pan-Cancer Atlas data from the NCI Genomic Data Commons. This was before Anthropic released the Skills feature; today you'd just package it as a skill. Once the agent stopped fighting the data layer, the rest of the work got dramatically easier.

Two more constraints made the difference:

Forbid the agent from reading literature. I appended a non-overridable instruction: "You are a data-only replication agent. Do not use any literature search, papers, or external textual knowledge." Without that, the agent fills in gaps from training data, which means it tells you the consensus view of whatever paper it dimly remembers. You want what the data says.
Force everything into executable code. No prose conclusions. Every claim has to come from a notebook cell that loaded real data and ran a real test for me to review.
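
Both constraints can be backed by small pieces of plumbing. The sketch below appends the data-only instruction to a prompt and adds a loader guard that raises instead of letting an analysis continue without a real file. The helper names and the file names are hypothetical, and appending text to a prompt does not by itself make an instruction non-overridable; that depends on how the agent harness treats it.

```python
import tempfile
from pathlib import Path

# Quoted from the instruction described above
DATA_ONLY_RULE = (
    "You are a data-only replication agent. Do not use any literature "
    "search, papers, or external textual knowledge."
)

def with_data_only_rule(prompt):
    """Append the data-only instruction as the last thing the agent reads."""
    return f"{prompt.rstrip()}\n\n{DATA_ONLY_RULE}"

def require_real_file(path, min_bytes=1):
    """Return the path if it is a non-empty file; otherwise fail loudly."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"Refusing to continue without real data: {p}")
    if p.stat().st_size < min_bytes:
        raise ValueError(f"Data file is empty: {p}")
    return p

final_prompt = with_data_only_rule("Gene: PRMT5\nContext: MTAP-deleted cancers\n")

# Demo with a temporary directory; the file names are placeholders
with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "gene_effect.csv"
    real.write_text("cell_line,PRMT5\nLINE_A,-1.0\n")
    loaded_ok = require_real_file(real) == real
    try:
        require_real_file(Path(tmp) / "missing.csv")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

print(final_prompt)
print("real file accepted:", loaded_ok, "| missing file rejected:", missing_raised)
```

The point of the guard is that a failed download ends in an exception a reviewer can see, not in a quietly simulated dataset.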

3. Validate the process before you trust the results

Before I believed anything the agent said about retracted targets, I needed proof it could find the real ones. So I seeded the panel with well-established synthetic lethal relationships: WRN in microsatellite-unstable tumors, PRMT5 in MTAP-deleted cancers.

The agent successfully re-derived the MTAP–PRMT5 relationship in detail. It stratified cell lines by copy number using a sensible 15% threshold it picked itself, compared dependency between groups, ran the dose-response across copy-number quartiles, and landed on effect sizes consistent with the literature and p-values from 10⁻⁹ to 10⁻¹¹. Once those controls worked, the rest of the panel became interpretable.
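
For readers who want a feel for the quartile step, here is a minimal sketch of a dose-response check: bin cell lines into copy-number quartiles and compare mean dependency across bins. It is my own simplification with made-up numbers, not the agent's notebook, and it does not reproduce the 15% threshold the agent chose.

```python
from statistics import mean, quantiles

def mean_by_quartile(copy_number, dependency):
    """Mean dependency score within each copy-number quartile (1 = lowest)."""
    q1, q2, q3 = quantiles(copy_number, n=4)
    bins = {1: [], 2: [], 3: [], 4: []}
    for cn, dep in zip(copy_number, dependency):
        if cn <= q1:
            bins[1].append(dep)
        elif cn <= q2:
            bins[2].append(dep)
        elif cn <= q3:
            bins[3].append(dep)
        else:
            bins[4].append(dep)
    # Skip empty bins so mean() never sees an empty list
    return {q: mean(vals) for q, vals in bins.items() if vals}

# Toy values for eight cell lines (illustrative only):
# relative copy number of the deleted gene, and dependency on its partner
copy_number = [0.0, 0.1, 0.5, 0.6, 0.9, 1.0, 1.2, 1.3]
dependency = [-1.2, -1.1, -0.8, -0.7, -0.4, -0.3, -0.1, -0.2]

trend = mean_by_quartile(copy_number, dependency)
for q, value in trend.items():
    print(f"quartile {q}: mean dependency = {value:.2f}")
```

A real synthetic lethal pair should show the pattern this toy data mimics: the lower the copy number, the stronger the dependency.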

4. Look at every output myself

This is the unglamorous part nobody talks about. The agent produces 31 Python notebooks. A human has to read them to validate and learn what happened. Did the data actually load? Did the statistical test make sense for the question? Did the agent silently swap in a different dataset when the first one failed? Did it interpret "wild type" the same way you meant?
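
A script can't answer those questions, but it can tell you where to look first. The sketch below scans a notebook's JSON for code cells that produced an error or that contain patterns suggestive of simulated data. The pattern list and function name are my own assumptions, and it is a triage aid, not a substitute for reading the notebook.

```python
import json
import re

# Patterns that suggest the agent may have fabricated data (a heuristic list)
SUSPECT = re.compile(
    r"np\.random|numpy\.random|random\.(normal|rand|seed)|simulat|synthetic|mock",
    re.IGNORECASE,
)

def triage_notebook(nb):
    """Flag code cells worth a closer look.

    nb is a parsed .ipynb document (nbformat 4 JSON).
    Returns a list of (cell_index, reason) tuples.
    """
    flags = []
    for i, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        src = cell.get("source", "")
        if isinstance(src, list):  # source may be stored as a list of lines
            src = "".join(src)
        if SUSPECT.search(src):
            flags.append((i, "possible simulated data"))
        for out in cell.get("outputs", []):
            if out.get("output_type") == "error":
                flags.append((i, f"error output: {out.get('ename', 'unknown')}"))
    return flags

# A tiny in-memory notebook standing in for a real .ipynb file
raw = json.dumps({
    "cells": [
        {"cell_type": "code", "outputs": [],
         "source": ["import pandas as pd\n", "df = pd.read_csv('gene_effect.csv')"]},
        {"cell_type": "code", "outputs": [],
         "source": "import numpy as np\ndf = np.random.normal(size=(100, 5))"},
        {"cell_type": "code", "source": "open('tcga.tsv')",
         "outputs": [{"output_type": "error", "ename": "FileNotFoundError",
                      "evalue": "tcga.tsv", "traceback": []}]},
    ]
})
flags = triage_notebook(json.loads(raw))
for index, reason in flags:
    print(f"cell {index}: {reason}")
```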

I scored every one of the 31 notebooks manually. After the steps above, few analysis components turned out to be wrong. The rest I coded as supported, refuted, or inconclusive on two axes: context-specific dependency, and other supporting evidence.
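
Scores like these roll up into a 2×2 table, from which an odds ratio and a Fisher's exact p-value follow directly. The sketch below shows the arithmetic with placeholder counts that are not the study's tallies; treating "inconclusive" as "not supported" is also my assumption, and the group labels are hypothetical.

```python
from math import comb

def two_by_two(scores):
    """Tally (group, label) pairs into a 2x2 table.

    Returns (a, b, c, d):
      a = non-retracted and supported,  b = non-retracted and not supported,
      c = retracted and supported,      d = retracted and not supported.
    Assumption: "refuted" and "inconclusive" both count as not supported.
    """
    a = b = c = d = 0
    for group, label in scores:
        supported = label == "supported"
        if group == "non_retracted":
            a, b = (a + 1, b) if supported else (a, b + 1)
        else:
            c, d = (c + 1, d) if supported else (c, d + 1)
    return a, b, c, d

def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c); infinite if a denominator cell is zero."""
    return (a * d) / (b * c) if b * c else float("inf")

def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null with fixed margins."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    return sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1)) / total

# Placeholder scores, NOT the study's results
scores = (
    [("non_retracted", "supported")] * 8
    + [("non_retracted", "refuted")] * 2
    + [("retracted", "supported")] * 3
    + [("retracted", "refuted")] * 5
    + [("retracted", "inconclusive")] * 2
)

a, b, c, d = two_by_two(scores)
odds = odds_ratio(a, b, c, d)
p = fisher_one_sided(a, b, c, d)
print(f"table = {(a, b, c, d)}, odds ratio = {odds:.2f}, one-sided p = {p:.4f}")
```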

Expert review isn't optional. The good news: it's faster than doing the analysis yourself. Maybe 15 minutes per notebook, against the several days it would take from scratch.

The most interesting result wasn't the big retracted-versus-non-retracted split. It was ALKBH5. The original paper was retracted, and the specific mechanistic claim (that miR-193a-3p regulates AKT2 through ALKBH5) didn't hold up. But the agent independently found that ALKBH5 itself is a real, glioma-selective dependency, with consistent CRISPR and RNAi signals, a strong correlation with stemness scores, a very strong negative correlation with the m6A gene signature, and a significant survival hazard ratio across gliomas.

You get insights like this because the agent decomposed the target claim into testable pieces and ran each one independently. That's the part I didn't expect, and it's the part that's made me think this approach generalizes well beyond target replication.

On AI Scientist Arena (aiscientistarena.com), I've benchmarked LLMs, and even without any sophisticated tool use or harness, they could predict clinical trial success beyond noise. If AI agents continue to improve across all the tasks in the drug discovery and development cycle, the best constructor of an entire clinical program might end up being an AI.

All of this (the prompts, the data and replication know-how documents, the 31 notebooks, the expert scoring) is at github.com/Huang-lab/AgentReplication. The bioRxiv preprint is titled "Agent-Driven Validation of Oncology Therapeutic Targets." This is part of the work that initiated the Accelerated Discovery with Agents (ADA) Consortium.

There's a version of this work that sounds bigger than it is. "AI agent validates 31 cancer drug targets in one hour" is technically true and somewhat misleading. The hour is the agent's compute time. Building the prompts, curating the targets, writing the know-how documents, and reviewing every notebook took weeks. The agent isn't doing the science. It's doing the implementation.

The science is still in deciding what to ask and whether the answer means anything that could benefit humans.

Postscript, May 2026: This work, done in November 2025, was my Track 2 submission to the NIH Replication Prize, and I thought it was the better entry. My other entry, proposing mandatory release of participant-level clinical trial data, won Track 1.
