LLMs Fail Biomedical Code Test: New Agent Hits 74% Accuracy

by Maya Grant

A Nature Biomedical Engineering benchmark finds top LLMs score below 40% accuracy on 293 biomedical coding tasks, while a new iterative AI agent reaches 74% by drafting and refining an analysis plan before writing code. A companion collaborative platform let researchers complete over 80% of the analysis code for real studies.


Biomedical researchers hoping to lean on large language models for data analysis face a stark reality: these tools falter badly on real-world coding tasks. A new benchmark from University of Illinois researchers reveals that even top proprietary and open-source LLMs score below 40% accuracy when generating code for biomedical data science, raising alarms about blindly trusting AI outputs in high-stakes research.

The study, published January 22, 2026, in Nature Biomedical Engineering, introduces BioDSBench, a rigorous test set of 293 coding tasks pulled from 39 peer-reviewed studies spanning seven areas: biomarkers, integrative analysis, genomic profiling, molecular characterization, therapeutic response, translational research, and pan-cancer analysis. Tasks demand everything from plotting survival curves to integrative multi-omics visualizations, using real anonymized patient data from cBioPortal and UCSC Xena.
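To ground one of those tasks: plotting a survival curve reduces to computing the Kaplan-Meier estimator, S(t) = product over event times of (1 - deaths/at-risk). The sketch below is a minimal pure-Python illustration of that calculation, not a reference solution from the benchmark:

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.
    times: observation times; events: 1 = event occurred, 0 = censored."""
    data = sorted(zip(times, events))
    n = len(data)   # subjects currently at risk
    s = 1.0         # running survival probability
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = at_t = 0
        # Group all subjects observed at time t.
        while i < len(data) and data[i][0] == t:
            at_t += 1
            deaths += data[i][1]
            i += 1
        if deaths:
            s *= 1 - deaths / n   # step down only at event times
            curve.append((t, s))
        n -= at_t   # events and censored subjects both leave the risk set

    return curve

# Toy cohort: event at t=2, censored at t=3, events at t=4 and t=5.
curve = kaplan_meier([2, 3, 4, 5], [1, 0, 1, 1])
```

Censoring is the subtlety the benchmark probes: a censored subject shrinks the risk set without stepping the curve down, which generic code generation often gets wrong.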

“Large language models (LLMs) can generate impressive data visualizations from simple requests, yet their accuracy remains underexplored,” the authors write, led by Zifeng Wang and Benjamin Danek of Keiji AI and the University of Illinois Urbana-Champaign, with corresponding author Jimeng Sun.


Benchmark Exposes Cracks in AI Foundations

Eight proprietary models, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5, and OpenAI o3-mini, and eight open-source ones such as Llama 3, Code Llama, Qwen2.5-coder, and Deepseek-R1 were pitted against BioDSBench under chain-of-thought prompting and retrieval-augmented generation. None broke 40% overall accuracy. "This low accuracy raises serious concerns about the risk of propagating incorrect scientific findings when blindly relying on AI-generated analyses," the paper warns.

Proprietary models edged out open-source counterparts slightly, but both struggled with biomedical nuances like handling genomic datasets or therapeutic response metrics. The benchmark, hosted on Hugging Face at https://huggingface.co/datasets/zifeng-ai/BioDSBench, includes reference solutions and test cases for reproducibility.

The finding is not isolated: an npj Digital Medicine paper on medical LLMs notes persistent gaps in specialized knowledge, while Scientific Reports highlights scalability issues in oncology tasks.

Iterative Agents Rescue Reliability

To fix this, the team built an AI agent that drafts and iteratively refines an analysis plan before coding, drawing on ReAct reasoning-acting synergy and self-refine feedback loops. This boosted accuracy to 74%, nearly doubling baseline performance. The agent breaks complex tasks into steps: plan, code, test, refine.
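The plan/code/test/refine loop can be caricatured in a few lines. In the sketch below, the `llm` function is a canned stub standing in for a real model call, so the prompts and helper names are hypothetical; only the control flow mirrors what the paper describes:

```python
def llm(prompt: str) -> str:
    # Hypothetical model stub: emits a plan, then a buggy draft, then a fix.
    if "PLAN" in prompt:
        return "1. compute the mean of the values\n2. return it"
    if "FIX" in prompt:
        return "def analyze(values):\n    return sum(values) / len(values)"
    return "def analyze(values):\n    return sum(values) / 0  # buggy draft"

def run_tests(code: str) -> bool:
    # Execute generated code against a reference test case.
    ns = {}
    try:
        exec(code, ns)
        return ns["analyze"]([1, 2, 3]) == 2.0
    except Exception:
        return False

def agent(task: str, max_rounds: int = 3) -> str:
    plan = llm(f"PLAN: {task}")             # 1. draft an analysis plan
    code = llm(f"code for plan:\n{plan}")   # 2. generate code from the plan
    for _ in range(max_rounds):             # 3-4. test, then refine on failure
        if run_tests(code):
            return code
        code = llm(f"FIX this code:\n{code}")
    raise RuntimeError("agent failed to produce passing code")

final = agent("compute the average biomarker value")
```

The key design choice, per the paper, is that refinement is driven by executable feedback: failed tests, not the model's own judgment, decide whether another round is needed.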

Code for the agent lives on GitHub at https://github.com/RyanWangZf/BioDSBench. In a user study, five medical researchers used a new human-AI platform to co-develop plans and execute them, finishing over 80% of analysis code for three real studies. The platform, accessible via https://keiji.ai/contact.html, integrates planning, coding, and execution in one environment; see the demo at https://www.youtube.com/watch?v=c5ZJsFXQ_B0.

“Benchmarking eight proprietary and eight open-source LLMs under various prompting strategies reveals an overall accuracy below 40%,” per the Nature paper. Figures in the study detail model comparisons (Fig. 2) and adaptation strategies (Fig. 3).

Real-World Ripples and Broader Warnings

X posts from insiders like Stephen Turner echo the findings, sharing the paper as essential reading for data scientists. Broader critiques abound: npj Digital Medicine calls out shaky foundations for electronic health records, and npj Precision Oncology scrutinizes oncology applications.

The low scores underscore why biomedical work can’t afford hallucinations—errors in genomic profiling or pan-cancer analysis could mislead drug development. Yet the agentic fix points forward: structured planning tames LLM chaos. Ziwei Yang of Kyoto University and Zheng Chen of Osaka University aided dataset curation.

Funding came from NSF grants and JSPS, with no competing interests declared. Peer-reviewed by Chao Yan and others.

Path Forward for Trustworthy AI Tools

This platform shifts the paradigm from solo LLM reliance to collaborative copilots. Researchers can now iteratively refine plans with AI and execute them in an integrated environment, slashing manual coding time. User study results (Fig. 5) show practical gains, with over 80% task completion.

While limited to 293 tasks and five users, the approach looks extensible: tasks drawn from cBioPortal data ensure real-world relevance. As npj Artificial Intelligence notes on LLMs in science, deep integration with human goals is key, backed by clear metrics.

Jimeng Sun, the corresponding author, is credited with supervision in the acknowledgments. For insiders, BioDSBench sets a new standard: test your copilot before trusting it.

Maya Grant

Maya Grant specializes in health tech and reports on the systems behind modern business. They work through long‑form narratives grounded in real‑world metrics to make complex topics approachable. They frequently compare approaches across industries to surface patterns that travel well. Their perspective is shaped by interviews across engineering, operations, and leadership roles. They write about both the promise and the cost of transformation, including risks that are easy to overlook. They avoid buzzwords, focusing instead on outcomes, incentives, and the human side of technology. They are known for dissecting tools and strategies that improve execution without adding complexity. They frequently translate research into action for marketing teams, prioritizing clarity over buzzwords. They maintain a balanced tone, separating speculation from evidence. They explore how policies, markets, and infrastructure intersect to create second‑order effects. Readers appreciate their ability to connect strategic goals with everyday workflows. Outside of publishing, they track public datasets and industry benchmarks. They value transparency, practical advice, and honest uncertainty.
