Evaluating LLMs as Expert Systems for Within-Host Malaria Modelling
Problem 1

Large Language Models are probabilistic text generators — not calculators.

When applied to within-host malaria modelling, base LLMs cannot reliably solve ODEs, estimate parameters, or produce numerically exact results. They approximate — and approximation is not enough for scientific inference.

Probabilistic outputs
LLMs sample each token from a probability distribution. They do not compute a result; they predict a plausible next token.
Numerical hallucination
When asked for a specific number, models confabulate plausible-sounding but wrong values with full confidence.
No execution guarantee
There is no deterministic solver behind the answer. The same prompt can yield different numbers on every run.
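The three points above can be illustrated with a toy sketch. The candidate tokens and their probabilities below are invented for illustration and do not come from any real model; the function names are likewise hypothetical. The sketch only shows that sampling produces run-to-run variation, and that always taking the most probable token makes the answer repeatable without making it computed.

```python
import random

# Toy next-token distribution for the prompt "The peak infection count is ...".
# Tokens and probabilities are invented for illustration only.
NEXT_TOKEN_PROBS = {"287": 0.40, "293": 0.35, "300": 0.25}


def sample_answer(rng: random.Random) -> str:
    """Draw one token according to its probability, as a sampling decoder would."""
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


def greedy_answer() -> str:
    """Always pick the most probable token: repeatable, but still not computed."""
    return max(NEXT_TOKEN_PROBS, key=NEXT_TOKEN_PROBS.get)


if __name__ == "__main__":
    rng = random.Random()  # unseeded, so the draws differ between runs
    print([sample_answer(rng) for _ in range(5)])
    print(greedy_answer())
```

Neither function integrates an equation; both only select among strings that look like plausible answers.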
Problem 2

You cannot see what an LLM is doing — and you cannot iterate on its work.

A base LLM produces a single monolithic response. There is no intermediate state to inspect, no parameters to adjust, and no way to replay a specific computation step.

Base LLM
"The peak infection count is approximately 287 people at around day 42..."
  • No visible computation steps
  • Cannot adjust parameters and re-run
  • Answer can change from run to run
Tool-Augmented System
1. Schema validated ✓
2. ODE solver executed (SUNDIALS) ✓
3. Peak extracted: I = 293.7 at t = 38 ✓
4. Artifacts generated ✓
  • Every step logged and inspectable
  • Adjust parameters, re-run instantly
  • Deterministic — same input, same output
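A minimal sketch of what "every step logged and inspectable" could look like in code. All names here (`StepRecord`, `RunLog`, `run_pipeline`) are hypothetical and not taken from the system described on this page, and the solver is a deliberately trivial stand-in (closed-form exponential decay) so that the focus stays on the logging rather than on the numerics.

```python
import math
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class StepRecord:
    name: str
    ok: bool
    detail: Any  # the step's result, or the error text if it failed


@dataclass
class RunLog:
    steps: list[StepRecord] = field(default_factory=list)

    def run(self, name: str, func: Callable[..., Any], *args: Any) -> Any:
        """Execute one step, record its outcome, and re-raise on failure."""
        try:
            result = func(*args)
        except Exception as exc:
            self.steps.append(StepRecord(name, False, repr(exc)))
            raise
        self.steps.append(StepRecord(name, True, result))
        return result


def validate(params: dict) -> dict:
    if params["rate"] <= 0 or params["initial"] < 0:
        raise ValueError("rate must be positive and initial must be non-negative")
    return params


def solve(params: dict) -> list[float]:
    """Stand-in solver: exponential decay sampled at t = 0, 1, ..., 10."""
    return [params["initial"] * math.exp(-params["rate"] * t) for t in range(11)]


def extract_peak(trajectory: list[float]) -> tuple[int, float]:
    t_peak = max(range(len(trajectory)), key=trajectory.__getitem__)
    return t_peak, trajectory[t_peak]


def run_pipeline(params: dict) -> RunLog:
    log = RunLog()
    try:
        checked = log.run("schema validated", validate, params)
        trajectory = log.run("solver executed", solve, checked)
        log.run("peak extracted", extract_peak, trajectory)
    except Exception:
        pass  # the failure is already recorded in log.steps
    return log


if __name__ == "__main__":
    for step in run_pipeline({"rate": 0.1, "initial": 10.0}).steps:
        print(step.name, "ok" if step.ok else "FAILED")
```

Because each step is a plain function of its inputs, running the pipeline twice with the same parameters yields identical logs, and a failed validation leaves a record that can be inspected instead of a silent wrong answer.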
Our Approach

A schema-first, tool-augmented runtime for compartmental modelling.

The LLM orchestrates validated tools — ODE solvers and Bayesian engines — instead of guessing answers. Every step is logged, every output is reproducible.
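One way to read "schema-first" is that a request is checked against a declared structure before any solver runs. The sketch below assumes a simple SEIR parameter set; the class `SEIRParams`, the function `parse_request`, and the field names are hypothetical and are not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SEIRParams:
    population: float
    beta: float
    sigma: float
    gamma: float
    initial_infected: float

    def __post_init__(self) -> None:
        # Range checks run at construction, so an invalid object cannot exist.
        for name in ("population", "beta", "sigma", "gamma"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")
        if not 0 <= self.initial_infected <= self.population:
            raise ValueError("initial_infected must lie in [0, population]")


def parse_request(raw: dict) -> SEIRParams:
    """Reject unknown or missing fields before any solver is called."""
    expected = set(SEIRParams.__dataclass_fields__)
    if set(raw) != expected:
        raise ValueError(f"expected fields {sorted(expected)}, got {sorted(raw)}")
    return SEIRParams(**{key: float(value) for key, value in raw.items()})


if __name__ == "__main__":
    request = {"population": 1000, "beta": 0.3, "sigma": 0.2,
               "gamma": 0.1, "initial_infected": 10}
    print(parse_request(request))
```

Under this design the LLM's job is reduced to producing the `raw` dictionary; anything malformed is rejected with an explicit error rather than passed on to the numerical tools.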

Live Demo

Base model vs. tool-augmented system.

What are the peak infections for a SEIR model with β = 0.3, σ = 0.2, γ = 0.1, N = 1000, I₀ = 10?
GPT-4 / Claude (base)
No tools, no solver — pure text generation
Based on my understanding of SEIR dynamics, with the given parameters (β = 0.3, σ = 0.2, γ = 0.1), the basic reproduction number R₀ = β/γ = 3.0, which indicates a significant outbreak. The peak infection count would be **approximately 280–320 individuals**, occurring around **day 35–45** of the epidemic. However, the exact timing and magnitude depend on the specific numerical integration method used. I'd estimate roughly **~300 people** at peak, around **day 40**.
Approximate · Non-deterministic
Model Builder

Parameters

population: 1000
beta: 0.48
sigma: 0.22
gamma: 0.18

Differential Equations

dS/dt = -beta * S * I / population
dE/dt = beta * S * I / population - sigma * E
dI/dt = sigma * E - gamma * I
dR/dt = gamma * I
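As a sketch of how these equations could be integrated deterministically, the following uses a hand-written fixed-step fourth-order Runge-Kutta scheme in place of a production solver such as SUNDIALS. It takes the standard SEIR form, in which each flow leaves one compartment with a minus sign and enters the next with a plus sign. The parameter defaults are the Model Builder values above; the initial condition (10 infected, no exposed or recovered), the 200-day horizon, and the step size are assumptions made for illustration, and the function names are hypothetical.

```python
State = tuple[float, float, float, float]  # (S, E, I, R)


def seir_rhs(state: State, population: float, beta: float,
             sigma: float, gamma: float) -> State:
    """Right-hand side of the SEIR system."""
    s, e, i, _ = state
    new_exposed = beta * s * i / population
    return (-new_exposed,
            new_exposed - sigma * e,
            sigma * e - gamma * i,
            gamma * i)


def rk4_step(state, dt, rhs):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * k for b, k in zip(base, slope))

    k1 = rhs(state)
    k2 = rhs(shifted(state, k1, dt / 2))
    k3 = rhs(shifted(state, k2, dt / 2))
    k4 = rhs(shifted(state, k3, dt))
    return tuple(y + dt / 6 * (a + 2 * b + 2 * c + d)
                 for y, a, b, c, d in zip(state, k1, k2, k3, k4))


def simulate(population=1000.0, beta=0.48, sigma=0.22, gamma=0.18,
             initial_infected=10.0, days=200.0, dt=0.1):
    # Assumed initial condition: everyone else susceptible, E = R = 0.
    state = (population - initial_infected, 0.0, initial_infected, 0.0)

    def rhs(y):
        return seir_rhs(y, population, beta, sigma, gamma)

    times, states = [0.0], [state]
    for n in range(1, round(days / dt) + 1):
        state = rk4_step(state, dt, rhs)
        times.append(n * dt)
        states.append(state)
    return times, states


def peak_infected(times, states):
    """Return (time, I) at the grid point where I is largest."""
    index = max(range(len(states)), key=lambda n: states[n][2])
    return times[index], states[index][2]


if __name__ == "__main__":
    t_peak, i_peak = peak_infected(*simulate())
    print(f"peak I = {i_peak:.1f} at t = {t_peak:.1f}")
```

Two properties are worth checking on any such run: S + E + I + R stays equal to the population at every step, because the four derivatives sum to zero, and repeated calls with the same arguments return identical trajectories.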