Compartmental AI Workbench

Evaluating LLMs as Expert Systems for Within-Host Malaria Modelling

Problem 1

Large Language Models are probabilistic text generators — not calculators.

When applied to within-host malaria modelling, base LLMs cannot reliably solve ODEs, estimate parameters, or produce numerically exact results. They approximate — and approximation is not enough for scientific inference.

Probabilistic outputs

LLMs sample tokens from probability distributions. They do not run a numerical computation; they pick a likely next token.

Numerical hallucination

When asked for a specific number, models confabulate plausible-sounding but wrong values with full confidence.

No execution guarantee

There is no deterministic solver behind the answer. The same prompt can yield different numbers on every run.
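
The run-to-run variation described above can be illustrated with a minimal sketch of temperature sampling. The token scores below are hypothetical and are not taken from any real model; the point is only that repeated draws from the same distribution need not agree.

```python
import math
import random
from collections import Counter

def sample_token(logits, temperature, rng):
    """Draw one token from softmax(logits / temperature)."""
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for the token that follows
# "The peak infection count is approximately".
logits = {"287": 2.0, "293": 1.9, "300": 1.8, "310": 1.2}

rng = random.Random()  # no fixed seed, so each run can differ
samples = [sample_token(logits, temperature=1.0, rng=rng) for _ in range(20)]
print(Counter(samples))  # counts vary from run to run
```

Every candidate here looks like a plausible peak value, which is why a sampled number can read as confident while still being wrong.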

Problem 2

You cannot see what an LLM is doing — and you cannot iterate on its work.

A base LLM produces a single monolithic response. There is no intermediate state to inspect, no parameters to adjust, and no way to replay a specific computation step.

Base LLM

→ "The peak infection count is approximately 287 people at around day 42..."

No visible computation steps
Cannot adjust parameters and re-run
Different answer every time

Tool-Augmented System

1. Schema validated ✓

2. ODE solver executed (SUNDIALS) ✓

3. Peak extracted: I = 293.7 at t = 38 ✓

4. Artifacts generated ✓

Every step logged and inspectable
Adjust parameters, re-run instantly
Deterministic — same input, same output
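
One way to make "every step logged" and "same input, same output" checkable is to record a digest of each step's output. The sketch below is an assumption about how such a log could look, not the workbench's actual implementation; the step names, the `run_logged` helper, and the parameter schema are hypothetical, and the solver step is left out to keep the example short.

```python
import hashlib
import json

def digest(obj):
    """Short SHA-256 digest of a JSON-serialisable object with sorted keys."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]

def run_logged(steps, request):
    """Apply each (name, function) step in order and log a digest of its output."""
    state, log = request, []
    for name, fn in steps:
        state = fn(state)
        log.append({"step": name, "status": "ok", "output_sha256": digest(state)})
    return state, log

def validate(req):
    # Hypothetical schema: four named, strictly positive parameters.
    for key in ("beta", "sigma", "gamma", "N"):
        if not (isinstance(req.get(key), (int, float)) and req[key] > 0):
            raise ValueError(f"invalid or missing parameter: {key}")
    return req

def add_r0(req):
    # R0 = beta / gamma for an SEIR model without births or deaths.
    return {**req, "R0": req["beta"] / req["gamma"]}

steps = [("schema_validated", validate), ("r0_computed", add_r0)]
request = {"beta": 0.3, "sigma": 0.2, "gamma": 0.1, "N": 1000}

_, log_a = run_logged(steps, request)
_, log_b = run_logged(steps, request)                   # replay with the same input
_, log_c = run_logged(steps, {**request, "beta": 0.4})  # adjusted parameter
print(log_a == log_b, log_a == log_c)  # True False
```

Identical digests on replay show that nothing in the chain is sampled, and a changed digest after adjusting `beta` shows which steps were affected by the new input.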

Our Approach

A schema-first, tool-augmented runtime for compartmental modelling.

The LLM orchestrates validated tools — ODE solvers and Bayesian engines — instead of guessing answers. Every step is logged, every output is reproducible.

User Prompt

Natural language modelling request

LLM Agent

Orchestrates tool calls and reasoning

Schema Validator

Validates compartmental model structure

ODE Solver

SUNDIALS deterministic integration

Stan HMC

Bayesian parameter estimation

Artifact Store

Plots, tables, traces, provenance

Technical Report

Reproducible, inspectable output
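
The hand-off from the LLM agent to a validated tool can be sketched as a JSON tool call that the runtime parses and dispatches. Everything named here (`TOOLS`, `dispatch`, `validate_model`, and the call format) is a hypothetical illustration under simple assumptions, not the workbench's real interface; the structural checks are only a minimal example of what a schema validator might enforce.

```python
import json

def validate_model(compartments, flows, parameters):
    """Check that every flow links declared compartments and uses a declared rate."""
    errors = []
    for flow in flows:
        for end in ("source", "target"):
            if flow[end] not in compartments:
                errors.append(f"unknown compartment: {flow[end]}")
        if flow["rate"] not in parameters:
            errors.append(f"unknown parameter: {flow['rate']}")
    for name, value in parameters.items():
        if value <= 0:
            errors.append(f"non-positive parameter: {name}")
    return {"valid": not errors, "errors": errors}

# Hypothetical registry mapping tool names to callables.
# A fuller runtime would also register the solver and the estimation engine.
TOOLS = {"validate_model": validate_model}

def dispatch(tool_call_json):
    """Parse a JSON tool call emitted by the agent and run the named tool."""
    call = json.loads(tool_call_json)
    if call.get("tool") not in TOOLS:
        raise KeyError(f"unregistered tool: {call.get('tool')}")
    return TOOLS[call["tool"]](**call["args"])

call = json.dumps({
    "tool": "validate_model",
    "args": {
        "compartments": ["S", "E", "I", "R"],
        "flows": [
            {"source": "S", "target": "E", "rate": "beta"},
            {"source": "E", "target": "I", "rate": "sigma"},
            {"source": "I", "target": "R", "rate": "gamma"},
        ],
        "parameters": {"beta": 0.3, "sigma": 0.2, "gamma": 0.1},
    },
})
result = dispatch(call)
print(result)  # {'valid': True, 'errors': []}
```

In this arrangement the agent only chooses which tool to call and with which arguments; the numbers in the final report come from the tools themselves.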

Live Demo

Base model vs. tool-augmented system.

What is the peak number of infections for an SEIR model with β = 0.3, σ = 0.2, γ = 0.1, N = 1000, I₀ = 10?

GPT-4 / Claude (base)

No tools, no solver — pure text generation

Based on my understanding of SEIR dynamics, with the given parameters (β = 0.3, σ = 0.2, γ = 0.1), the basic reproduction number R₀ = β/γ = 3.0, which indicates a significant outbreak. The peak infection count would be **approximately 280–320 individuals**, occurring around **day 35–45** of the epidemic. However, the exact timing and magnitude depend on the specific numerical integration method used. I'd estimate roughly **~300 people** at peak, around **day 40**.

Approximate · Non-deterministic

Model Builder

Parameters

population: 1000

beta: 0.48

sigma: 0.22

gamma: 0.18

Observable

Dataset ID

Differential Equations

dS/dt = −beta * S * I / population

dE/dt = beta * S * I / population − sigma * E

dI/dt = sigma * E − gamma * I

dR/dt = gamma * I
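
As a rough illustration of how these four equations can be integrated deterministically, the sketch below uses a fixed-step classical Runge-Kutta (RK4) scheme in plain Python. This is a stand-in chosen to keep the example dependency-free; it is not SUNDIALS. The initial condition (10 infectious, everyone else susceptible), the time horizon, and the step size are assumptions, since the Model Builder does not state them.

```python
def seir_rhs(state, beta, sigma, gamma, population):
    """Right-hand side of the four equations, in the order (S, E, I, R)."""
    S, E, I, R = state
    infection = beta * S * I / population
    return (-infection, infection - sigma * E, sigma * E - gamma * I, gamma * I)

def rk4_step(state, dt, **params):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(k, scale):
        return tuple(x + scale * dx for x, dx in zip(state, k))
    k1 = seir_rhs(state, **params)
    k2 = seir_rhs(shifted(k1, dt / 2), **params)
    k3 = seir_rhs(shifted(k2, dt / 2), **params)
    k4 = seir_rhs(shifted(k3, dt), **params)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def simulate(initial, t_end, dt, **params):
    """Integrate from t = 0 to t_end and return a list of (t, state) pairs."""
    steps = round(t_end / dt)
    trajectory = [(0.0, initial)]
    state = initial
    for n in range(1, steps + 1):
        state = rk4_step(state, dt, **params)
        trajectory.append((n * dt, state))
    return trajectory

# Parameter values as shown in the Model Builder.
params = {"beta": 0.48, "sigma": 0.22, "gamma": 0.18, "population": 1000}
# Assumed initial condition: 10 infectious, the rest susceptible.
initial = (990.0, 0.0, 10.0, 0.0)

trajectory = simulate(initial, t_end=200.0, dt=0.1, **params)
peak_t, peak_state = max(trajectory, key=lambda point: point[1][2])
print(f"peak I = {peak_state[2]:.1f} at t = {peak_t:.1f}")
```

Because the right-hand sides sum to zero, S + E + I + R should stay equal to the population throughout the run, which gives a quick sanity check on any solver output.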
