Case Study
Oncology Software: Evaluation harness before features.
An oncology software company’s product team wanted to ship a treatment-summary feature. The clinical team wanted to know what happens when the model is wrong.
- Sector: Clinical software
- Engagement: Sprint → retainer
- Status: Active
Situation
The product roadmap had three AI features scheduled for a Q2 release. The clinical advisory board had three questions: How do we know when the model is wrong? How often does that happen? Who gets called when a summary misrepresents a trial protocol?
Engineering had a working prototype. Nobody had a clear answer to any of those questions.
What we did
- Recommended pausing all three features until the team could measure them.
- Designed and built an evaluation harness: a set of 1,400 historical oncology cases, annotated by two independent clinicians, scored along five clinically meaningful axes.
- Ran each proposed feature against the harness. Two fell short of the minimum acceptable threshold the clinical director was willing to sign off on. One passed after a redesign of the prompt and retrieval stack.
- Wrote the team’s first internal standard for “What counts as ready to ship a model-assisted feature in a clinical product.”
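As a rough illustration of the kind of gate described above, the sketch below scores a feature per axis and compares each axis against a signed-off threshold. Everything specific in it is an assumption made for illustration, not the harness that was actually built: the five axis names, the 0 to 1 score scale, the averaging of the two clinicians' scores, and the names `CaseScore`, `axis_means`, and `passes_gate` are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean

# Hypothetical axis names: the case study does not list the five axes.
AXES = (
    "factual_accuracy",
    "protocol_fidelity",
    "completeness",
    "safety",
    "clarity",
)


@dataclass(frozen=True)
class CaseScore:
    """Scores for one model output on one historical case (assumed 0.0 to 1.0)."""

    case_id: str
    clinician_a: dict[str, float]  # axis name -> score from the first annotator
    clinician_b: dict[str, float]  # axis name -> score from the second annotator


def axis_means(scores: list[CaseScore]) -> dict[str, float]:
    """Average the two clinicians per case, then average across all cases."""
    return {
        axis: mean((s.clinician_a[axis] + s.clinician_b[axis]) / 2 for s in scores)
        for axis in AXES
    }


def passes_gate(
    scores: list[CaseScore], thresholds: dict[str, float]
) -> tuple[bool, list[str]]:
    """Return (passed, failing_axes). A feature passes only if no axis falls short."""
    means = axis_means(scores)
    failing = [axis for axis in AXES if means[axis] < thresholds[axis]]
    return (not failing, failing)
```

Requiring every axis to clear its own threshold, instead of averaging the axes into one number, is a deliberate choice in this sketch: a strong clarity score cannot offset a weak safety score. Returning the failing axes alongside the verdict gives the team a concrete starting point for a redesign.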
Outcome
- One feature shipped to production in Q3, three months later than originally planned, with a clinical sign-off that neither the CTO nor the Chief Medical Officer had had before.
- Two features were cut, permanently. Both were replaced with simpler deterministic tooling that cost the company nothing to maintain.
- The eval harness is now the default gate for any new model-touching feature. The engagement converted to an ongoing advisory retainer.