Case Study
Oncology Software: Evaluation harness before features.
An oncology software company's product team wanted to ship a treatment-summary feature. The clinical team wanted to know what happens when the model is wrong.
- Sector: Clinical software
- Engagement: Sprint → retainer
- Status: Active
Situation
The product roadmap had three AI features slated for a Q2 release. The clinical advisory board had questions nobody on the team could answer: How do we know when the model is wrong? How often is that? Who gets called when a summary misrepresents a trial protocol?
Engineering had a working prototype. Nobody had a clear answer to the clinical team’s questions.
What we did
- Recommended pausing all three features until the team could measure them.
- Designed and built an evaluation harness: a set of 1,400 historical oncology cases, annotated by two independent clinicians, scored along five clinically meaningful axes.
- Ran each proposed feature against the harness. Two failed a minimum-acceptable threshold the clinical director was willing to sign off on. One passed after a redesign of the prompt and retrieval stack (a minimal sketch of the gating logic follows this list).
- Wrote the team’s first internal standard for “What counts as ready to ship a model-assisted feature in a clinical product.”
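For illustration only, here is a minimal sketch of what the gating logic behind such a harness can look like. Everything in it is an assumption: the axis names, the 0.95 thresholds, and the mean-score aggregation are placeholders, not the client's actual axes or sign-off criteria.

```python
"""Minimal sketch of a clinical eval gate. All names, axes, and
thresholds are illustrative assumptions, not the real harness."""

from dataclasses import dataclass
from statistics import mean

# Hypothetical axes; the real harness scored five clinically
# meaningful axes defined with the clinical team.
AXES = ["factual_accuracy", "protocol_fidelity", "omission",
        "dosage_correctness", "readability"]

# Per-axis minimum mean score (0-1) a clinical director would
# sign off on. The 0.95 values are placeholders.
MIN_ACCEPTABLE = {axis: 0.95 for axis in AXES}

@dataclass
class ScoredCase:
    case_id: str
    scores: dict[str, float]  # axis -> clinician-adjudicated score in [0, 1]

def gate(feature_name: str, results: list[ScoredCase]) -> bool:
    """Return True only if every axis clears its threshold
    across the whole annotated case set."""
    passed = True
    for axis in AXES:
        axis_mean = mean(c.scores[axis] for c in results)
        ok = axis_mean >= MIN_ACCEPTABLE[axis]
        print(f"{feature_name} | {axis}: {axis_mean:.3f} "
              f"(min {MIN_ACCEPTABLE[axis]:.2f}) -> {'PASS' if ok else 'FAIL'}")
        passed = passed and ok
    return passed

if __name__ == "__main__":
    # Toy data standing in for clinician-scored model outputs.
    demo = [ScoredCase(f"case-{i}", {a: 0.97 for a in AXES}) for i in range(3)]
    print("SHIP" if gate("treatment-summary", demo) else "DO NOT SHIP")
```

The design choice worth noting in the sketch: gating on every axis independently, rather than on an overall average, keeps one strong axis from masking a clinically unacceptable one.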
Outcome
- One feature shipped to production in Q3, three months later than the original plan, but with a clinical sign-off neither the CTO nor the Chief Medical Officer had before.
- Two features were cut, permanently. Both were replaced with simpler deterministic tooling that cost the company nothing to maintain.
- The eval harness is now the default gate for any new model-touching feature, and the engagement converted to an ongoing advisory retainer.
If this looks like the kind of problem you’re sitting on, an intro call is thirty minutes and costs nothing. We’ll tell you if we’re the right firm. If we’re not, we’ll often know who is.