Stealthy Good
Case Study

Oncology Software: Evaluation harness before features.

An oncology software company's product team wanted to ship a treatment-summary feature. The clinical team wanted to know what happens when the model is wrong.

Sector: Clinical software
Engagement: Sprint → retainer
Status: Active
Situation

The product roadmap had three AI features scheduled for a Q2 release. The clinical advisory board had questions the roadmap could not answer: How do we know when the model is wrong? How often is that? Who gets called when a summary misrepresents a trial protocol?

Engineering had a working prototype. Nobody had a clear answer to the clinical team’s questions.

What we did
  1. Recommended pausing all three features until the team could measure them.
  2. Designed and built an evaluation harness: a set of 1,400 historical oncology cases, annotated by two independent clinicians, scored along five clinically meaningful axes.
  3. Ran each proposed feature against the harness. Two failed to meet the minimum acceptable threshold the clinical director was willing to sign off on. One passed after a redesign of the prompt and retrieval stack.
  4. Wrote the team’s first internal standard for “What counts as ready to ship a model-assisted feature in a clinical product.”
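
The gating step described above can be sketched roughly as follows. This is a minimal illustration, not the harness that was actually built: the axis names (`axis_1` to `axis_5`), the `CaseScore` type, the 0-to-1 score scale, the per-axis mean as the aggregate, and the threshold values are all assumptions made for the example. The case study states only that there were five clinically meaningful axes and a minimum acceptable threshold.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical placeholder names: the case study does not name the five axes.
AXES = ["axis_1", "axis_2", "axis_3", "axis_4", "axis_5"]


@dataclass
class CaseScore:
    """Scores for one model output on one historical case (assumed 0.0 to 1.0)."""
    case_id: str
    scores: Dict[str, float]  # axis name -> score


def axis_means(results: List[CaseScore]) -> Dict[str, float]:
    """Average score per axis across all evaluated cases."""
    return {axis: mean(r.scores[axis] for r in results) for axis in AXES}


def gate(
    results: List[CaseScore], thresholds: Dict[str, float]
) -> Tuple[bool, Dict[str, float]]:
    """Pass only if every axis mean meets its threshold.

    Returns (passed, failures), where failures maps each failing axis
    to its observed mean.
    """
    means = axis_means(results)
    failures = {a: m for a, m in means.items() if m < thresholds[a]}
    return (not failures, failures)


# Illustrative data only: two made-up cases, one weak on axis_3.
example_results = [
    CaseScore("case-001", {a: 1.0 for a in AXES}),
    CaseScore("case-002", {**{a: 1.0 for a in AXES}, "axis_3": 0.5}),
]
example_thresholds = {a: 0.8 for a in AXES}  # assumed values, not from the source

passed, failures = gate(example_results, example_thresholds)
```

With this made-up data the feature is blocked, because one axis falls below its threshold even though the other four are perfect. Requiring every axis to pass, rather than averaging across axes, is a design choice for the sketch: it keeps a strong score on one axis from masking a weak score on another.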
Outcome
  • One feature shipped to production in Q3, three months later than originally planned, and with a clinical sign-off that neither the CTO nor the Chief Medical Officer had before.
  • Two features were cut, permanently. Both were replaced with simpler deterministic tooling that cost the company nothing to maintain.
  • The evaluation harness is now the default gate for any new model-touching feature. Retainer converted to ongoing advisory.