I benchmarked 8 LLMs for medical scribing. Hallucinations were rare; omissions need attention.
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I ran a small benchmark of LLMs for medical scribing. The reason: most discussion around AI scribe safety focuses on hallucinations. That matters, but in generated notes I kept seeing another problem: models often leave out clinically relevant details from the conversation. So I evaluated 8 frontier models on 300 synthetic doctor-patient dialogues. Each model wrote a SOAP note for every dialogue, and a 4-model judge panel then scored the notes for:
The main result: across 2,400 generated notes (8 models × 300 dialogues), the models produced:
So in this benchmark, omissions were much more common than hallucinations. Some other things that stood out:
The repo includes the transcripts, outputs, scoring scripts, and leaderboard (see the comments for the link). The next thing I’m interested in is running the same evaluation on models that can run locally. Separately, we also used this benchmark internally for product development. The obvious follow-up was: if a cheap/open model writes well but misses safety facts, can a transcript-grounded wrapper recover those omissions and flag unsupported claims? That direction looks promising. In particular, it makes models like DeepSeek much more interesting: strong prose, low cost, and potentially usable in safer clinical-note pipelines when paired with a safety layer. The earlier evaluation (V1) post can be found here.
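To make the wrapper idea concrete, here is a minimal sketch of what a transcript-grounded check could look like. It is not the code from the repo or from our product: the function names (`tokens`, `supported`, `check_note`), the token-overlap matching, the 0.8 threshold, and the sample transcript are all assumptions for illustration. A real safety layer would need a proper clinical matcher, or a model call, in place of the crude word overlap used here.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; a crude stand-in for a real clinical matcher."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def supported(claim: str, source: str, threshold: float = 0.8) -> bool:
    """True if at least `threshold` of the claim's tokens appear in the source.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return True
    overlap = len(claim_tokens & tokens(source)) / len(claim_tokens)
    return overlap >= threshold


def check_note(safety_facts: list[str], note_sentences: list[str], transcript: str) -> dict:
    """Compare a generated note against its transcript in both directions.

    - omissions: safety facts from the transcript that the note does not cover
    - unsupported: note sentences that the transcript does not back up
    """
    note_text = " ".join(note_sentences)
    omissions = [fact for fact in safety_facts if not supported(fact, note_text)]
    unsupported = [s for s in note_sentences if not supported(s, transcript)]
    return {"omissions": omissions, "unsupported": unsupported}


# Hypothetical example data, invented for this sketch.
transcript = (
    "Patient reports chest pain for two days. "
    "Patient is allergic to penicillin. "
    "Patient takes metformin daily."
)
safety_facts = ["allergic to penicillin", "chest pain for two days"]
note_sentences = [
    "Patient reports chest pain for two days.",
    "Patient takes warfarin daily.",
]

result = check_note(safety_facts, note_sentences, transcript)
print(result)
```

In this toy case the note drops the penicillin allergy (an omission) and mentions warfarin, which never appears in the transcript (an unsupported claim). The point is only that both failure modes can be checked against the same transcript; how the safety facts get extracted in the first place is left out of the sketch.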