650+ Apache-2.0 biomedical NER/de-id models that run on-device in MLX. Same fp32 weights, identical outputs: the clinical NER models run 30-40x faster than PyTorch-CPU on a 3-year-old M3 Max. Repro inside.
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Disclosure first: I maintain OpenMed, so read this with that bias. I'm posting the numbers with the full methodology and a runnable script so you can reproduce them or tear them apart. I'm here for the next couple of hours to answer methodology questions.

What it is: an open-source clinical/biomedical NER project. 1,000+ models on Hugging Face, all Apache 2.0.

What's new: 410 new MLX builds, bringing the MLX total to 650+. They run on macOS via MLX and on iPhone/iPad via OpenMedKit (an open Swift package). The NER paper is arXiv 2508.01630 (SOTA across 12 public datasets, per-dataset tables inside, judge them yourself).

On-device speed, methodology first. Same model, MLX on Apple Silicon vs PyTorch on CPU, same fp32 precision, byte-identical entity outputs (parity-checked). On a 3-year-old MacBook Pro M3 Max, the clinical NER models run 30-40x faster on MLX: a 434M biomedical NER model takes 27 ms (MLX) vs ~1080 ms (CPU) at fp32, with the same weights and identical entities. The reason is architectural, not a precision trick: these are deberta-v2 models whose disentangled attention is O(n^2) in sequence length and very slow on CPU, while the Apple GPU handles it easily. The speedup is input- and model-dependent, so a smaller model on short text is single-digit x, not 30x.

The second clip in the video is the PII de-identification model redacting on-device. The point there is privacy: identifiers are stripped locally and nothing leaves the machine.
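For anyone who wants to sanity-check the methodology before running the real script, here is a minimal sketch of the "parity first, then timing" idea described above. It is not the OpenMed benchmark: `slow_backend` and `fast_backend` are hypothetical stand-ins for the PyTorch-CPU and MLX inference calls, and the warm-up count, iteration count, and use of the median are my assumptions, not the post's settings.

```python
import re
import statistics
import time
from typing import Callable, Dict, List, Tuple

# An entity is (start_offset, end_offset, label). This shape is an assumption.
Entity = Tuple[int, int, str]
Backend = Callable[[str], List[Entity]]


def median_latency_ms(run: Backend, text: str, warmup: int = 3, iters: int = 20) -> float:
    """Median wall-clock latency of run(text) in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        run(text)  # warm caches / lazy initialisation before timing
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def compare_backends(baseline: Backend, candidate: Backend, texts: List[str],
                     warmup: int = 3, iters: int = 20) -> Dict[str, float]:
    """Refuse to report timings unless both backends return identical entities."""
    for text in texts:
        if baseline(text) != candidate(text):
            raise AssertionError(f"entity mismatch on input: {text!r}")
    base = statistics.median([median_latency_ms(baseline, t, warmup, iters) for t in texts])
    cand = statistics.median([median_latency_ms(candidate, t, warmup, iters) for t in texts])
    return {"baseline_ms": base, "candidate_ms": cand, "speedup": speedup(base, cand)}


# Toy stand-ins (hypothetical): a regex "model", with a sleep to fake the slow path.
def fast_backend(text: str) -> List[Entity]:
    return [(m.start(), m.end(), "DRUG") for m in re.finditer(r"aspirin", text)]


def slow_backend(text: str) -> List[Entity]:
    time.sleep(0.002)  # simulated extra latency, not a real measurement
    return fast_backend(text)


if __name__ == "__main__":
    print(compare_backends(slow_backend, fast_backend, ["Patient took aspirin daily."]))
    print(speedup(1080, 27))  # the post's 434M example: 1080 ms / 27 ms = 40.0
```

To try it against real models, swap the two toy backends for functions that run the same text through each runtime and return the decoded entities. The parity check runs before any timing so that a fast but wrong backend never produces a headline number.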
iPhone note: I'm not claiming 36 ms is a phone number; it's from the M3 Max. The phone story is "these run via OpenMedKit". Everything's public: models (Apache 2.0, HF), SDK (Apache 2.0, GitHub), paper (arXiv 2508.01630). Ask me anything on the parity check, the dtype story, or the dataset numbers.