Building a monokernel for LLM inference on AMD MI300X - up to 3,300 output tokens/s per request [P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
We built a monokernel that runs the full decode sequence as one GPU-resident program on AMD MI300X, with some neat optimizations. The die topology is central to the result: we map memory access patterns to the physical layout, group compute units by their associated I/O die (IOD), and the hardware runs at its full design performance.
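To make the topology idea concrete, here is a minimal Python sketch of one way to keep neighbouring tiles of work on the same die. It is not our kernel code. It assumes workgroups are dispatched round-robin across the 8 XCDs of an MI300X, and that XCDs 2k and 2k+1 sit on IOD k; that pairing and every function name below are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

NUM_XCDS = 8      # MI300X has 8 accelerator complex dies (XCDs)
XCDS_PER_IOD = 2  # 4 I/O dies (IODs), each carrying 2 XCDs


def physical_xcd(wg_id: int) -> int:
    """XCD a workgroup lands on, ASSUMING round-robin dispatch across XCDs."""
    return wg_id % NUM_XCDS


def iod_of_xcd(xcd: int) -> int:
    """IOD under an XCD. ASSUMPTION: XCDs 2k and 2k+1 share IOD k."""
    return xcd // XCDS_PER_IOD


def remap_workgroup(wg_id: int, num_wgs: int) -> int:
    """Logical tile index chosen so each XCD owns one contiguous range of tiles."""
    if num_wgs % NUM_XCDS != 0:
        raise ValueError("sketch assumes num_wgs is a multiple of NUM_XCDS")
    tiles_per_xcd = num_wgs // NUM_XCDS
    return (wg_id % NUM_XCDS) * tiles_per_xcd + wg_id // NUM_XCDS


def tiles_by_iod(num_wgs: int) -> dict[int, list[int]]:
    """Group the remapped tile indices by the IOD that ends up serving them."""
    groups = defaultdict(list)
    for wg_id in range(num_wgs):
        iod = iod_of_xcd(physical_xcd(wg_id))
        groups[iod].append(remap_workgroup(wg_id, num_wgs))
    return {iod: sorted(tiles) for iod, tiles in groups.items()}
```

With 16 workgroups, `tiles_by_iod(16)` puts tiles 0 to 3 on IOD 0, 4 to 7 on IOD 1, and so on, so adjacent tiles share a die instead of being scattered across all eight XCDs.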
Up to 3,300 output tokens/s per request, batch size 1, no speculative decoding, no quantization, on 8x MI300X.
This preview runs a small 2B coding model, and we plan to support large frontier MoE models in the future.
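For a rough sense of scale, the numbers above can be turned into a per-token time budget and a weight-streaming bandwidth estimate. This is a back-of-the-envelope sketch, not a measurement. It assumes 16-bit weights (the post only says "no quantization"), an even split of the weights across the 8 GPUs, and one full read of the weights per generated token; it ignores KV cache, activations, and inter-GPU communication. The 5.3 TB/s figure is AMD's published peak HBM bandwidth for a single MI300X.

```python
TOKENS_PER_S = 3300    # from the post: output tokens/s per request
NUM_GPUS = 8           # from the post: 8x MI300X
PARAMS = 2e9           # from the post: "2B" model
BYTES_PER_PARAM = 2    # ASSUMPTION: 16-bit weights
PEAK_HBM_BW = 5.3e12   # bytes/s, published peak for one MI300X


def per_token_budget_us(tokens_per_s: float) -> float:
    """Wall-clock time available to produce one token, in microseconds."""
    return 1e6 / tokens_per_s


def weight_bw_per_gpu(params: float, bytes_per_param: float,
                      num_gpus: int, tokens_per_s: float) -> float:
    """Bytes/s each GPU must stream if it reads its weight shard once per token.

    ASSUMPTION: weights are split evenly across GPUs.
    """
    shard_bytes = params * bytes_per_param / num_gpus
    return shard_bytes * tokens_per_s


budget_us = per_token_budget_us(TOKENS_PER_S)
bw_needed = weight_bw_per_gpu(PARAMS, BYTES_PER_PARAM, NUM_GPUS, TOKENS_PER_S)
bw_fraction = bw_needed / PEAK_HBM_BW

print(f"per-token budget: {budget_us:.0f} us")
print(f"weight streaming per GPU: {bw_needed / 1e12:.2f} TB/s "
      f"({bw_fraction:.0%} of peak HBM bandwidth)")
```

Under those assumptions the budget is about 303 microseconds per token, and weight reads alone need about 1.65 TB/s per GPU, roughly a third of peak HBM bandwidth. At that scale, launch overhead and synchronization between many small kernels become a meaningful part of the budget, which is one motivation for a single resident kernel.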
Technical deep dive: https://blog.kog.ai/building-a-single-kernel-latency-optimized-llm-inference-engine-on-amd-mi300x-gpus
Try it: https://playground.kog.ai