High E2E latency on fine-tuned Gemma 4 26B despite low TTFT [R]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
I recently fine-tuned a Gemma 4 26B model and am seeing surprisingly high end-to-end latency, even though its effective inference footprint is much smaller (it behaves roughly like a ~4B model during serving).
Current setup:
- Model: Gemma 4 26B (fine-tuned)
- Engine: vLLM
- Quantization: FP8
- Hardware: H100
Observed latency:
- TTFT: ~100–300 ms
- E2E latency: ~3–5 seconds
The TTFT seems reasonable, but the overall generation latency feels disproportionately high for the effective serving size.
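As a back-of-envelope check, E2E latency can be split into TTFT plus per-token decode time. The sketch below does that arithmetic; the output token count is not stated in this post, so the 400 used here is purely a hypothetical placeholder, and `decode_breakdown` is an illustrative helper, not part of vLLM.

```python
def decode_breakdown(ttft_s, e2e_s, output_tokens):
    """Split E2E latency into time-to-first-token and per-token decode time."""
    if output_tokens < 2:
        raise ValueError("need at least 2 output tokens to estimate per-token time")
    # Everything after the first token is decode time.
    decode_s = e2e_s - ttft_s
    # The first token is covered by TTFT, so decode spans (output_tokens - 1) steps.
    tpot_s = decode_s / (output_tokens - 1)
    return {
        "decode_s": decode_s,
        "tpot_ms": tpot_s * 1000.0,
        "tokens_per_s": 1.0 / tpot_s,
    }


# Hypothetical numbers: 0.2 s TTFT and 4.0 s E2E are within the ranges above,
# but 400 output tokens is an assumption, not a measurement.
stats = decode_breakdown(ttft_s=0.2, e2e_s=4.0, output_tokens=400)
print(stats)
```

If the implied per-token time looks reasonable for the hardware, the E2E number is mostly a function of output length rather than a serving problem.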
I already experimented with vLLM’s n-gram speculative decoding, but honestly didn’t see meaningful gains.
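One possible reason for the lack of gains: n-gram speculation only proposes tokens when the recent output repeats something already in the context. The sketch below is a simplified illustration of that lookup idea, not vLLM's actual implementation; the function name and the `n` and `k` parameters are hypothetical.

```python
def ngram_propose(tokens, n=3, k=4):
    """Propose up to k draft tokens by finding the last n tokens earlier in the context."""
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan right to left for the most recent earlier occurrence of the suffix.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            # Draft tokens are whatever followed that earlier occurrence.
            return tokens[start + n:start + n + k]
    return []


# Context that repeats itself: the lookup finds a continuation to propose.
repetitive = [1, 2, 3, 4, 5, 1, 2, 3]
# Context with no repetition: nothing to propose, so no speedup is possible.
novel = [1, 2, 3, 4, 5, 6, 7, 8]

print(ngram_propose(repetitive, n=3, k=2))
print(ngram_propose(novel, n=3, k=2))
```

Workloads such as summarization or code editing tend to hit this lookup often; free-form generation usually does not.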
Now I’m considering more serious speculative decoding approaches:
- EAGLE / Medusa-style methods
- Draft-model-based speculative decoding
- Possibly training a smaller Gemma draft model
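To compare these options before investing in one, the standard analysis of speculative decoding gives an expected speedup from three quantities: the per-token acceptance rate `alpha`, the number of drafted tokens `k`, and the cost of one draft step relative to one target step `c`. The sketch below assumes acceptances are independent with a constant rate, which real workloads only approximate; the example values are hypothetical, not measurements.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens produced per target-model forward pass with k drafted tokens."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha, k, c):
    """Expected wall-clock speedup; c is draft cost per token relative to the target."""
    # Each step costs k draft passes (k * c) plus one target pass (1).
    return expected_tokens_per_step(alpha, k) / (k * c + 1.0)


# Hypothetical scenarios: a cheap draft with decent acceptance vs. poor acceptance.
good = expected_speedup(alpha=0.8, k=4, c=0.1)
poor = expected_speedup(alpha=0.3, k=4, c=0.1)
print(f"alpha=0.8: {good:.2f}x, alpha=0.3: {poor:.2f}x")
```

With a low acceptance rate the drafting overhead can cancel the benefit entirely, so measuring acceptance on real traffic matters more than the choice of method.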
Curious to hear from others who’ve worked with Gemma 4 or large distilled/fine-tuned models:
- Is this kind of latency expected?
- What actually moved the needle for you?
- Any bottlenecks I should investigate first before going deeper into speculative decoding?
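One cheap first check is to look at the gaps between streamed tokens rather than only the aggregate E2E number, since a few long stalls point to scheduling or queueing while uniformly slow gaps point to decode speed. The helper below is a hypothetical sketch that assumes token arrival timestamps (in seconds) have already been recorded on the client side.

```python
import statistics


def summarize_stream(request_start, token_times):
    """Summarize latency from a request start time and per-token arrival times."""
    if not token_times:
        raise ValueError("token_times must contain at least one timestamp")
    # Gaps between consecutive tokens (inter-token latency).
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft_s": token_times[0] - request_start,
        "e2e_s": token_times[-1] - request_start,
        "median_gap_s": statistics.median(gaps) if gaps else 0.0,
        "max_gap_s": max(gaps) if gaps else 0.0,
    }


# Hypothetical trace: steady 10 ms gaps followed by one long stall.
summary = summarize_stream(0.0, [0.2, 0.21, 0.22, 0.5])
print(summary)
```

A large difference between the median and maximum gap suggests looking at batching, preemption, or client-side overhead before changing the decoding method.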
Would love to hear experiences, benchmarks, or even horror stories :))