Do VLMs in production still use fixed-patch ViTs for their vision capabilities? [D]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
For some time now, the research community has proposed vision tokenization schemes that appear more efficient and effective than fixed patches. Is there any indication that the big players' models use non-fixed-patch tokenization?
I imagine not, and I'm trying to think why:
- marginal gains?
- pipelines needing a fixed number of tokens per image upfront for efficiency reasons (or because of even stricter constraints)?
- scaling laws for input-adaptive patching are not well understood, so big players won't bet on it?
Or am I simply wrong, and under the hood all the big players already do dynamic tokenization for vision?
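To make the fixed-token-count point concrete, here is a rough sketch of the arithmetic. It compares a fixed-patch ViT that resizes every image to one square resolution against a tokenizer that keeps each image's own resolution. The function names, the 224-pixel input size, and the 16-pixel patch size are illustrative assumptions, not taken from any specific production model. Note that the second case still uses fixed-size patches; it is only the simplest way the token count can vary per image, and content-adaptive schemes would vary it further.

```python
import math


def fixed_patch_tokens(image_size: int = 224, patch: int = 16) -> int:
    """Token count when every image is resized to image_size x image_size.

    Hypothetical helper: the count is the same for all inputs.
    """
    return (image_size // patch) ** 2


def native_res_tokens(height: int, width: int, patch: int = 16) -> int:
    """Token count when the image keeps its own resolution.

    Hypothetical helper: assumes the image is padded up to the next
    multiple of the patch size along each axis.
    """
    return math.ceil(height / patch) * math.ceil(width / patch)


# Example (height, width) pairs, chosen for illustration only.
images = [(224, 224), (480, 640), (1080, 1920)]

fixed = [fixed_patch_tokens() for _ in images]
native = [native_res_tokens(h, w) for h, w in images]

for (h, w), f, n in zip(images, fixed, native):
    print(f"{h}x{w}: fixed={f} tokens, native-resolution={n} tokens")
```

With a fixed count, the sequence length per image is known before the image is seen, which simplifies batching and memory planning. With a variable count, the pipeline has to pad, pack, or bucket sequences, which may be part of the efficiency concern raised in the second bullet.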