Custom image encoder [P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
Hello, I would like to know whether building my own image encoder would be a good idea instead of using models like CLIP, SigLIP/SigLIP2, or DINO.
My use case is video frame classification.
My pipeline is the following: the client sends me a video stream, which I sample at one frame every 1 or 2 seconds and group into segments of 15 frames (30 seconds at the 2-second rate). I compute embeddings for these frames and send them to a small custom Transformer (1.5M to 9M parameters).
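To make the segmenting step concrete, here is a minimal sketch of grouping per-frame embeddings into 15-frame segments before they go to the Transformer. The function name `segments`, the non-overlapping windows, and dropping a trailing partial segment are all assumptions for illustration; the post does not say how segment boundaries are handled.

```python
from typing import Iterable, Iterator, List, Sequence

SEGMENT_LEN = 15  # frames per segment, as described in the post


def segments(
    embeddings: Iterable[Sequence[float]],
    segment_len: int = SEGMENT_LEN,
) -> Iterator[List[Sequence[float]]]:
    """Group a stream of per-frame embeddings into non-overlapping segments.

    Assumption: windows do not overlap and a trailing partial segment is
    dropped. A real pipeline might pad it or use a sliding window instead.
    """
    buffer: List[Sequence[float]] = []
    for emb in embeddings:
        buffer.append(emb)
        if len(buffer) == segment_len:
            yield buffer
            buffer = []


# Toy stream: 40 fake 4-dimensional embeddings standing in for encoder output.
fake_stream = ([float(i)] * 4 for i in range(40))
batches = list(segments(fake_stream))
```

With 40 incoming frames this yields two full segments and discards the last 10 frames, so each item in `batches` has the fixed length the Transformer expects.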
This works very well on GPU. However, I have two main constraints: processing speed and deployment on small CPU-only devices.
A CLIP-S0 encoder processes around 10 images per second on 4 vCPUs. I would like to replace it with my own encoder trained on my dataset (a few million images), with only a few million parameters and around 4 to 5 labels.
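For a rough sense of what 10 images per second means for this pipeline, the arithmetic can be sketched as below. The helper `realtime_budget` is hypothetical, and it only counts encoder time: it ignores video decoding, preprocessing, and the Transformer itself, so the real headroom would be smaller.

```python
def realtime_budget(encoder_fps: float, sample_period_s: float, segment_len: int = 15):
    """Back-of-the-envelope budget for one CPU device (encoder cost only).

    encoder_fps      -- images the encoder embeds per second (10 in the post)
    sample_period_s  -- seconds between sampled frames (1 or 2 in the post)
    segment_len      -- frames per segment (15 in the post)
    """
    segment_span_s = segment_len * sample_period_s  # video time covered by one segment
    embed_time_s = segment_len / encoder_fps        # time to embed one segment
    max_streams = encoder_fps * sample_period_s     # streams one device can keep up with
    return segment_span_s, embed_time_s, max_streams


# Numbers from the post: 10 images/s, sampling every 2 s or every 1 s.
slow_sampling = realtime_budget(encoder_fps=10, sample_period_s=2)
fast_sampling = realtime_budget(encoder_fps=10, sample_period_s=1)
```

Under these assumptions, embedding one 15-frame segment takes about 1.5 seconds, and a single 4-vCPU device keeps up with roughly 20 streams at one frame every 2 seconds or 10 streams at one frame per second. Whether a custom encoder is worth training depends on how far those numbers are from the target.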
My question is whether this is a good approach, and whether it would improve both embedding generation speed and the accuracy of my Transformer model.
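One way to answer the speed half of the question empirically is to time each candidate encoder on the target device with the same frames. The sketch below is a generic timing harness using only the standard library; `images_per_second` and `dummy_encode` are hypothetical names, and the dummy encoder is a stand-in that should be replaced by the real model's inference call.

```python
import time
from typing import Callable, Sequence


def images_per_second(
    encode: Callable[[Sequence[float]], object],
    frames: Sequence[Sequence[float]],
    warmup: int = 2,
    repeats: int = 5,
) -> float:
    """Measure single-image encoder throughput on the current machine."""
    # Warm-up passes so one-time costs (caches, lazy init) are not timed.
    for _ in range(warmup):
        for frame in frames:
            encode(frame)

    start = time.perf_counter()
    for _ in range(repeats):
        for frame in frames:
            encode(frame)
    elapsed = time.perf_counter() - start

    # Guard against a zero reading on very coarse clocks.
    return repeats * len(frames) / max(elapsed, 1e-9)


def dummy_encode(frame: Sequence[float]) -> list:
    """Placeholder encoder: returns a 1-dimensional 'embedding' (the mean)."""
    return [sum(frame) / len(frame)]


# Toy frames: 15 flat lists standing in for preprocessed images.
toy_frames = [[float(i)] * 1024 for i in range(15)]
throughput = images_per_second(dummy_encode, toy_frames)
```

Running the same harness on the current encoder and on a candidate custom encoder, on the actual CPU-only device, gives a like-for-like images-per-second comparison. The accuracy half of the question still needs a held-out evaluation of the downstream Transformer on each encoder's embeddings.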