Quality evaluation of quants with limited time or tokens
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
About a year ago, people were publishing a lot of benchmarks about various quants of models. I understand that it is not really feasible with the current (and other welcome) frequent releases of new models, but on the other side, it may be still useful to know locally whether q3 of this model is better than q6 of that model.
I've checked a few benchmarks, but it seems they are versatile, and the models may generate millions of tokens, which, with a 300b+ moe model on a home setup of 10-20 t/s seems to be not feasible to benchmark. I'd rather have a benchmark where I could limit the focus to the tasks that provide the most predictive power (e.g. tasks that may pass on q6 but may fail on q5).
Of course there is always the DIY approach, but I am wondering if people have already tackled this problem somehow. I'd even settle if there were an automatic way to describe that q5 is roughly 95.56% of q8, or something along those lines.
[link] [comments]
More from r/LocalLLaMA
-
What's in your RAG?
Jul 2
-
Palantir CEO rages against closed models
Jul 2
-
A cheap trick for reliable structured output: feed the validation error back into the retry
Jul 2
-
SenseNova-U1-8b-MoT-Infographic-V2 (released yesterday) - An open source SOTA beast for infographic design and image editing.
Jul 2
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.