r/LocalLLaMA · July 2, 2026 · 5 min read

Local benchmarks with a RTX 3090 - Qwen3.6 27b vs Ornith


Hey folks. I've been frustrated by how difficult it is to get an idea of how good each new model (or fine-tune) is, and I've not been satisfied with the one-off "draw a pelican riding a bike" style tests that we often fall back on. New models or model variants that can run locally on my RTX 3090 almost never get proper benchmark coverage from anyone but the folks who make them. Lately, I wanted to see how Ornith 35b compared to Qwen3.6 27b.

So I've been playing around with inspect-ai and a bunch of the standard benchmarks available in its inspect-evals package. I'd like to be able to run a complete set of benchmarks on a new model overnight and have a broad indication of how it compares in the morning. I'm not there yet, but I wanted to share the benchmarks I've run so far comparing Qwen3.6 27b (Q4_K_M), Gemma4 26B A4B QAT (Q4_0), and Ornith1.0 35B MoE (Q4_K_M). I'm still running on LM Studio at the moment, so I ran the benchmarks below on models provided by lmstudio-community, except Ornith, which I got from the deepreinforce-ai account.

TLDR

I tested all three on benchmarks with a limited number of samples (100) and aggressive limits. I expected Ornith to be nearly as good as Qwen3.6 27b at coding tasks, but not quite. Since it's a fine-tune, I also expected it to be worse on general knowledge and grounding. The final picture wasn't quite that clear. It was as good as or better than Qwen 27b in a little under half of cases, and worse the rest of the time. It claims to be best at agentic tasks, though, and I haven't managed to run most of the agentic benchmarks successfully.

Specifics of each benchmark follow, with some notes and my thoughts on how painful it has been trying to run these benchmarks locally.

General Knowledge and Reasoning

Qwen takes the best (or joint best) score in 4 / 6 benchmarks.

Ornith takes the best (or joint best) in 3 / 6 benchmarks.

Something about the MMLU benchmark didn't like Gemma. It timed out in a lot of cases, but I haven't determined why. It could have got stuck looping endlessly, or it could have been something to do with how I configured the tasks. Take the Gemma scores on these cases with a pinch of salt.

    # Static knowledge and reasoning.
    success, logs = eval_set(
        tasks=[
            gsm8k(),
            ifeval(),
            arc_easy(),
            arc_challenge(),
            mmlu_0_shot(cot=True),
            mmlu_5_shot(cot=True),
        ],
        log_dir="logs-know",
        **default_config,
        max_tokens=20000,
    )

Benchmark	Gemma4 26b	Qwen3.6 27b	Ornith1.0 35b
gsm8k	0.93	0.96	0.9
ifeval	0.93	0.95	0.91
arc_easy	1.0	1.0	0.98
arc_challenge	0.97	0.97	0.98
mmlu_0_shot	0.54	0.88	0.91
mmlu_5_shot	0.5	0.88	0.88
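
The "best (or joint best)" counts quoted above can be checked directly against the table. Here is a minimal sketch of that tally; the scores are copied from the table, while the helper name `count_best` is made up for illustration and is not part of inspect-ai.

```python
# Scores copied from the table above, in column order:
# (Gemma4 26b, Qwen3.6 27b, Ornith1.0 35b).
MODELS = ("Gemma4 26b", "Qwen3.6 27b", "Ornith1.0 35b")
SCORES = {
    "gsm8k": (0.93, 0.96, 0.90),
    "ifeval": (0.93, 0.95, 0.91),
    "arc_easy": (1.0, 1.0, 0.98),
    "arc_challenge": (0.97, 0.97, 0.98),
    "mmlu_0_shot": (0.54, 0.88, 0.91),
    "mmlu_5_shot": (0.50, 0.88, 0.88),
}


def count_best(scores, models):
    """Count, per model, the benchmarks where it has the best or joint-best score."""
    wins = {model: 0 for model in models}
    for row in scores.values():
        top = max(row)
        for model, score in zip(models, row):
            if score == top:  # ties count for every model that shares the top score
                wins[model] += 1
    return wins


wins = count_best(SCORES, MODELS)
for model in MODELS:
    print(f"{model}: {wins[model]} / {len(SCORES)}")
```

Run on this table, it gives Qwen 4 / 6 and Ornith 3 / 6, matching the counts above, with Gemma on 1 / 6 (its only joint best is arc_easy).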

Grounding and Recall

Ornith takes the lead on these, but needle in a haystack (NIAH) had to be limited to a max context of 100000, because prompt processing times for Qwen made running a fair test at higher contexts prohibitively time-consuming. I need to find more convenient benchmarks for local testing, or simply re-run them when I have more time to spend.

    # Grounding and recall
    success, logs = eval_set(
        tasks=[
            drop(),
            niah(max_context=100000),
        ],
        log_dir="logs-ground",
        **default_config,
        max_tokens=40000,
    )

Benchmark	Gemma4 26b	Qwen3.6 27b	Ornith1.0 35b
drop	0.932	0.947	0.952
niah	10.0	10.0	10.0

Code generation and data science

This is where I expected Ornith to shine. It matched Qwen in 2 tasks out of 4, but Qwen had the best score in every case. The scicode score was particularly disappointing. One positive over Gemma here: to get scicode working with Gemma I had to impose very heavy limits, because it looped infinitely on most samples. Ornith didn't have that problem and showed less infinite-looping behavior.

    # Code generation and data science
    success, logs = eval_set(
        tasks=[
            ds1000(),
            class_eval(),
            scicode(),
            ifevalcode(samples_per_language=tasks_limit_per_eval // 10),  # 10 languages
        ],
        log_dir="logs-code",
        **default_config,
    )

Benchmark	Gemma4 26b	Qwen3.6 27b	Ornith1.0 35b
DS-1000	0.34	0.66	0.48
class_eval	0.97	0.97	0.97
scicode	4.615	10.769	1.538
ifevalcode	0.03	0.00	0.03

Notes

Honestly, running these has been a bit of a nightmare. Gemma, in particular, had a tendency to loop infinitely. I had to re-configure and re-run the benchmarks with heavy limits to stop it from running forever. Additionally, prompt processing time on some of the tests was particularly bad. Changing some of these configs meant re-running the benchmarks from scratch for the comparison against the other models to be fair.

My aim was to be able to run a full suite of tests overnight, so I can have an idea of a model's capabilities in the morning. In reality, ifevalcode took 18 hours to run on its own with only 100 samples for Qwen3.6 27b. Here are some things I configured:

100 samples for each benchmark max.
Max token limits to stop looping. This really needed to be different for each benchmark, as some genuinely seemed to need larger reasoning blocks.
Initially I set timeouts, but this really screwed things up while I was running multiple samples at once. One heavy task would use up all the resources while another timed out without having been attempted.
1 task at a time, 1 connection max, 1 sandbox (docker instance) at a time.
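
For anyone wanting to reproduce this, the limits above could plausibly be collected into the `default_config` dict that the `eval_set` calls unpack. The post does not show the actual dict, so this is a hypothetical reconstruction: the key names are my assumption about how these limits map onto inspect-ai's eval options, and `max_tokens_by_group` and `config_for` are made-up helpers. Only the numbers come from the post.

```python
# Hypothetical reconstruction: the post never shows default_config itself.
tasks_limit_per_eval = 100  # "100 samples for each benchmark max"

default_config = {
    "limit": tasks_limit_per_eval,  # cap on samples per benchmark
    "max_tasks": 1,                 # 1 task at a time
    "max_connections": 1,           # 1 connection to the model server
    "max_sandboxes": 1,             # 1 sandbox (Docker instance) at a time
}

# Per-group token caps, taken from the eval_set calls shown in the post.
# The code group ("logs-code") was run without an explicit max_tokens.
max_tokens_by_group = {
    "logs-know": 20000,
    "logs-ground": 40000,
}


def config_for(log_dir):
    """Merge the shared limits with the token cap (if any) for one benchmark group."""
    config = {"log_dir": log_dir, **default_config}
    if log_dir in max_tokens_by_group:
        config["max_tokens"] = max_tokens_by_group[log_dir]
    return config


print(config_for("logs-know"))
print(config_for("logs-code"))
```

Keeping the shared limits in one dict and the token caps in another makes it easier to see which settings changed between runs, which matters when a config change forces a full re-run for a fair comparison.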

I'm going to try switching these out and being more specific with my limits. I'm going to add sample shuffling (with a shared seed between models) and reduce the number of samples for some of the trickier tests. An eval_set in inspect-ai lets you continue tests that stalled or that you had to cancel. In reality, though, whenever I needed to change configurations to get a benchmark working, I often had to re-run the full set.
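
The idea behind a shared seed is that every model gets scored on the same shuffled subset, so differences in score aren't down to which samples happened to be drawn. A minimal sketch in plain Python, independent of inspect-ai; `pick_samples`, `SHARED_SEED`, and the integer sample ids are all stand-ins invented for illustration.

```python
import random


def pick_samples(sample_ids, n, seed):
    """Return n sample ids chosen by a seeded shuffle, so the choice is reproducible."""
    ids = sorted(sample_ids)          # normalise the order before shuffling
    random.Random(seed).shuffle(ids)  # private RNG, global random state untouched
    return ids[:n]


SHARED_SEED = 1234           # hypothetical value; any fixed integer works
all_ids = list(range(1000))  # stand-in for a benchmark's sample ids

subset_for_model_a = pick_samples(all_ids, 100, SHARED_SEED)
subset_for_model_b = pick_samples(all_ids, 100, SHARED_SEED)

# Same seed, same input, same subset for both models.
print(subset_for_model_a == subset_for_model_b)  # prints True
```

Sorting before shuffling means the result depends only on the seed and the set of ids, not on the order the dataset happened to load in.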

I may post some more once I have a more reliable benchmark setup. I hope some of you find this useful.

submitted by /u/Aggressive_Aspect436