r/LocalLLaMA · · 3 min read

Open benchmark: how well can multimodal LLMs read a calendar week-view from a screenshot? Humans ~99%, Q4 local models.....

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Some backstory

I've been working on my local agent (openclaw), and I wanted to give it the skill to reconstruct calendar entries from a photo of the screen. I couldn't get at the calendar through an API (long story), so a photo was the only low-friction way to export the data.

What should have been an easy "skill building exercise" endet as a frustrating problem hunt. My agent went wrong more often than I expected: times off by 15-30 minutes, all entries 1h long no matter what, sometimes duplicate entries on neighboring days. When I complained about it to ChatGPT and Claude, they both kept telling me that reading a calendar is harder than humans assume. That peaked my interrest. I wanted to know if I could fix it with a different prompt, other tools or another quantization. I wanted to know where models actually stand today, and since I run things locally, I especially wanted to know how much accuracy I lose to quantization.

Before I knew it, I was building a comparison tool in form of a benchmark to measure the differences.

What is VCCB

VCCB (Visual Calendar Comprehension Benchmark) shows a model a fixed image of a calendar week view and asks it to extract every event as structured data: title, start, end/duration, overlaps, recurrence, all-day/multi-day spans. The same week is rendered in three desktop clients (Outlook, HCL Notes, Thunderbird - those are the ones I had access to) and shot three ways each — a clean screenshot, a frontal photo, and a ~15° perspective photo — so nine images per run.

Scores are self-normalized per client, because the rendering is lossy in different ways (Notes and Thunderbird enforce a minimum block height while Outlook uses an accent bar to show a short event's true start and length). I use a calendar app dependent "maximum extraction target" against which the results are scored. A flawless read is 100% regardless of client, and the perspective shots measure how much a model loses to capture distortion. Full method, scorer and answer key are in the repo. The images, prompts, scripts, the scorer and all results are open.

What I'm seeing so far (small sample, take with salt)

A rough four-class picture from my own runs:

  1. Humans: ~99% (±1%), and about the same on the perspective-distorted photos (eye+brain still has the edge)
  2. Frontier hosted models (e.g. Opus): ~80-85%
  3. Mid-tier (ChatGPT free): ~75% (±5)
  4. My local models — and, Claude Haiku: ~38-58%

That gap between human level and the local AI level is the reason I'm posting. I only have a handful of data points, and the question I care about most, "how much quantization actually costs you here", I can't answer on my own.

The ask to you

If you run models locally: please run the benchmark with whatever model and quant you actually use, and upload your submission. It's nine images, one isolated run per image, fill in a template, then open a PR or an issue. I score it centrally against the reference and it lands on the public leaderboard with your exact model and prompt attached, so anyone can reproduce it.
Btw.: The scoring and all is included in the package, so you can build a leaderboard of your LLMs, too. But I it would be great if you would share the data.
In theory you could instruct an agent to do the process, but I'm not so shure if the harness would share infos between runs and therefore effect the results.

I'm especially after quant comparisons of the same model (Q4 vs Q6 vs Q8, different GGUF builds, etc.) and the smaller VLMs people run day to day. Even one or two images helps — partial submissions are fine.

You can find the Repo here: https://github.com/KevinFleischer/vccbenchmark

Happy to answer anything about the design or the scoring in the comments, and if you hit a bug running it, tell me and I'll fix it.

submitted by /u/Gold-Drag9242
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA