Hugging Face Daily Papers · 7 min read

Little Brains, Big Feats: Exploring Compact Language Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Dari Baturova, Elena Bruches, Ivan Chernov, Roman Derunets, Arsenii Fomin, Andrey Kostin
Organization: SibNN (Siberian Neuronets LLC)
arxiv:2606.30062

Published on Jun 29, 2026 · Submitted by Roman Derunets on Jul 1, 2026

Abstract

Small language models can effectively perform retrieval-augmented generation tasks directly on-device without GPU acceleration.

Large language models have dominated recent research, yet small language models remain highly relevant across many domains and receive far less attention. In this study, we investigate how small language models perform in the generation stage of a Retrieval-Augmented Generation (RAG) system. To benchmark these models, we used both open-source and proprietary datasets covering diverse subject areas and question types. Our findings show that a RAG system built on small language models can run directly on-device, without any GPU hardware, in a reasonable time. The experimental code and links to the supplementary materials are available in the GitHub repository: https://github.com/SibNN/SLM-RAG-EVAL.

Community

Roman Derunets (paper author, paper submitter) · about 20 hours ago (edited)

[Figure: overview]
Can small language models be strong enough for practical RAG generation without GPUs?

We benchmark 17 compact language models from 1B to 8B parameters as generators in Russian-language Retrieval-Augmented Generation systems. All candidate models were evaluated as local GGUF variants, including Q4_K_M and Q5_K_M quantized models, under CPU-only inference constraints.
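The generation stage described above can be pictured as "build a prompt from retrieved passages, call a local model, time the call". The sketch below is only an illustration under stated assumptions: the prompt wording, the function names `build_rag_prompt` and `timed_generate`, and the `stub_generator` stand-in are hypothetical and are not taken from the paper or its repository. In a real setup the stub would be replaced by a call into a CPU-only GGUF runtime.

```python
import time
from typing import Callable, List, Tuple


def build_rag_prompt(question: str, contexts: List[str]) -> str:
    """Assemble a grounded prompt from retrieved passages (hypothetical template)."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


def timed_generate(generate: Callable[[str], str], prompt: str) -> Tuple[str, float]:
    """Run any text generator and return its answer plus wall-clock latency in seconds."""
    start = time.perf_counter()
    answer = generate(prompt)
    return answer, time.perf_counter() - start


def stub_generator(prompt: str) -> str:
    """Stand-in for a local quantized model; assumption for illustration only."""
    return "stub answer"


if __name__ == "__main__":
    prompt = build_rag_prompt("What is GGUF?", ["GGUF is a model file format."])
    answer, latency = timed_generate(stub_generator, prompt)
    print(answer, f"{latency:.6f}s")
```

Keeping the generator behind a plain callable makes it possible to swap models or quantization levels without touching the prompt or timing code.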

The evaluation uses a 500-sample benchmark built from five Russian-language QA datasets, including open-source and proprietary domain-specific data. Responses are assessed with a multi-judge LLM-as-a-Judge setup across correctness, answer relevance, faithfulness, context relevance, and latency.
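One way to combine scores from several judges is to average them per metric, first across judges for each sample and then across samples. The sketch below assumes scores in the range 0 to 1 and a plain arithmetic mean; the post does not state how the judges were actually aggregated, so the metric keys and the functions `aggregate_judges` and `aggregate_benchmark` are hypothetical. Latency is measured directly rather than judged, so it is left out here.

```python
from statistics import mean
from typing import Dict, List

# Judge-scored quality metrics named in the post.
METRICS = ("correctness", "answer_relevance", "faithfulness", "context_relevance")


def aggregate_judges(judge_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all judges for a single sample (assumed scheme)."""
    return {m: mean(s[m] for s in judge_scores) for m in METRICS}


def aggregate_benchmark(samples: List[List[Dict[str, float]]]) -> Dict[str, float]:
    """Average the per-sample judge means over the whole benchmark."""
    per_sample = [aggregate_judges(judges) for judges in samples]
    return {m: mean(p[m] for p in per_sample) for m in METRICS}


if __name__ == "__main__":
    # Illustrative scores only, not results from the paper.
    two_judges = [
        {"correctness": 1.0, "answer_relevance": 0.5,
         "faithfulness": 1.0, "context_relevance": 0.0},
        {"correctness": 0.5, "answer_relevance": 0.5,
         "faithfulness": 0.0, "context_relevance": 0.5},
    ]
    print(aggregate_benchmark([two_judges, two_judges]))
```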

A clear pattern emerges: Qwen-family models dominate the top-performing SLM tier in this setting. Qwen3-8B-Q4_K_M achieved the strongest overall SLM quality, reaching 0.72 correctness and 0.83 faithfulness, approaching the GPT-5-mini baseline on correctness. At the same time, Qwen3-4B-Instruct-2507-Q5_K_M provided the best practical quality–latency trade-off, with 0.71 correctness, 0.89 answer relevance, 0.80 faithfulness, and substantially lower CPU latency than the 8B model. Qwen2.5-7B-Instruct-Q4_K_M was also a strong candidate, showing high answer relevance and faithfulness with moderate latency.
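A quality-latency trade-off like the one described above is often examined with a Pareto front: a model is kept if no other model is at least as correct and at least as fast, with a strict improvement in one of the two. The sketch below is a generic illustration, not the selection procedure used in the paper. The candidate names and numbers are placeholders, since the post does not report latency values.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    correctness: float  # higher is better
    latency_s: float    # lower is better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Return the candidates not dominated on (correctness, latency), fastest first."""
    front = []
    for c in cands:
        dominated = any(
            o.correctness >= c.correctness
            and o.latency_s <= c.latency_s
            and (o.correctness > c.correctness or o.latency_s < c.latency_s)
            for o in cands
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.latency_s)


if __name__ == "__main__":
    # Placeholder values for illustration only, not measurements from the paper.
    models = [
        Candidate("model_a", 0.70, 10.0),
        Candidate("model_b", 0.60, 12.0),
        Candidate("model_c", 0.75, 30.0),
        Candidate("model_d", 0.50, 5.0),
    ]
    for m in pareto_front(models):
        print(m.name, m.correctness, m.latency_s)
```

Which point on the front counts as the "best practical" choice still depends on the latency budget of the deployment.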

Our findings suggest that carefully selected quantized SLMs, especially from the Qwen family, can be competitive RAG generators while enabling local, private, and GPU-free deployment. The work is especially relevant for on-device AI, privacy-sensitive applications, edge deployment, and production RAG systems with limited compute budgets.

Accepted to ECML PKDD 2026 Applied Data Science Track. Author’s preprint version.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

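The following papers were recommended by the Semantic Scholar API:

- Deepchecks: Evaluating Retrieval-Augmented Generation (RAG) (2026), https://huggingface.co/papers/2605.14488
- A Systematic Evaluation of Retrieval-Augmented Generation and Language Models for Space Operations (2026), https://huggingface.co/papers/2605.27444
- NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models (2026), https://huggingface.co/papers/2606.27047
- Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version) (2026), https://huggingface.co/papers/2606.05901
- MeMo: Memory as a Model (2026), https://huggingface.co/papers/2605.15156
- OCC-RAG: Optimal Cognitive Core for Faithful Question Answering (2026), https://huggingface.co/papers/2606.00683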


Get this paper in your agent:

hf papers read 2606.30062
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
