Hugging Face Daily Papers · June 30, 2026 · 4 min read

ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Adapting a foundation vision-language encoder to a specialized retrieval task creates a fundamental tradeoff: gains on the target distribution come at the cost of the foundation model's broad generalization, and fashion retrieval is a stringent instance of this problem. We present ZooClaw-FashionSigLIP2, a fashion-specialized SigLIP2-base model that resolves this tradeoff with a simple recipe: full fine-tuning with knowledge distillation on curated in-domain data, followed by WiSE-FT (Wortsman et al., 2022) weight interpolation with the base model. This recipe outperforms alternatives based on LoRA, larger backbones (up to 1B parameters), or external training data. Under fair evaluation, ZooClaw-FashionSigLIP2 outperforms all baselines on every benchmark in our suite. In addition, we release ZooClaw-Fashion, a new high-quality fashion retrieval benchmark, and a systematic quality analysis of widely used benchmarks that exposes and mitigates structural biases in their public ground truth. We open-source the model weights and all evaluation artifacts to facilitate future research.
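The weight-interpolation step named in the abstract can be illustrated with a minimal sketch. WiSE-FT, as generally described, linearly interpolates each parameter of the base model with the corresponding parameter of the fine-tuned model. The function name `interpolate_weights`, the mixing coefficient `alpha`, and the toy parameter values below are assumptions for illustration; the abstract does not state which coefficient the authors use or whether every parameter is interpolated.

```python
from typing import Any, Dict


def interpolate_weights(
    base: Dict[str, Any], finetuned: Dict[str, Any], alpha: float
) -> Dict[str, Any]:
    """Linearly interpolate two parameter dictionaries, WiSE-FT style.

    alpha = 0.0 returns the base weights, alpha = 1.0 the fine-tuned ones.
    Values may be floats, NumPy arrays, or tensors: anything that supports
    scalar multiplication and addition.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != finetuned.keys():
        raise KeyError("both models must expose the same parameter names")
    return {
        name: (1.0 - alpha) * base[name] + alpha * finetuned[name]
        for name in base
    }


# Toy example with scalar "parameters" (hypothetical values).
base_weights = {"w": 1.0, "b": 0.0}
finetuned_weights = {"w": 3.0, "b": 2.0}
merged = interpolate_weights(base_weights, finetuned_weights, alpha=0.25)
print(merged)
```

In practice the dictionaries would be the two models' parameter dictionaries, and `alpha` would be chosen on held-out data to balance in-domain gains against the base model's general retrieval quality.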


arxiv:2606.27708


Published on Jun 26 · Submitted by Ziyin Zhang on Jun 30


Authors: Siqiao Xue, Chunxue Xu

Abstract

A fashion-specialized vision-language model achieves superior retrieval performance through full fine-tuning with knowledge distillation and weight interpolation, outperforming existing methods on a new benchmark while addressing structural biases in existing datasets.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct