ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval
Authors: Siqiao Xue, Chunxue Xu
Published: 2026-06-26 (arXiv:2606.27708)
Abstract
A fashion-specialized vision-language model achieves superior retrieval performance through full fine-tuning with knowledge distillation and weight interpolation, outperforming existing methods on a new benchmark while addressing structural biases in existing datasets.
Adapting a foundation vision-language encoder to a specialized retrieval task creates a fundamental tradeoff: gains on the target distribution come at the cost of the foundation model's broad generalization, and fashion retrieval is a stringent instance of this problem. We present ZooClaw-FashionSigLIP2, a fashion-specialized SigLIP2-base model that resolves this tradeoff with a simple recipe: full fine-tuning with knowledge distillation on curated in-domain data, followed by WiSE-FT (Wortsman et al., 2022) weight interpolation with the base model. This recipe outperforms alternatives based on LoRA, larger backbones (up to 1B parameters), and external training data. Under fair evaluation, ZooClaw-FashionSigLIP2 outperforms all baselines on every benchmark in our suite. In addition, we release ZooClaw-Fashion, a new high-quality fashion retrieval benchmark, and a systematic quality analysis of widely used benchmarks that exposes and mitigates structural biases in their public ground truth. We open-source the model weights and all evaluation artifacts to facilitate future research.
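The weight-interpolation step of the recipe can be illustrated with a minimal sketch. It assumes the standard WiSE-FT formulation, a per-parameter linear blend of the base and fine-tuned weights, theta = (1 - alpha) * theta_base + alpha * theta_finetuned. The function name `interpolate_weights`, the parameter names, and the value of `alpha` are hypothetical; the abstract does not state the mixing coefficient or any implementation detail, so this is not the authors' code.

```python
from typing import Any, Dict, Mapping


def interpolate_weights(
    base: Mapping[str, Any],
    finetuned: Mapping[str, Any],
    alpha: float,
) -> Dict[str, Any]:
    """Blend two checkpoints parameter by parameter (WiSE-FT style).

    alpha = 0.0 returns the base weights, alpha = 1.0 the fine-tuned ones.
    Values may be floats, NumPy arrays, or torch tensors, since only
    scalar multiplication and addition are used.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != finetuned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: (1.0 - alpha) * base[name] + alpha * finetuned[name]
        for name in base
    }


# Toy checkpoints with scalar "parameters" (hypothetical names and values).
base = {"proj.weight": 0.0, "proj.bias": 2.0}
finetuned = {"proj.weight": 1.0, "proj.bias": 4.0}

# alpha = 0.5 is an arbitrary illustrative choice, not a value from the paper.
merged = interpolate_weights(base, finetuned, alpha=0.5)
print(merged)
```

With real models, the same function could be applied to the two `state_dict()` mappings of a PyTorch module and the result loaded back with `load_state_dict`, provided both checkpoints share one architecture and all entries are floating-point tensors. In practice `alpha` would be chosen on held-out data to balance in-domain gains against general retrieval quality.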
Community
Hi! Any chance to try the model? Thanks