World's Biggest Chat Title Dataset From SupraLabs
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
If you searched "Chat title dataset" on Hugging Face a few days ago, the biggest chat title dataset you would find was `ogrnz/chat-titles`. Recently at SupraLabs we curated a 115K filtered dataset, which we believe sets a new record for the largest chat title dataset, going from 10K samples to 115K samples!

We've released a set of chat title generation datasets that may be useful for instruction tuning, classification-style title generation, or benchmarking small models. The release includes both a filtered and an unfiltered version, alongside our earlier legacy release:

- Filtered: `SupraLabs/chat-titles-filtered-115K`
- Unfiltered: `SupraLabs/chat-titles-unfiltered-150K`
- Legacy release: `SupraLabs/chat-titles-12K`

The filtered version is the one we generally recommend for most training runs, while the unfiltered version is provided for anyone who prefers to apply their own cleaning and filtering pipeline.

We're interested in hearing feedback from anyone who experiments with the datasets, especially regarding data quality, filtering approaches, and title generation performance across different model sizes. Questions, suggestions, and criticism are all welcome.
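For anyone starting from the unfiltered release, here is a minimal sketch of what a custom cleaning pass could look like. It is not the filtering SupraLabs used for the 115K version. The column name `title`, the `train` split, the length thresholds, and the helper names `clean_titles` and `load_unfiltered` are all assumptions made for illustration, so check the dataset card for the real schema first.

```python
def clean_titles(rows, title_key="title", min_chars=3, max_chars=80):
    """Normalize whitespace, then drop titles that are too short, too long,
    or case-insensitive duplicates of one already kept.

    `title_key` and the length thresholds are assumptions, not the
    dataset's documented schema or SupraLabs' actual filter.
    """
    seen = set()
    kept = []
    for row in rows:
        # Collapse runs of whitespace and trim the ends.
        title = " ".join(str(row.get(title_key) or "").split())
        if not (min_chars <= len(title) <= max_chars):
            continue
        key = title.casefold()
        if key in seen:
            continue
        seen.add(key)
        kept.append({**row, title_key: title})
    return kept


def load_unfiltered(split="train"):
    """Load the unfiltered release from the Hugging Face Hub.

    Needs `pip install datasets` and network access. The split name
    is an assumption.
    """
    from datasets import load_dataset
    return load_dataset("SupraLabs/chat-titles-unfiltered-150K", split=split)


# Small in-memory sample so the sketch runs without downloading anything.
sample = [
    {"title": "  Fix   Python import error "},
    {"title": "fix python import error"},
    {"title": ""},
    {"title": "Hi"},
    {"title": "Plan a weekend trip to Lisbon"},
]
cleaned = clean_titles(sample)
print([row["title"] for row in cleaned])
```

To run it on the real data, something like `clean_titles(load_unfiltered())` should work once the column name is adjusted to match the dataset.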