r/LocalLLaMA · June 20, 2026 · 1 min read

World's Biggest Chat Title Dataset From SupraLabs

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

If you searched for "chat title dataset" on Hugging Face a few days ago, the biggest result you would get was "ogrnz/chat-titles". At SupraLabs we have now curated a filtered dataset of 115K samples, which breaks the record for the biggest chat title dataset, raising it from 10k samples to 115k samples!

SupraLabs

We've released a set of chat title generation datasets that may be useful for instruction tuning, classification-style title generation, or benchmarking small models.

The release includes both a filtered and an unfiltered version:

- Filtered: `SupraLabs/chat-titles-filtered-115K`

- Unfiltered: `SupraLabs/chat-titles-unfiltered-150K`

- Legacy release: `SupraLabs/chat-titles-12K`
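
For anyone who wants to pull one of these releases, here is a minimal sketch using the Hugging Face `datasets` library. The repository IDs are the ones listed above; the helper name `load_chat_titles` and the `"train"` split are assumptions, so check the dataset card for the real split and column names.

```python
# Repository IDs are taken from the post. The split name is an assumption.
REPOS = {
    "filtered": "SupraLabs/chat-titles-filtered-115K",
    "unfiltered": "SupraLabs/chat-titles-unfiltered-150K",
    "legacy": "SupraLabs/chat-titles-12K",
}


def load_chat_titles(variant="filtered", split="train"):
    """Download one release from the Hugging Face Hub (hypothetical helper)."""
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPOS[variant], split=split)
```

Calling `load_chat_titles("unfiltered")` and printing the result shows the actual columns and row count, which is worth doing before writing any training code against them.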

The filtered version is the one we generally recommend for most training runs, while the unfiltered version is provided for anyone who prefers to apply their own cleaning and filtering pipeline.
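
If you go the unfiltered route, a cleaning pass could look something like the sketch below. This is not the pipeline used to build the filtered release. The column names `conversation` and `title`, the word-count thresholds, and the function names are all hypothetical and only illustrate the idea: normalize whitespace, drop empty or oddly sized titles, and remove exact duplicate pairs.

```python
import re


def normalize_title(title):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", title).strip()


def has_reasonable_length(title, min_words=2, max_words=12):
    """Illustrative length check; the thresholds are assumptions."""
    return min_words <= len(title.split()) <= max_words


def clean_rows(rows, text_key="conversation", title_key="title"):
    """Return rows with normalized titles, minus bad or duplicate entries."""
    seen = set()
    cleaned = []
    for row in rows:
        title = normalize_title(row.get(title_key) or "")
        if not has_reasonable_length(title):
            continue  # empty, one-word, or overly long title
        key = (row.get(text_key), title.casefold())
        if key in seen:
            continue  # same conversation and title already kept
        seen.add(key)
        cleaned.append({**row, title_key: title})
    return cleaned
```

The function takes any iterable of dicts, so it can be tried on a small hand-written list before running it over the full dataset.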

We're interested in hearing feedback from anyone who experiments with the datasets, especially regarding data quality, filtering approaches, and title generation performance across different model sizes.

Questions, suggestions, and criticism are all welcome.

submitted by /u/Time-Toe-1276
