In current ML systems, where is the main bottleneck: dataset quality or model architecture improvements? [D]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
A lot of recent progress in ML appears to come from scaling existing architectures rather than introducing fundamentally new ones.
At the same time, there’s increasing emphasis on dataset quality, curation, and synthetic data pipelines.
In practice, I’m trying to understand how this tradeoff looks in real systems:
How much effort is typically spent on data cleaning and filtering vs. model design?
Do dataset quality improvements still yield larger gains than architectural changes?
How is synthetic data affecting training stability and generalization in practice?
In many applied settings, it seems like data constraints become the limiting factor before architecture does, but I’m not sure if that’s broadly true across domains.