From Molmo to Muse Spark: Data-centric recipes for state-of-the-art multimodal models
Aniruddha Kembhavi
Meta, UK
Abstract
The dominant recipe for multimodal AI -- scaling a large model on web-scraped data -- is hitting diminishing returns. In this talk I will argue that the next frontier lies not in bigger models but in smarter data. Molmo and Pixmo demonstrated that a carefully crafted dataset can propel an open model to parity with the best proprietary systems across vision-language benchmarks. More recently, Muse Spark showed that a rebuilt data and training pipeline can match frontier multimodal capabilities. Both efforts underscore a common principle: data quality and diversity are the primary drivers of performance. I will then look ahead to the emerging paradigm of agentic data collection -- systems that autonomously identify gaps in their training distribution and curate new examples -- and discuss how this closes the loop between model capability and data quality, fundamentally changing how we build multimodal AI.
Computer Vision for Spatial and Physical Intelligence