Rethinking Problems in AI, Robotics, and Human Behavior Understanding at the Frontier of Wearable Devices

Richard Newcombe

Meta Reality Labs Research, USA

Abstract

Wearable devices are the most intimate sensing and compute platform humans have ever created, and they can bridge the gap between the useful context of our physical lives and the AI revolution. These devices are continuously co-located with the body and privy to our motion, gaze, physiology, voice, and vision, and they may enable the most complete understanding of how and why we do what we do.
The potential for new research within the paradigm of egocentric machine perception spans contextualized AI and robotics, and it provides a new platform for social science research. The application spaces are clear, from AI glasses and home robotics to breakthroughs in personal- and societal-scale analysis across areas ranging from health to education. Understanding how we do what we do lies at the heart of a research frontier spanning these areas.
Yet our research framing in egocentric and embodied AI still largely inherits the assumptions of an earlier era: discrete tasks, short episodic use, and general offline supervised machine learning pipelines or hand-tuned specialized algorithms designed to solve narrow-AI tasks. I will argue that always-on, body-worn sensing, and the resulting necessity of on-device computation to meet latency, power, and privacy requirements, force us to re-pose our problems and look at the new opportunities presented by always-on embodied AI, not merely shrink old ones onto smaller hardware.
I will ground the talk with live demonstrations of Meta’s Project Aria Gen 2 AI glasses research platform. We’ll look at the hardware and software co-design of the sensors, which shows what may be possible with future wearable devices, and at the resulting datasets. We’ll look at custom silicon and architectures designed for running SLAM, eye and hand tracking, and a multitude of other algorithms, ranging from understanding personal state through speaker diarization, heart rate, and emotion analysis to state-of-the-art approaches to 3D semantic scene understanding designed for egocentric data and spanning cloud and real-time, on-device inference. We’ll contrast current and future multimodal AI architectures that attempt to trade off the opportunity for on-device, near-sensor, and on-sensor inference against the power of cloud computing, and look at the implications for privacy and the opportunity for distributed sensing and computation.
We’ll also reframe the central problem for the future of contextual AI and robotics in the wild and at scale. Once sensing goes into the wild (multi-day, multi-person, always-on), dataset scale becomes a constraint rather than an asset: the lived world is too long, too varied, and too personal to label and train against retrospectively. The decisive question is no longer how much we can capture, but what is worth capturing and annotating, given the value of the downstream task.
Given that the downstream task is now nothing less than general artificial and superhuman embodied intelligence, the coupling between downstream utility and the cost of supervision for offline ML is where the hard constraints and the real opportunities lie. We return to the question of whether there is a complete representation of the sensory data manifold, beyond simply capturing the raw data, in the form of a generalized compression.
We find that the opportunity points toward online, always-on learning that adapts continuously and in context, and beyond that toward a compelling possibility: a shared, persistent model of reality, built and kept current by the devices themselves as they move through the world. Ultimately, we see that the frontier of contextual AI, robotics, and human-behavior understanding is not incremental gains in the current generation of offline-trained algorithms, but rethinking what a continuously sensing, continuously learning system is for, given the real-world constraints of always-on, embodied intelligent systems.