Movement is essential to intelligence. The cycle of perceiving and acting is key to our understanding of the physical world around us. Navigating indoors, moving household items, or operating a vehicle in traffic: these seemingly trivial tasks hide a complex choreography of our sensory-motor abilities in plain sight. We accomplish them with an internal model of the world, constantly updated with new information, which lets us predict the future as we navigate safely through our environment and perform tasks.
Pioneers of AI refer to this ability as spatial intelligence [1].
Similarly, contemporary spatial intelligence aims to build world models [2, 12] and to infer the laws of physics directly from multi-modal sensor observations, without any need for hand-crafted rules.
Fig 1: KUKA iiwa (left), TurtleBot 2 (middle), WidowX (right). Source: Open-X Embodiment
Fig 2: Multi-modal data visualization with Nutron (Source: Yaak)
With recent advances in large multi-modal models, AI is on the cusp of moving beyond browsers and into the physical world. This brings its own set of challenges around safety requirements for spatial intelligence.
Unlike vision-language multi-modal models (VLMMs [3]), which are trained on content from the web, models for robotics rely on multi-modal datasets whose interoperability is undercut by a plethora of platforms, sensors, and vendors (Figs. 1 & 2). This requires rethinking the tools and workflows for building and validating spatial intelligence.
Multi-modal datasets
Robotics datasets are coupled deeply to the embodiment they are collected from, and display large variation in sensor configuration and modalities [5]. They capture rich information about the environment as well as expert policies for completing tasks. Table 1 shows a few publicly available robotics datasets, as well as Yaak's (Niro100-HQ), with their observation and action space sizes and sensor configurations (a minimal configuration sketch follows the table).
| #cams (RGB) | #cams (depth) | #Actions | Calibrated | Proprioception | Policy | Control (Hz) |
|---|---|---|---|---|---|---|
| 6 | - | 7 | yes | yes | scripted | 30 |
| 2 | 1 | 7 | no | yes | human (SpaceMouse) | 5 |
| 2 | 1 | 5 | no | no | scripted | 3 |
| 8 | - | 5 | yes | yes | expert | 50 |
| 1 | 1 | 13 | no | yes | human (VR) | 3 |
| 1 | - | 13 | no | yes | expert | 20 |
Table 1: Variation in observation (cameras + proprioception) and action (controls) spaces across datasets
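To make the heterogeneity in Table 1 concrete, here is a minimal sketch of how a per-embodiment sensor and action configuration could be described in code. The class and field names are illustrative assumptions on our part, not Yaak's internal schema or any library's API.

```python
from dataclasses import dataclass

@dataclass
class SensorSuite:
    rgb_cameras: int           # 1-8 RGB cameras across the datasets in Table 1
    depth_cameras: int = 0     # many platforms ship without depth
    calibrated: bool = False   # are intrinsics / extrinsics available?
    proprioception: bool = True

@dataclass
class EmbodimentConfig:
    name: str
    sensors: SensorSuite
    action_dim: int            # size of the control vector (5-13 in Table 1)
    policy_source: str         # "scripted", "human (VR)", "expert", ...
    control_hz: float          # control-loop frequency

# Illustrative entry using the values of one column in Table 1; the name is hypothetical.
vehicle = EmbodimentConfig(
    name="passenger-vehicle",
    sensors=SensorSuite(rgb_cameras=8, calibrated=True),
    action_dim=5,
    policy_source="expert",
    control_hz=50.0,
)
```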
This presents an open question: do we develop a unique model for each embodiment and environment, or train a joint model [9] in which different environments and tasks help the model learn representations that transfer between embodiments? Recent research [5, 6, 7] hints that the latter strategy yields better AI models.
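One common ingredient in such joint training is mapping every embodiment's action vector into a shared, fixed-size space, padding the missing dimensions and masking them out of the loss. The sketch below illustrates that idea under our own assumptions; it is not necessarily the exact recipe used in [5, 6, 7].

```python
import numpy as np

MAX_ACTION_DIM = 13  # largest action space in Table 1

def pad_action(action: np.ndarray, max_dim: int = MAX_ACTION_DIM):
    """Embed a per-embodiment action vector into a shared fixed-size space.

    Returns the padded action and a boolean mask so that padded dimensions
    can be ignored when computing the training loss.
    """
    padded = np.zeros(max_dim, dtype=np.float32)
    mask = np.zeros(max_dim, dtype=bool)
    padded[: action.shape[0]] = action
    mask[: action.shape[0]] = True
    return padded, mask

# A 5-dim vehicle action and a 7-dim arm action now share one representation.
vehicle_action, vehicle_mask = pad_action(np.array([0.1, 0.0, 0.3, 0.0, 1.0]))
arm_action, arm_mask = pad_action(np.random.uniform(-1, 1, size=7))
```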
Applying this paradigm beyond research [9, 10] to enterprise-scale datasets like Yaak's (Fig 3) presented us with new challenges in dataset visualization, search, curation, and dev-tools for end-to-end AI in robotics. Below, we highlight some of the challenges we faced while working with large-scale, multi-modal datasets in the automotive domain.
Recent open-source efforts from the 🤗 team [4] (LeRobot) have lowered the barrier to entry for end-to-end AI in robotics research and education, especially for early adopters of end-to-end spatial intelligence.
Developer tools gap
Over the last 3+ years of our partnership with driving schools, we've collected real-world, multi-modal datasets across 30 cities. Our dataset contains both expert policies (driving instructors) and student policies (learner drivers).
Fig 3: 100K+ hours of multi-modal sensor data and expert / student policies
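Curation over a dataset of this scale typically starts from episode-level metadata. The sketch below shows a hypothetical filter for expert-driven episodes in a given city; the schema, field names, and example values are ours for illustration, not Yaak's actual metadata format.

```python
from dataclasses import dataclass

@dataclass
class EpisodeMeta:
    episode_id: str
    city: str
    policy: str        # "expert" (instructor) or "student" (learner)
    duration_s: float
    num_cameras: int

def select_episodes(index, city: str, policy: str, min_duration_s: float = 60.0):
    """Return episode ids matching simple curation criteria."""
    return [
        ep.episode_id
        for ep in index
        if ep.city == city and ep.policy == policy and ep.duration_s >= min_duration_s
    ]

# Example: expert-driven episodes recorded in Berlin, at least 5 minutes long.
index = [
    EpisodeMeta("ep-0001", "Berlin", "expert", 420.0, 8),
    EpisodeMeta("ep-0002", "Berlin", "student", 310.0, 8),
]
print(select_episodes(index, city="Berlin", policy="expert", min_duration_s=300.0))
```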
As our dataset grew, we quickly ran into multiple bottlenecks before we could even begin working on spatial intelligence. We found that the landscape of developer tools for end-to-end AI in robotics was fragmented, closed-source, or nonexistent. Below, we outline a few of the issues we encountered.
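Multi-modal visualization (Fig 2) is one example of this gap: before any training, camera streams and vehicle controls need to be replayed side by side on a shared timeline. The snippet below is a minimal sketch of that workflow using the open-source rerun Python SDK (rerun.io, listed in the references); the app name, entity paths, and dummy values are our assumptions, and exact API names vary between SDK releases.

```python
import numpy as np
import rerun as rr  # open-source multi-modal log viewer (rerun.io)

rr.init("drive_log_viewer", spawn=True)        # hypothetical application id

# Stand-ins for one decoded sample of a drive log; real data would come from the log reader.
front_cam = np.zeros((480, 640, 3), dtype=np.uint8)   # RGB frame
speed_kmh = 42.0                                       # proprioceptive signal

rr.set_time_sequence("frame", 0)               # align all entities on a shared frame index
rr.log("camera/front", rr.Image(front_cam))    # camera stream
rr.log("vehicle/speed", rr.Scalar(speed_kmh))  # scalar time series (vehicle state)
```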
As end-to-end AI becomes the dominant paradigm in spatial intelligence [7, 11], including Yaak's first use case in the automotive industry, there is an unmet need for enterprise tools and workflows to build it: unified, intuitive, and scalable, with safety validation at their core. In part 2 of this blog series, we'll share how we built our spatial intelligence platform to tackle these challenges (Fig 4).
Fig 4: Attention heat map of Yaak's spatial intelligence trained entirely through self-supervision from observations (RGB cameras) and actions (vehicle controls)
References
Spatial intelligence TED Talk: Fei-Fei Li
VLMM: Vision-Language multimodal models
LeRobot: Making AI for Robotics more accessible with end-to-end learning
Open-X dataset: Open-X Robotics Datasets
Open-X Embodiment: Robotic Learning Datasets and RT-X Models
Multi embodiment: Scaling up learning across many different robot types
rerun.io: Visualize multimodal data over time
Gato: A Generalist Agent
Trajectory transformers: Offline RL as One Big Sequence Modeling Problem
SMART: A Generalized Pre-training Framework for Control Tasks
VISTA: A Generalizable Driving World Model with High Fidelity and Versatile Controllability