We posit that robotics will benefit from large, autoregressive, end-to-end models. In this blog, we extend the idea of predicting the next word given a partial sentence to predicting the next action given the context (e.g. images and sensor data). We found that a naive formulation doesn't work: GPTs require modality-specific changes to learn effective cross-modal attention and embeddings.
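To make the formulation concrete, here is a minimal sketch of what next-action prediction over a multimodal context might look like. It assumes per-timestep image features, sensor readings, and a continuous action space; all names, dimensions, and the per-timestep token interleaving are illustrative choices, not the actual architecture discussed in this post.

```python
import torch
import torch.nn as nn

class NextActionGPT(nn.Module):
    """GPT-style causal decoder that predicts the next action from context.

    Each modality gets its own projection into a shared embedding space
    (one possible "modality-specific change"); a causal transformer then
    attends across the fused, interleaved token sequence.
    """

    def __init__(self, img_feat_dim=512, sensor_dim=32, action_dim=7,
                 d_model=256, n_heads=4, n_layers=4, max_len=192):
        super().__init__()
        # Modality-specific embeddings: one projection per input type.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.sensor_proj = nn.Linear(sensor_dim, d_model)
        self.action_proj = nn.Linear(action_dim, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, img_feats, sensor_feats, past_actions):
        # img_feats:    (B, T, img_feat_dim)  per-timestep image features
        # sensor_feats: (B, T, sensor_dim)    per-timestep sensor readings
        # past_actions: (B, T, action_dim)    previously executed actions
        B, T, _ = img_feats.shape
        # Interleave modalities per timestep: [img_t, sensor_t, action_t, ...]
        tokens = torch.stack([
            self.img_proj(img_feats),
            self.sensor_proj(sensor_feats),
            self.action_proj(past_actions),
        ], dim=2).reshape(B, 3 * T, -1)
        tokens = tokens + self.pos_emb(
            torch.arange(3 * T, device=tokens.device))

        # Causal mask so each token attends only to earlier context.
        mask = nn.Transformer.generate_square_subsequent_mask(3 * T)
        h = self.backbone(tokens, mask=mask.to(tokens.device))
        # Predict action_t from the sensor token at step t, which has seen
        # all context up to sensor_t but not action_t itself.
        return self.action_head(h[:, 1::3, :])

model = NextActionGPT()
img = torch.randn(2, 8, 512)       # batch of 2, horizon of 8 steps
sensors = torch.randn(2, 8, 32)
actions = torch.randn(2, 8, 7)
pred = model(img, sensors, actions)  # (2, 8, 7): one action per step
```

The analogy to language modeling is direct: where a language GPT shifts a token sequence by one position, here the target at each step is the action actually taken, supervised against the prediction made from the tokens that precede it.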