Behavior cloning is enjoying a resurgence in popularity, as scaling both model and data size provides a strong starting point for many tasks of interest. In this work, we introduce an open recipe for training a video-game-playing foundation model designed for real-time inference on a consumer GPU. We release all 8,300+ hours of high-quality human gameplay, training and inference code, and pretrained checkpoints under an open license. We show that our best model can play a variety of 3D video games at a level competitive with human play.
We use this recipe to systematically examine the scaling behavior of behavior cloning: how the model's performance and causal reasoning vary with model and data scale. We first show, in a simple toy problem, that for some types of causal reasoning, increasing both the amount of training data and the depth of the network leads the model to learn a more causal policy. We then systematically study how causality varies with the number of parameters (and depth) and training steps in models of up to 1.2 billion parameters, and we find scaling results similar to those in the toy problem.
Watch our AI agent compete against a real human player in real-time gameplay. Watch to the end to see it exhibit genuinely human-like behaviour: switching to its sidearm when it runs out of ammo.
We release a large-scale dataset of high-quality human gameplay spanning diverse 3D video games including FPS (DOOM, Quake, Call of Duty, etc.), racing (Need for Speed, etc.), Roblox games, and other popular video games. All gameplay is recorded at 20 FPS by experienced players. Each frame is annotated with keyboard and mouse actions, and text instructions are provided when available.
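As a concrete illustration of what one annotated frame might look like, here is a minimal sketch of a per-frame record. The field names and types are our own assumptions for illustration, not the released dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FrameRecord:
    """Illustrative schema for one annotated gameplay frame.
    Field names are assumptions, not the dataset's actual format."""
    timestamp_ms: int          # capture time of this frame
    keys_down: list            # keyboard keys held during this frame
    mouse_dx: float            # mouse movement since the last frame (x axis)
    mouse_dy: float            # mouse movement since the last frame (y axis)
    mouse_buttons: list        # mouse buttons held during this frame
    instruction: Optional[str] = None  # text annotation, when available

# At 20 FPS, one second of gameplay corresponds to 20 such records.
```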
Example gameplay sequence with aligned action and text annotations. Keyboard actions are simplified to WASD; arrows indicate mouse movement.
Our model, Pixels2Play (P2P), is an action policy that takes visual observations and optional text instructions as input and outputs keyboard and mouse actions. The model is designed for high-speed, real-time inference (20 Hz) on consumer-level GPUs. The architecture consists of a backbone transformer and a lightweight action decoder. The backbone transformer performs the spatio-temporal reasoning over the visual inputs, text inputs, and prior actions; the action decoder then predicts the final mouse and keyboard actions from a compressed action prediction token generated by the backbone. This structure accelerates inference by a factor of 5 while maintaining high prediction accuracy.
The model employs an EfficientNet-based image encoder to compress visual observations into compact visual tokens, and a Gemma text encoder to compress text instructions into compact text tokens. One ground-truth action consists of eight tokens: four representing simultaneous keyboard actions, two representing mouse movement on the x and y axes, and two representing mouse button actions. These ground-truth action tokens are provided as input so the model can leverage prior actions to behave more like a human. To preserve the causality of the model, we design a custom attention mask that ensures the action prediction token attends only to prior ground-truth action tokens.
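The eight-token action encoding described above can be sketched as follows. The key vocabulary, the number of mouse-movement bins, and the clipping range are all illustrative assumptions; the paper's actual discretization may differ.

```python
def encode_action(keys_down, mouse_dx, mouse_dy, buttons):
    """Encode one timestep's action as 8 discrete tokens:
    4 keyboard tokens, 2 mouse-movement tokens (x, y), 2 button tokens.
    Vocabulary and bin counts are illustrative assumptions."""
    KEY_VOCAB = {"none": 0, "w": 1, "a": 2, "s": 3, "d": 4, "space": 5}

    # Up to 4 simultaneous key presses, padded with "none".
    keys = (list(keys_down) + ["none"] * 4)[:4]
    key_tokens = [KEY_VOCAB[k] for k in keys]

    def bin_motion(delta, n_bins=21, max_px=100):
        # Discretize continuous mouse movement into n_bins buckets,
        # clipping to [-max_px, max_px]; bucket (n_bins-1)//2 is "no motion".
        clipped = max(-max_px, min(max_px, delta))
        return round((clipped + max_px) / (2 * max_px) * (n_bins - 1))

    mouse_tokens = [bin_motion(mouse_dx), bin_motion(mouse_dy)]
    button_tokens = [int(buttons.get("left", False)),
                     int(buttons.get("right", False))]
    return key_tokens + mouse_tokens + button_tokens
```

For example, holding W and D with no mouse motion yields two key tokens, two padding tokens, two centered motion bins, and two zero button tokens.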
(a) Architecture of P2P. The core policy transformer and action decoder are both decoder-only transformers. Each timestep begins with a text token; since many frames do not have a text annotation, a default text token is used on those frames. This is followed by one image token from the video frame and then a learnable "reasoning" token that grants the model extra computation. The policy transformer then outputs a single action prediction token. A smaller transformer, the action decoder, auto-regressively expands and samples this single action prediction token into the full action space. The true action tokens are then fed back in, so that the action prediction token at time t+1 can attend to the true action tokens from time t.
(b) Attention mask used in our transformer policy (green denotes 1, gray 0). This custom mask ensures the action prediction token at time t cannot attend to the ground-truth action at time t. Note that no other tokens attend to the action prediction token, which stabilizes training.
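The masking scheme from the figure can be sketched as a causal mask with the action-prediction column zeroed out for all later queries. The per-timestep token layout here (text, image, reasoning, action prediction, then eight ground-truth action tokens, twelve in total) is our reading of the figure, not a verified spec.

```python
import numpy as np

def build_mask(T, tokens_per_step=12, pred_idx=3):
    """Build a (T*tokens_per_step)^2 boolean attention mask.
    Assumed layout per step: [text, image, reasoning, action_pred,
    8 ground-truth action tokens]. Starts from a standard causal
    (lower-triangular) mask, then removes the action prediction token
    as a key for all later tokens, so nothing attends to it."""
    L = T * tokens_per_step
    mask = np.tril(np.ones((L, L), dtype=bool))  # standard causal mask
    for t in range(T):
        col = t * tokens_per_step + pred_idx
        mask[col + 1:, col] = False  # later tokens never see action_pred
    return mask
```

Because the ground-truth action tokens for step t come after the action prediction token in sequence order, the causal part of the mask already prevents the prediction token from seeing the same-step ground truth; the extra column-zeroing handles the "no other tokens attend to it" constraint.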
We present one set of the gameplay videos used for evaluation in Section 4.2 of the paper. We compare the 150M, 300M, 600M, and 1.2B models on nine game checkpoints and use human judgement to evaluate performance.
The 1.2B model shows a consistent improvement over the smaller models in all nine cases.
A central challenge in behavior cloning is causal confusion—the tendency for models to rely on spurious correlations (like UI elements or previous actions) rather than the true environmental causes of an action. We investigate the relationship between model scale and causal reasoning through both controlled toy environments and large-scale experiments.
(a) The agent must distinguish between the causal feature (obstacle) and a correlated distractor (previous brake light).
(b) Learning curves demonstrating that increased network depth accelerates the discovery of causally correct policies.
In this controlled setup, an optimal linear policy exists that solves the task. However, we found that standard gradient descent fails to find this solution in linear models. Increasing network depth and adding non-linearity allows the optimization process to better overcome spurious correlations. These results suggest that increased capacity and depth do not merely improve performance; they actively facilitate the discovery of true causal signals during training.
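The structure of this toy problem can be sketched as a tiny synthetic dataset. The specifics (persistence probability, binary features, expert rule) are illustrative assumptions rather than the paper's exact environment; the point is that the distractor (previous brake action) is highly, but not perfectly, correlated with the causal feature (obstacle).

```python
import random

def make_dataset(n_steps=10_000, persist=0.9, seed=0):
    """Sketch of a toy causal-confusion dataset (illustrative, not the
    paper's exact setup). The obstacle persists across steps with
    probability `persist`, so the previous brake action is highly
    correlated with the correct current action. A behavior-cloned
    policy can fit most of the data by copying prev_brake instead of
    attending to the obstacle."""
    rng = random.Random(seed)
    data, obstacle, prev_brake = [], 0, 0
    for _ in range(n_steps):
        # Obstacle state persists; occasionally it is resampled.
        if rng.random() > persist:
            obstacle = rng.randint(0, 1)
        brake = obstacle  # expert brakes iff an obstacle is present
        data.append(((obstacle, prev_brake), brake))
        prev_brake = brake
    return data
```

A policy that simply outputs `prev_brake` matches the expert on the large majority of steps but fails exactly when the obstacle state changes, which is when braking matters; a causal policy that reads the obstacle is correct everywhere.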
Figure 4: Causal reasoning scores improve significantly as both model capacity and training data volume increase.
In our full-scale experiments, we observed an empirical phenomenon that mirrors the findings from the toy example.
We found that increasing model parameters and dataset volume naturally mitigates causal confusion. Even without explicit architectural interventions to address causality, larger models demonstrate a superior ability to distinguish between essential environmental cues and non-causal distractors. This suggests that scale provides a practical, robust solution to the causal challenges inherent in generalist gaming agents.
@misc{yue2026scaling,
title={Scaling Behavior Cloning Improves Causal Reasoning: An Open Model for Real-Time Video Game Playing},
author={Yuguang Yue and Irakli Salia and Samuel Hunt and Chris Green and Wenzhe Shi and Jonathan J. Hunt},
year={2026},
eprint={2601.04575},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.04575}
}