Scaling Behavior Cloning Improves Causal Reasoning: An Open Model for Real-Time Video Game Playing

Yuguang Yue, Irakli Salia, Samuel Hunt, Chris Green, Wenzhe Shi, Jonathan J Hunt
Direct footage of Open P2P interacting with commercial games in real time.

Abstract

Behavior cloning is enjoying a resurgence in popularity, as scaling both model and data size has proven to provide a strong starting point for many tasks of interest. In this work, we introduce an open recipe for training a video-game-playing foundation model designed for real-time inference on a consumer GPU. We release all 8,300+ hours of high-quality human gameplay, the training and inference code, and pretrained checkpoints under an open license. We show that our best model is capable of playing a variety of 3D video games at a level competitive with human play.
We use this recipe to systematically examine the scaling laws of behavior cloning and to understand how the model's performance and causal reasoning vary with model and data scale. We first show, in a simple toy problem, that for some types of causal reasoning, increasing both the amount of training data and the depth of the network results in the model learning a more causal policy. We then systematically study how causality varies with the number of parameters (and depth) and training steps in scaled models of up to 1.2 billion parameters, and find scaling results similar to those observed in the toy problem.

AI Agent vs Human Player

Watch our AI agent compete against a real human player in real-time gameplay. Watch to the end to see it exhibit genuinely human-like behavior: switching to its sidearm when it runs out of ammo.

Dataset

We release a large-scale dataset of high-quality human gameplay spanning diverse 3D video games including FPS (DOOM, Quake, Call of Duty, etc.), racing (Need for Speed, etc.), Roblox games, and other popular video games. All gameplay is recorded at 20 FPS by experienced players. Each frame is annotated with keyboard and mouse actions, and text instructions are provided when available.

8,300+
Hours of Gameplay
650M+
Image-Action Pairs
40+
Game Titles
Example gameplay sequence with aligned action and text annotations

Example gameplay sequence with aligned action and text annotations. Keyboard actions are simplified to WASD; arrows indicate mouse movement.

Policy Model

Our model, Pixels2Play (P2P), is an action policy that takes visual observations and optional text instructions as input and outputs keyboard and mouse actions. The model is designed for high-speed, real-time inference (20 Hz) on consumer-level GPUs. The architecture is composed of a backbone transformer and a lightweight action decoder. The backbone transformer is responsible for sophisticated spatio-temporal reasoning over the visual inputs, text inputs, and output actions. The action decoder then predicts the final mouse and keyboard actions from a compressed action prediction token generated by the backbone. This structure accelerates inference by a factor of five while maintaining high prediction accuracy.
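To see why decoding through a small dedicated head is much cheaper than emitting every action token with the backbone, consider a back-of-envelope cost comparison. All parameter counts below are illustrative assumptions, not the released model's actual configuration:

```python
# Illustrative cost comparison (all sizes are assumptions, not the
# paper's actual configuration): decoding every action token with the
# large backbone vs. running the backbone once per frame and expanding
# the single action prediction token with a much smaller action decoder.

def forward_cost(params: int, tokens: int) -> float:
    """Rough transformer forward cost: ~2 * params FLOPs per token."""
    return 2.0 * params * tokens

BACKBONE_PARAMS = 1_200_000_000   # assumed backbone size
DECODER_PARAMS = 30_000_000       # assumed lightweight action decoder
ACTION_TOKENS = 8                 # 4 keyboard + 2 mouse-axis + 2 mouse-button

# Naive: the backbone autoregressively emits all 8 action tokens.
naive = forward_cost(BACKBONE_PARAMS, ACTION_TOKENS)

# Two-stage: backbone emits one action prediction token; the small
# decoder expands it into the 8 concrete action tokens.
two_stage = forward_cost(BACKBONE_PARAMS, 1) + forward_cost(DECODER_PARAMS, ACTION_TOKENS)

print(f"speedup ~ {naive / two_stage:.1f}x")
```

Under these assumed sizes the per-frame cost drops several-fold, which is consistent in spirit with the reported speedup.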

The model employs an EfficientNet-based image encoder to compress visual observations into compact visual tokens, and a Gemma text encoder to compress text instructions into compact text tokens. One ground truth action consists of eight tokens: four representing simultaneous keyboard actions, two representing mouse movement on the x and y axes, and two representing mouse button actions. These ground truth action tokens are provided as input so the model can leverage prior actions to behave more like a human. To maintain the causality of the model, we design a customized attention mask that ensures the action prediction token only attends to prior ground truth action tokens.
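The eight-token action encoding above can be sketched as follows. The key vocabulary, padding scheme, and number of mouse-delta bins are assumptions for illustration; only the 4 + 2 + 2 split comes from the text:

```python
# Sketch of packing one action into 8 discrete tokens, following the
# split described above: 4 keyboard tokens, 2 mouse-movement tokens
# (x, y), and 2 mouse-button tokens. The vocabularies and bin counts
# here are assumptions, not the released model's exact scheme.
from dataclasses import dataclass

N_MOUSE_BINS = 64  # assumed discretization of mouse deltas per axis

@dataclass
class Action:
    keys: list[int]    # up to 4 simultaneous key ids (padded with 0)
    mouse_dx: float    # mouse movement, normalized to [-1, 1]
    mouse_dy: float
    left_button: int   # 0 = up, 1 = down
    right_button: int

def bin_delta(delta: float, n_bins: int = N_MOUSE_BINS) -> int:
    """Map a normalized delta in [-1, 1] to a discrete bin id."""
    delta = max(-1.0, min(1.0, delta))
    return min(int((delta + 1.0) / 2.0 * n_bins), n_bins - 1)

def to_tokens(a: Action) -> list[int]:
    keys = (a.keys + [0] * 4)[:4]                            # 4 keyboard tokens
    mouse = [bin_delta(a.mouse_dx), bin_delta(a.mouse_dy)]   # 2 movement tokens
    buttons = [a.left_button, a.right_button]                # 2 button tokens
    return keys + mouse + buttons                            # 8 tokens total

tokens = to_tokens(Action(keys=[17, 30], mouse_dx=0.25, mouse_dy=-0.1,
                          left_button=1, right_button=0))
assert len(tokens) == 8
```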

Architecture of P2P model

(a) Architecture of P2P. The core policy transformer and action decoder are both decoder-only transformers. Each timestep begins with a text token; since many frames do not carry a text annotation, a default text token is used on those frames. This is followed by one image token from the video frame and then a learnable "reasoning" token that grants the model extra computation. The policy transformer then outputs a single action prediction token. A smaller transformer, the action decoder, auto-regressively transforms and samples the single action prediction token into the full action space. The true action tokens are then input so that the action prediction token at time t+1 can attend to the true action tokens from time t.

Custom attention mask

(b) Attention mask used in our transformer policy (green denotes 1, gray 0). This custom mask ensures that the action prediction token at time t cannot attend to the ground truth action at time t. Note that no other tokens attend to the action prediction token, which stabilizes the training process.
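The mask described in (b) can be constructed programmatically. The per-timestep token ordering below (text, image, reasoning, prediction, then eight ground-truth action tokens) follows the figure caption, but the exact layout is an assumption for illustration:

```python
import numpy as np

# Sketch of the custom attention mask: 1 = may attend, 0 = masked.
# Per-timestep layout (an assumption based on the figure): one text
# token, one image token, one reasoning token, one action prediction
# token, then the 8 ground-truth action tokens.
TEXT, IMAGE, REASON, PRED, TRUE = "text", "image", "reason", "pred", "true"

def timestep_tokens(t: int):
    return [(t, TEXT), (t, IMAGE), (t, REASON), (t, PRED)] + [(t, TRUE)] * 8

def build_mask(n_steps: int) -> np.ndarray:
    toks = [tok for t in range(n_steps) for tok in timestep_tokens(t)]
    n = len(toks)
    mask = np.zeros((n, n), dtype=np.int8)
    for i, (ti, ki) in enumerate(toks):
        for j, (tj, kj) in enumerate(toks):
            if j > i:
                continue  # standard causal (no-future) constraint
            if kj == PRED and i != j:
                continue  # no other token attends to a prediction token
            if ki == PRED and kj == TRUE and tj == ti:
                continue  # prediction at t must not see ground truth at t
                          # (redundant under this ordering, kept explicit)
            mask[i, j] = 1
    return mask

mask = build_mask(2)  # 2 timesteps * 12 tokens each -> 24x24 mask
```

With this construction, the prediction token at time t+1 attends to the true action tokens from time t, while the true action tokens never attend back to any prediction token.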

Evaluation

We present one set of the gameplay videos used for evaluation in Section 4.2 of the paper. We compare the 150M, 300M, 600M, and 1.2B models on nine game checkpoints and use human judgment to evaluate their performance.

Model Size Gameplay Comparison

The 1.2B model shows a consistent improvement over the smaller models in all nine cases.


We then show the instruction-following capacity of the model, using the 1.2B model as an example. We also show the importance of action conditioning, which motivates our action-conditioned model architecture.

Instruction Following

Press the red button No instruction

Action Conditioning

Condition on prior action Do not condition on prior action
By conditioning the model on prior actions, Open P2P maintains temporal coherence and learns more human-like behaviors. In contrast, when the model does not incorporate prior actions, it tends to take new actions continuously, which deviates from natural human behavior. Human players typically act at a lower frequency: for example, they may hold the same direction for several seconds or pause briefly to reassess before acting. The action-conditioned model captures these temporal patterns, leading to smoother control and more realistic, human-like behavior.
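The effect of feeding the previous action back in can be illustrated with a minimal rollout sketch. The `policy` function here is a hypothetical stand-in for the real model, not its actual behavior:

```python
# Minimal sketch of action-conditioned rollout: the previous action is
# fed back as input at the next step, which is what lets a policy
# reproduce low-frequency, human-like behavior (e.g. holding a key for
# many frames). `policy` is a hypothetical stand-in for the real model.
import random

def policy(frame, prev_action):
    # Stand-in: when conditioned on its prior action, the policy mostly
    # repeats it, changing only occasionally -- mimicking how humans
    # hold a direction across frames.
    if prev_action is not None and random.random() < 0.9:
        return prev_action
    return random.choice(["W", "A", "S", "D", "idle"])

def rollout(frames, conditioned=True):
    actions, prev = [], None
    for frame in frames:
        a = policy(frame, prev if conditioned else None)
        actions.append(a)
        prev = a
    return actions

def switches(actions):
    """Count frame-to-frame action changes."""
    return sum(a != b for a, b in zip(actions, actions[1:]))

random.seed(0)
coherent = rollout(range(200), conditioned=True)
jittery = rollout(range(200), conditioned=False)
```

The conditioned rollout switches actions far less often than the unconditioned one, mirroring the smoother control described above.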
We also demonstrate that the model is capable of following text instructions. As shown in the video, the player must press three consecutive red buttons to open the door and enter the next room. However, the act of pressing a red button can be subtle: many annotators press the button while turning around rather than executing a clearly distinguishable button press. This makes it difficult for the action model to infer that pressing the buttons is necessary to complete the maze, since it primarily mimics the annotators' trajectories.
Consequently, without any text instruction, the model presses all three buttons in only about 20% of the trials.
In contrast, when provided with text instructions that specify the objective, the model can better infer the intent of the trajectory and learn a more effective policy. As shown, when the model is prompted to press the red buttons, it successfully presses all three buttons in approximately 80% of the trials, representing a substantial improvement over the model without text instruction.
These results highlight the importance of text instructions for learning effective action policies. However, incorporating text instructions requires additional text encoding, which reduces the model’s inference speed (FPS).

Causality and Scaling

A central challenge in behavior cloning is causal confusion—the tendency for models to rely on spurious correlations (like UI elements or previous actions) rather than the true environmental causes of an action. We investigate the relationship between model scale and causal reasoning through both controlled toy environments and large-scale experiments.

1. Causality in a Controlled Toy Environment

Toy environment visualization

(a) The agent must distinguish between the causal feature (obstacle) and a correlated distractor (previous brake light).

Learning speed results

(b) Learning curves demonstrating that increased network depth accelerates the discovery of causally correct policies.

In this controlled setup, an optimal linear policy exists that can solve the task. However, we found that standard gradient descent fails to find this solution in linear models. By increasing network depth and adding non-linearity, the optimization process is better able to overcome spurious correlations. These results suggest that increased capacity and depth do not just improve performance—they actively facilitate the discovery of true causal signals during training.
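The confound in the toy setup can be made concrete with a small data-generating sketch. The sticky obstacle dynamics and probabilities below are assumptions for illustration, not the paper's exact environment:

```python
# Sketch of the causal-confusion setup: the expert brakes exactly when
# an obstacle is present, and obstacles persist across frames, so the
# *previous* brake light (a non-causal distractor) is highly predictive
# of the next expert action in offline data. Environment details here
# are assumptions for illustration.
import random

random.seed(1)
obstacle = 0      # causal feature: is an obstacle present?
prev_brake = 0    # distractor: the brake light shown last frame
steps = []
for _ in range(10_000):
    if random.random() < 0.05:       # sticky dynamics: obstacle rarely toggles
        obstacle = 1 - obstacle
    expert_action = obstacle         # expert brakes iff obstacle is present
    steps.append((obstacle, prev_brake, expert_action))
    prev_brake = expert_action

# Offline accuracy of two single-feature "policies":
causal_acc = sum(a == o for o, _, a in steps) / len(steps)      # obstacle -> action
distractor_acc = sum(a == p for _, p, a in steps) / len(steps)  # prev brake -> action
```

Offline, copying the previous brake light is almost as accurate as reading the obstacle, so a learner minimizing imitation loss can latch onto the distractor; in closed-loop play such a policy never initiates braking. This is the spurious correlation that deeper, non-linear models escape more reliably during optimization.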


2. Empirical Evidence in Large-Scale Environments

Causality score increases with model size and dataset size

Figure 4: Causal reasoning scores improve significantly as both model capacity and training data volume increase.

In our full-scale experiments, we observed an empirical phenomenon that mirrors the findings from the toy example.

We found that increasing model parameters and dataset volume naturally mitigates causal confusion. Even without explicit architectural interventions to address causality, larger models demonstrate a superior ability to distinguish between essential environmental cues and non-causal distractors. This suggests that scale provides a practical, robust solution to the causal challenges inherent in generalist gaming agents.

BibTeX

@misc{yue2026scaling,
      title={Scaling Behavior Cloning Improves Causal Reasoning: An Open Model for Real-Time Video Game Playing}, 
      author={Yuguang Yue and Irakli Salia and Samuel Hunt and Chris Green and Wenzhe Shi and Jonathan J. Hunt},
      year={2026},
      eprint={2601.04575},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.04575}
}