Generative AI

NVIDIA AI Researchers Release NitroGen: An Open Foundation Model for Generalist Gaming Agents

The NVIDIA AI research team released NitroGen, an open foundation model for generalist gaming agents that learns to play commercial games directly from pixels and gamepad actions, using online video at scale. NitroGen is trained on 40,000 hours of gameplay across over 1,000 games and ships with an open dataset, a universal simulator, and a pre-trained policy.

Internet-scale video action dataset

The NitroGen pipeline starts from publicly available gameplay videos that include an input overlay, for example a streamed gamepad visualization placed in the corner of the screen. The research team collected 71,000 hours of raw video with such overlays, then applied quality filtering based on action density, which retains 55 percent of the data, about 40,000 hours, covering more than 1,000 games.

The filtered dataset contains 38,739 videos from 818 creators, with a long-tailed genre distribution. There are 846 games with more than 1 hour of data, 91 games with more than 100 hours, and 15 games with more than 1,000 hours each. Action RPGs account for 34.9 percent of the hours, platformers 18.4 percent, and action-adventure titles 9.2 percent, with the rest spread across roguelikes, racing, and other genres.

To recover frame-level actions from raw streams, NitroGen uses a three-step extraction pipeline. First, a template matching module localizes the controller overlay using around 300 controller templates. For each video, the system samples 25 frames, matches SIFT and XFeat features between those frames and the templates, and estimates an affine transformation, accepting a detection when at least 20 inliers support the same transform. This fixes the controller overlay's position and scale for all frames.
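As a rough illustration of the geometric check in this step, here is a minimal numpy sketch that fits an affine transform to matched keypoint pairs and counts reprojection inliers against the paper's 20-inlier threshold. The SIFT/XFeat feature extraction itself is omitted, and the plain least-squares fit is a simplification of the actual matching procedure:

```python
import numpy as np

def estimate_affine(src, dst):
    """Least-squares affine transform mapping src -> dst.

    src, dst: (N, 2) arrays of matched keypoint coordinates.
    Returns a 2x3 matrix A such that dst ~= [src | 1] @ A.T.
    """
    n = src.shape[0]
    X = np.hstack([src, np.ones((n, 1))])        # (N, 3) homogeneous coords
    W, *_ = np.linalg.lstsq(X, dst, rcond=None)  # (3, 2) solution
    return W.T                                   # (2, 3) affine matrix

def count_inliers(src, dst, A, tol=3.0):
    """Count matches whose reprojection error is under tol pixels."""
    X = np.hstack([src, np.ones((src.shape[0], 1))])
    proj = X @ A.T
    err = np.linalg.norm(proj - dst, axis=1)
    return int((err < tol).sum())

# Toy demo: a pure translation should make every match an inlier.
rng = np.random.default_rng(0)
src = rng.uniform(0, 100, size=(30, 2))
dst = src + np.array([12.0, -5.0])
A = estimate_affine(src, dst)
inliers = count_inliers(src, dst, A)
accepted = inliers >= 20    # NitroGen's 20-inlier acceptance threshold
```

In practice a RANSAC loop over candidate matches would replace the single least-squares fit; the inlier-count acceptance rule is the same.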

Second, a SegFormer-based hybrid classification and segmentation model parses the controller state. The model takes two sequential, spatially cropped frames and outputs joystick positions on an 11 by 11 grid plus binary button states. It was trained on 8 million synthetic images rendered with varying controller templates, opacities, sizes, and compression settings, using AdamW with a learning rate of 0.0001, a weight decay of 0.1, and a batch size of 256.
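The article does not document how the 11 by 11 grid classification is decoded into a continuous stick position; a plausible sketch, assuming the grid cells map uniformly onto [−1, 1] with the center cell at (0, 0):

```python
import numpy as np

def decode_joystick(grid_logits):
    """Decode an 11x11 classification grid into a joystick position.

    grid_logits: (11, 11) array of class logits. The argmax cell is
    mapped to (x, y) in [-1, 1]; cell (0, 0) -> (-1, -1), cell (5, 5)
    -> (0, 0), cell (10, 10) -> (1, 1). This uniform mapping is an
    assumption, not a documented detail of the pipeline.
    """
    row, col = np.unravel_index(np.argmax(grid_logits), grid_logits.shape)
    y = row / 5.0 - 1.0
    x = col / 5.0 - 1.0
    return x, y

# Demo: a peak at row 5, column 8 decodes to a right-leaning stick.
logits = np.full((11, 11), -10.0)
logits[5, 8] = 3.0
x, y = decode_joystick(logits)
```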

Third, the pipeline smooths the joystick trajectories and filters out low-activity segments. Joystick coordinates are normalized to the range −1.0 to 1.0 using the 99th percentile of absolute x and y values to reduce the influence of outliers. Chunks in which fewer than 50 percent of time steps have non-zero actions are removed, which keeps the policy from collapsing to predicting a null action during training.
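The normalization and activity filter can be sketched as follows; the chunk shape and per-axis scaling are illustrative assumptions:

```python
import numpy as np

def normalize_stick(coords):
    """Scale joystick coords by the 99th percentile of absolute values
    per axis, then clip to [-1, 1] so outliers cannot dominate."""
    scale = np.percentile(np.abs(coords), 99, axis=0)
    scale = np.where(scale > 0, scale, 1.0)   # guard against all-zero axes
    return np.clip(coords / scale, -1.0, 1.0)

def keep_chunk(actions, min_active=0.5):
    """Keep a chunk only if at least 50% of its time steps contain a
    non-zero action, mirroring NitroGen's activity filter."""
    active = np.any(actions != 0, axis=1)
    return bool(active.mean() >= min_active)

# Demo chunks: (100 time steps, 20 action dims), hypothetical shapes.
idle = np.zeros((100, 20)); idle[:40, 0] = 1.0   # 40% active -> dropped
busy = np.zeros((100, 20)); busy[:60, 0] = 1.0   # 60% active -> kept
scaled = normalize_stick(np.array([[0.0, 0.0], [2.0, 1.0]]))
```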

A separate benchmark with ground-truth controller logs shows that joystick predictions reach an average R² of 0.84 and per-frame button accuracy up to 0.96 across major controller families such as Xbox and PlayStation. This indicates the automatic annotations are accurate enough for large-scale behavior cloning.
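For reference, the R² reported here is the standard coefficient of determination for the joystick regression, which can be computed as:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy example: a slightly biased prediction of a sine trajectory
# still scores close to 1.0.
t = np.linspace(0, 2 * np.pi, 200)
y_true = np.sin(t)
y_pred = y_true + 0.05
r2 = r_squared(y_true, y_pred)
```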

Universal simulator and benchmark for many games

NitroGen includes a universal simulator that wraps commercial Windows games in a Gymnasium-compatible interface. The wrapper hooks the game engine's system clock to control simulation time, which provides frame-by-frame stepping without changing the game code for any title that uses the system clock for physics and interactions.
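The interface can be pictured with a hypothetical Gymnasium-style skeleton. This is not NitroGen's actual API: the clock hook is replaced by a stub frame counter, and the action layout is assumed from the description below.

```python
import numpy as np

class GameEnv:
    """Minimal Gymnasium-style wrapper sketch for a clock-hooked game.

    Hypothetical: the real simulator intercepts the game's system clock
    to advance physics one frame at a time; here a counter stands in.
    """

    FRAME_SHAPE = (256, 256, 3)   # a single RGB frame per observation

    def __init__(self):
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        obs = np.zeros(self.FRAME_SHAPE, dtype=np.uint8)
        return obs, {}

    def step(self, action):
        # action: dict of 16 binary buttons and 4 joystick floats
        assert action["buttons"].shape == (16,)
        assert action["sticks"].shape == (4,)
        self.t += 1   # advance exactly one simulated frame
        obs = np.zeros(self.FRAME_SHAPE, dtype=np.uint8)
        reward, terminated, truncated = 0.0, False, False
        return obs, reward, terminated, truncated, {}

env = GameEnv()
obs, info = env.reset()
obs, r, term, trunc, info = env.step(
    {"buttons": np.zeros(16), "sticks": np.zeros(4)}
)
```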

Observations are single RGB frames. Actions live in a unified controller space: a 16-dimensional binary vector of gamepad buttons, covering the four d-pad directions, four face buttons, two shoulder buttons, two thumbstick clicks, start, and back, plus a continuous 4-vector of joystick positions, left and right x, y. This unified structure is what allows a single policy to transfer directly across multiple games.
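A minimal sketch of packing and unpacking this shared action vector; the exact button ordering is a hypothetical choice, since the article does not specify it:

```python
import numpy as np

# Hypothetical layout of the shared action vector per time step:
# 16 binary button states followed by 4 joystick floats (lx, ly, rx, ry).
N_BUTTONS, N_STICKS = 16, 4

def pack_action(buttons, sticks):
    """Concatenate button states and stick positions into one vector."""
    return np.concatenate([buttons.astype(np.float32),
                           sticks.astype(np.float32)])

def unpack_action(vec):
    """Split the shared vector back into booleans and stick floats."""
    return vec[:N_BUTTONS] > 0.5, vec[N_BUTTONS:]

buttons = np.zeros(N_BUTTONS)
buttons[0] = 1                    # press button index 0 (hypothetical mapping)
sticks = np.array([0.0, 1.0, 0.0, 0.0])   # left stick pushed fully up
vec = pack_action(buttons, sticks)        # shape (20,)
pressed, stick_vals = unpack_action(vec)
```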

The test suite includes 10 commercial games and 30 tasks. There are 5 2D games, 3 side scrollers and 2 top-down roguelikes, and 5 3D games, 2 open-world games, 2 combat-oriented action RPGs, and 1 sports title. The tasks break down into 11 combat tasks, 10 navigation tasks, and 9 game-specific tasks with custom objectives.

Construction of the NitroGen model

NitroGen's base policy follows the GR00T N1 architecture. It discards the language and state embeddings and keeps the vision encoder and a single action head. The input is a single RGB frame at 256 by 256 resolution, which the SigLIP 2 vision transformer encodes into 256 image tokens.

The diffusion transformer, DiT, predicts chunks of 16 future actions. During training, noisy action chunks are embedded by a multi-layer perceptron into action tokens, processed by a stack of DiT blocks with self-attention and cross-attention to the visual tokens, and then decoded back into continuous action vectors. The training objective is conditional flow matching, with 16 denoising steps used to generate each 16-action chunk.
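The conditional flow matching objective can be sketched in a few lines of numpy, assuming straight-line (rectified-flow style) interpolation paths and using a stand-in predictor in place of the DiT; the chunk and action dimensions follow the figures quoted in this article:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(predict_velocity, x1, t):
    """Conditional flow matching loss on one action chunk.

    x1: clean chunk of shape (16, 21) -- 16 future steps x 21 action
    dims (per the model card). Sample noise x0 ~ N(0, I), form the
    straight-line point x_t = (1 - t) * x0 + t * x1, and regress the
    model's velocity prediction toward the target x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)
    xt = (1.0 - t) * x0 + t * x1
    target_v = x1 - x0
    pred_v = predict_velocity(xt, t)
    return float(np.mean((pred_v - target_v) ** 2))

x1 = rng.standard_normal((16, 21))
# Stand-in "model" that predicts zero velocity, just to exercise the loss.
loss = flow_matching_loss(lambda xt, t: np.zeros_like(xt), x1, t=0.5)
```

At inference, the learned velocity field would be integrated over 16 steps from pure noise to produce an action chunk.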

The released model has 4.93 × 10^8 parameters. The model card defines the output as a 21 by 16 tensor, where 17 dimensions correspond to binary button states and 4 dimensions store the two two-dimensional joystick vectors, over the next 16 time steps. This layout is compatible with the unified action space, up to a reshaping of the joystick components.
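Splitting that output tensor back into buttons and joysticks is a simple slice and reshape; the row ordering below, buttons first, is an assumption for illustration rather than a documented detail:

```python
import numpy as np

# Model output per the card: a 21 x 16 tensor over the next 16 time steps.
out = np.zeros((21, 16), dtype=np.float32)
out[17:, :] = 0.5    # fill the (assumed) joystick rows for the demo

buttons = out[:17, :]                    # 17 binary button states per step
sticks = out[17:, :].reshape(2, 2, 16)   # (stick, axis, time): left/right x, y
```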

Training outcomes and transfer benefits

NitroGen is trained purely with large-scale behavior cloning on the internet video dataset. There is no reinforcement learning and no reward design in the base model. Image augmentations include random brightness, contrast, saturation, hue, small rotations, and random crops. Training uses AdamW with a weight decay of 0.001, a warmup-stable-decay learning rate schedule with a constant phase at 0.0001, and an exponential moving average of the weights with a decay of 0.9999.
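The weight averaging with decay 0.9999 reads as a standard exponential moving average (EMA) over model parameters; a minimal sketch, with the parameter container shape chosen for illustration:

```python
import numpy as np

class EMA:
    """Exponential moving average of model weights, decay 0.9999
    as reported for NitroGen's training."""

    def __init__(self, params, decay=0.9999):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v

params = {"w": np.array([1.0])}
ema = EMA(params)
params["w"][:] = 0.0    # weights change after an optimizer step
ema.update(params)      # shadow moves only slightly toward the new value
```

The shadow weights, not the raw training weights, are typically what gets evaluated and released.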

After pre-training on the full dataset, the 500M-parameter NitroGen already achieves non-trivial zero-shot completion rates on all games in the benchmark. Average completion rates fall roughly between 45 and 60 percent across combat, navigation, and game-specific tasks, and across both 2D and 3D games, despite the noise in internet-sourced supervision.

To measure transfer to unseen games, the research team holds out a game, pre-trains on the remaining data, and then fine-tunes on the held-out game under a fixed data and compute budget. In an isometric roguelike, fine-tuning from NitroGen yields a relative improvement of about 10 percent over training from scratch. In a 3D action RPG the average gain is about 25 percent, and on some combat tasks in a low-data regime, 30 hours, the relative improvement reaches 52 percent.

Key Takeaways

  • NitroGen is a foundation model for generalist gaming agents: it maps 256×256 RGB frames directly to gamepad actions and is trained with pure behavior cloning on online gameplay video, without reinforcement learning.
  • The dataset is large-scale and automatically labeled from controller overlays: NitroGen uses 40,000 hours of filtered gameplay from 38,739 videos across over 1,000 games, with frame-level actions extracted from on-screen controller overlays by a SegFormer-based parsing pipeline.
  • A unified controller action space enables cross-game transfer: actions are represented in a shared space of approximately 20 dimensions per time step, combining binary gamepad buttons and continuous joystick vectors, which lets a single policy run across multiple commercial Windows games through the universal Gymnasium-style simulator.
  • The policy is a diffusion transformer trained with conditional flow matching: a 4.93 × 10^8 parameter model with a SigLIP 2 vision encoder and a DiT-based action head, trained on 16-step action chunks, achieves robust control from noisy web-scale data.
  • Pre-training on NitroGen improves downstream fine-tuning: under matched data and compute budgets, fine-tuning from NitroGen yields consistent relative gains over training from scratch, about 10 to 25 percent on average and up to 52 percent on low-data combat tasks.

Check out the Paper and the Model.


Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.
