World and Human Action Model (WHAM)
📄 Paper • 🔗 Sample Data

Muse is powered by a World and Human Action Model (WHAM), which is a generative model of gameplay (visuals and/or controller actions) trained on gameplay data of Ninja Theory’s Xbox game Bleeding Edge. Model development was informed by requirements of game creatives that we identified through a user study. Our goal is to explore the capabilities that generative AI models need to support human creative exploration. WHAM is developed by the Game Intelligence group at Microsoft Research, in collaboration with TaiX and Ninja Theory.
Model Card
WHAM is an autoregressive model that has been trained to predict (tokenized) game visuals and controller actions given a prompt. A prompt can consist of game visuals (one or more initial frames), controller actions, or both. This allows the user to run the model as (a) a world model (generate visuals given controller actions), (b) a behavior policy (generate controller actions given past visuals), or (c) a full generator of both visuals and controller actions.
WHAM consists of two components, an encoder-decoder VQ-GAN trained to encode game visuals to a discrete representation, and a transformer backbone trained to perform next-token prediction. We train both components from scratch. The resulting model can generate consistent game sequences, and shows evidence of capturing the 3D structure of the game environment, the effects of controller actions, and the temporal structure of the game (up to the model’s context length).
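To make the two-stage design concrete, the sketch below shows how such a pipeline is typically wired together. All of the names used here (`vqgan`, `transformer`, `encode_action`, `sample_next_pair`, and so on) are hypothetical placeholders, not the released WHAM API; see `run_dreaming.py` for the actual entry point.

```python
# Hypothetical sketch of a WHAM-style two-stage generation loop.
# `vqgan` and `transformer` stand in for the two trained components; the
# method names used here are placeholders, not the released WHAM API.

def dream(vqgan, transformer, prompt_frames, prompt_actions, num_steps):
    """Roll out future frames from a visual + controller-action prompt (illustrative only)."""
    # 1) Tokenize the prompt: interleave image tokens and action tokens.
    tokens = []
    for frame, action in zip(prompt_frames, prompt_actions):
        tokens.extend(vqgan.encode(frame))                 # image -> discrete token ids
        tokens.extend(transformer.encode_action(action))   # controller action -> token ids

    # 2) Autoregressively sample the next (observation, action) pairs and
    #    decode the image tokens back to pixels with the VQ-GAN decoder.
    frames = []
    for _ in range(num_steps):
        image_tokens, action_tokens = transformer.sample_next_pair(tokens)
        tokens.extend(image_tokens + action_tokens)
        frames.append(vqgan.decode(image_tokens))
    return frames
```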
WHAM was trained on human gameplay data to predict game visuals and players’ controller actions. We worked with the game studio Ninja Theory and their game Bleeding Edge – a 3D, 4v4 multiplayer video game. From the resulting data we extracted one year’s worth of anonymized gameplay from 27,990 players, capturing a wide range of behaviors and interactions. A sample of this data is provided here.
Model Details
Trained Models
In this release we provide the weights of two WHAM instances: 200M WHAM and 1.6B WHAM. Both have been trained from scratch on the same data set. 1.6B WHAM is evaluated in our paper. We additionally provide 200M WHAM as a more lightweight option for faster explorations.
- WHAM with 200M parameters, model size: 3.7GB
- WHAM with 1.6B parameters, model size: 18.9GB
Usage
System Requirements
The steps below have been tested on the following setups:
- Linux workstation with Ubuntu 20.04.4 LTS
- Windows 11 workstation running WSL2 with Ubuntu 20.04.6 LTS
The current setup assumes that a CUDA-supported GPU is available for model inference. This has been tested on systems with an NVIDIA RTX A6000 and an NVIDIA A100, respectively. In addition, approximately 15GB of free hard disk space is required for downloading the models.
The steps under Installation assume a Python 3.9 installation that can be called using the command `python3.9`, and the `venv` package for creating virtual environments. If either of these is not present, you can install them under Ubuntu using:
sudo apt install python3.9
sudo apt install python3.9-venv
If you are using the WHAM Demonstrator, please ensure that you have the required .NET Core Runtime. If this is not yet installed, an error message will pop up from which you can follow a link to download and install this package.
Installation
- Clone this repository. We recommend starting without the large model files, using:
  GIT_LFS_SKIP_SMUDGE=1 git clone [email protected]:microsoft/WHAM
  cd WHAM
  ./setup_local.sh
  This will set up a `python3.9` virtual environment and install the required packages (this includes the packages required for the model server). The typical install time should be approximately 5 minutes.
- Run `source venv/bin/activate` whenever you want to run model inference or the model server.
- Download the model from this HuggingFace repository (see the note below):
  - Go to Files and versions and navigate to the `models` folder.
  - Download the model checkpoint. The instructions below assume that the model checkpoints have been downloaded to your local `models` folder.
Note: On Linux systems, you can use `git clone` to clone the entire repository, including large files. Due to a limitation of `git lfs` on Windows, only files up to 4GB are supported, and we recommend downloading the model files manually from the `models` folder.
Local Model Inference
This section assumes that you have followed the installation steps above.
(Optional) Download sample data. For the local inference examples below, we recommend that you start with the `tiny-sample` set of only 4 trajectories for your initial exploration.
You can now run model inference to generate gameplay sequences as follows:
python run_dreaming.py --model_path <path_to_checkpoint.ckpt> --data_path <path_to_sample_data_folder>
To run the 200M parameter (small) model (if you copied the tiny-sample folder to the root directory):
python run_dreaming.py --model_path models/WHAM_200M.ckpt --data_path tiny-sample
This uses the data in `data_path` as initial prompt sequences. The script will create a `dreaming_output` directory containing two files per ground truth data file:
- An `.npz` file that contains a number of entries, the most important of which are:
  - `encoded_decoded_ground_truth_images`: the original context images, encoded and decoded with the VQGAN.
  - `dreamt_images`: the sequence of all dreamt images.
- An `.mp4` file of the context data + dreamt images for easier viewing.
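As a quick way to inspect these outputs, the snippet below (a minimal sketch; the file path is a placeholder and the exact array shapes may differ) loads one of the generated `.npz` files with NumPy and prints the stored arrays:

```python
import numpy as np

# Placeholder path to one of the generated files; adjust to match your dreaming_output directory.
output_file = "dreaming_output/example_trajectory.npz"

with np.load(output_file) as data:
    # List everything stored in the archive.
    for key in data.files:
        print(key, data[key].shape, data[key].dtype)

    # The two most important entries described above.
    context = data["encoded_decoded_ground_truth_images"]  # VQGAN-reconstructed context frames
    dreamt = data["dreamt_images"]                          # frames generated by the model
    print(f"{len(context)} context frames, {len(dreamt)} dreamt frames")
```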
This requires approximately 4.5GB of VRAM on a single A6000, but only uses a batch size of one. To speed up the process, increase the batch size with the `--batch_size` argument. With a single A6000 and `--batch_size 12`, this uses approximately 30GB of VRAM. Generating gameplay sequences from the full 512-video dataset takes around 24 hours.
Please note that the script produces its first output only once the first gameplay sequence has been generated. This may take several minutes on an A6000 GPU, or longer on older-generation GPUs.
See `python run_dreaming.py --help` for different settings.
WHAM Demonstrator
Setting up the Model Server
We have tested the server code as provided on a single Linux machine with four A6000 GPUs (large model), as well as on a Windows machine running Ubuntu under WSL2, equipped with a single GeForce GTX 1080 (small model). Model inference can be run on lower-spec NVIDIA GPUs by reducing the batch size.
The steps below assume that the installation steps above have been followed and that the model files have been downloaded to your local machine.
In your terminal, activate the newly installed virtual environment (if it isn't active already):
source venv/bin/activate
Start the server, pointing it to the model:
python run_server.py --model <path_to_model_file>
To run the 200M parameter (small) model:
python run_server.py --model models/WHAM_200M.ckpt
To run the 1.6B parameter (large) model:
python run_server.py --model models/WHAM_1.6B_v1.ckpt
The server will start and, by default, listen on localhost port 5000 (this can be configured with `--port <port>`).
Note: If you run out of VRAM when running the server, you can reduce the `MAX_BATCH_SIZE` variable in `run_server.py`.
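To quickly confirm that the server is up before connecting the WHAM Demonstrator, you can check that the configured port is accepting connections. This is only an illustrative check using the Python standard library; it assumes the default host and port shown above:

```python
import socket

# Host and port the server listens on by default (see run_server.py / --port).
HOST, PORT = "localhost", 5000

# Try to open a TCP connection; success means the server is accepting connections.
try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"Model server is reachable on {HOST}:{PORT}")
except OSError as err:
    print(f"Could not reach {HOST}:{PORT}: {err}")
```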
Install the WHAM Demonstrator App (Windows only)
After cloning or downloading this repository, navigate to the folder `wham/wham_demonstrator` and start the Windows application `WHAMDemonstrator.exe` within that folder.
Follow the instructions in the provided README.md within WHAM Demonstrator to connect to your model server and get an overview of supported functionality.
Intended Uses
This model and accompanying code are intended for academic research purposes only. WHAM has been trained on gameplay data from a single game, Bleeding Edge, and is intended to be used to generate plausible gameplay sequences resembling this game.
The model is not intended to be used to generate imagery outside of the game Bleeding Edge. Generated images include a watermark and provenance metadata. Do not remove the watermark or provenance metadata.
WHAM can be used in multiple scenarios. The following list illustrates the types of tasks that WHAM can be used for:
- World Model: Visuals are predicted, given a real starting state and action sequence.
- Behaviour Policy: Given visuals, the model predicts the next controller action.
- Full Generation: The model generates both the visuals and the controller actions a human player might take in the game.
Training
Model
- Architecture: A decoder-only transformer that predicts the next token corresponding to an interleaved sequence of observations and actions. The image tokenizer is a VQ-GAN.
- Context length: 10 (observation, action) pairs / 5560 tokens
- Dataset size: The model was trained on data from approximately 500,000 Bleeding Edge games from all seven game maps (over 1 billion observation, action pairs at 10Hz, equivalent to over 7 years of continuous human gameplay). A data sample is provided in bleeding-edge-gameplay-sample. This is the test data used for our evaluation results, and it has the same format as the training data.
- GPUs: 98x H100 GPUs
- Training time: 5 days
Software
Bias, Risks and Limitations
- The training data represents gameplay recordings from a variety of skilled and unskilled players with diverse demographic characteristics. Not all possible player characteristics are represented, and model performance may therefore vary.
- The model, as it is, can only be used to generate visuals and controller inputs. Users should not manipulate images and attempt to generate offensive scenes.
Technical limitations, operational factors, and ranges
Model:
- Trained on a single game; the model is very specialized and not intended for image prompts that are out of context or from other domains.
- Limited context length (10s).
- Limited image resolution (300px x 180px); the model can only generate images at this fixed resolution.
- Generated images and controller actions can be incorrect or unrecognizable.
- Inference time is currently too slow for real-time use.
WHAM Demonstrator:
- Developed as a way to explore potential interactions. This is not intended as a fully-fledged user experience or demo.
Models trained using game data may potentially behave in ways that are unfair, unreliable, or offensive, in turn causing harms. We emphasize that these types of harms are not mutually exclusive. A single model can exhibit more than one type of harm, potentially relating to multiple different groups of people. For example, the output of the model can be nonsensical or might look reasonable but is inaccurate with respect to external validation sources.
Although users can input any image as a starting point, the model is only trained to generate images and controller actions based on the structure of the Bleeding Edge game environment that it has learned from the training data. Out-of-domain inputs lead to unpredictable results. For example, this could include a sequence of images that dissolve into unrecognizable blobs.
Model generations when “out of scope” image elements are introduced will either:
- Dissolve into unrecognizable blobs of color, or
- Morph into game-relevant items such as game characters.
Evaluating WHAM
WHAM is evaluated based on its consistency, diversity, and persistency. Consistency is measured using Fréchet Video Distance (FVD), while diversity is assessed by comparing the marginal distribution of real human actions to those generated by the model using the Wasserstein distance. Persistency is tested using two scenarios: by adding a static power-up object to a game visual and by adding another player character to a game visual used for prompting the model. For detailed evaluation results, see the paper that introduces the model.
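For illustration, the snippet below shows one way such a diversity comparison could look in code, using `scipy.stats.wasserstein_distance` on the marginal distribution of a single controller dimension. The data arrays are synthetic placeholders, and this is not the evaluation code used in the paper:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Placeholder data: marginal values of one controller dimension (e.g. left-stick x),
# for real human gameplay vs. model-generated gameplay. Replace with real data.
rng = np.random.default_rng(0)
human_actions = rng.uniform(-1.0, 1.0, size=10_000)
generated_actions = rng.normal(0.0, 0.5, size=10_000).clip(-1.0, 1.0)

# Wasserstein (earth mover's) distance between the two marginal distributions:
# lower values mean the generated action distribution is closer to human behavior.
distance = wasserstein_distance(human_actions, generated_actions)
print(f"Wasserstein distance between action marginals: {distance:.4f}")
```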
Responsible AI testing
WHAM has been tested with out-of-context prompt images to evaluate the risk of outputting harmful or nonsensical images. The generated image sequences did not retain the initial image, but rather dissolved either into unrecognizable blobs or into scenes resembling the training environment.
License
The model is licensed under the Microsoft Research License.
This work has been funded by Microsoft Research.
Privacy & Ethics Statement
Trademark Notice
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.
Contact Information
For questions, please email [email protected].