Model Card for yfqiu-nlp/chameleon-world-model-aurora-bootstrap

This model is a LoRA adapter for image editing, as presented in Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models. It's designed to be used with the base model leloy/Anole-7b-v0.1-hf.

Model Details

Model Description

  • Developed by: Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, and Edoardo M. Ponti
  • Shared by: Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, and Edoardo M. Ponti
  • Model type: LoRA adapter for image-to-image generation
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Finetuned from model: leloy/Anole-7b-v0.1-hf

Model Sources

  • Repository: https://github.com/dmis-lab/Monet
  • Paper: https://arxiv.org/abs/2506.06006

Uses

Direct Use

Image editing.

Out-of-Scope Use

The model is not intended for use cases that involve generating malicious content.

Bias, Risks, and Limitations

The model may exhibit biases present in the training data.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

How to Get Started with the Model

Use the code below to get started with the model.

Please see https://github.com/dmis-lab/Monet for sample usage.

Training Details

Training Data

The model was trained on a combination of synthetic data generated from a dynamics model and a small amount of real-world data.

Training Procedure

Preprocessing

The training data was preprocessed by tokenizing the trajectories and computing weights based on importance scores from a recognition model.
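The card does not spell out how the importance weights enter training. One common pattern, shown here as an assumption rather than the paper's exact recipe, is to scale each trajectory's token-level cross-entropy loss by its importance score:

```python
import math
import torch
import torch.nn.functional as F


def weighted_lm_loss(logits: torch.Tensor,
                     labels: torch.Tensor,
                     weights: torch.Tensor) -> torch.Tensor:
    """Importance-weighted language-modeling loss (illustrative sketch).

    logits:  (batch, seq_len, vocab) model outputs over tokenized trajectories
    labels:  (batch, seq_len) target token ids
    weights: (batch,) importance scores, e.g. from a recognition model
    """
    # Per-token loss, kept unreduced so whole trajectories can be weighted.
    per_token = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    per_example = per_token.mean(dim=1)  # (batch,)
    # Normalized importance-weighted average over the batch.
    return (weights * per_example).sum() / weights.sum()
```

With uniform logits over a vocabulary of size V, the loss reduces to log V regardless of the weights, which makes the function easy to sanity-check.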

Training Hyperparameters

  • Training regime: bfloat16 mixed precision

Evaluation

Testing Data, Factors & Metrics

Testing Data

AURORA-Bench

Factors

Real-world and synthetic subsets of AURORA-Bench

Metrics

GPT-4o-as-judge, human evaluation

Results

The model achieves performance competitive with state-of-the-art image editing models, outperforming them by a 15% margin on the real-world subsets of AURORA-Bench according to GPT-4o-as-judge.

Environmental Impact

  • Hardware Type: A100
  • Hours used: Unknown
  • Cloud Provider: Unknown
  • Compute Region: Unknown
  • Carbon Emitted: Unknown

Technical Specifications

Model Architecture and Objective

The model is based on a vision-and-language foundation model that is fine-tuned with supervision to acquire a dynamics model, from which a world model is bootstrapped.

Compute Infrastructure

Hardware

A100 GPUs

Citation

BibTeX:

@misc{qiu2025bootstrapping,
      title={Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models}, 
      author={Yifu Qiu and Yftah Ziser and Anna Korhonen and Shay B. Cohen and Edoardo M. Ponti},
      year={2025},
      eprint={2506.06006},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Framework versions

  • PEFT 0.13.0