Model Card for yfqiu-nlp/chameleon-world-model-aurora-bootstrap
This model is a LoRA adapter for image editing, as presented in Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models. It's designed to be used with the base model leloy/Anole-7b-v0.1-hf.
Model Details
Model Description
- Developed by: Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, and Edoardo M. Ponti
- Shared by: Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, and Edoardo M. Ponti
- Model type: LoRA adapter for image-to-image generation
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: leloy/Anole-7b-v0.1-hf
Model Sources
- Repository: https://github.com/dmis-lab/Monet
- Paper: https://huggingface.co/papers/2506.06006
Uses
Direct Use
Image editing: given an input image and an English edit instruction, the adapter (applied on top of leloy/Anole-7b-v0.1-hf) generates the edited image.
Out-of-Scope Use
The model is not intended for use cases that involve generating malicious content.
Bias, Risks, and Limitations
The model may exhibit biases present in the training data.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
How to Get Started with the Model
See the project repository at https://github.com/dmis-lab/Monet for the authors' sample usage.
Training Details
Training Data
The model was trained on a combination of synthetic data generated from a dynamics model and a small amount of real-world data.
Training Procedure
Preprocessing
The training data was preprocessed by tokenizing the trajectories and computing weights based on importance scores from a recognition model.
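The per-token weighting described above might look like the following sketch; the function and argument names are hypothetical, and the exact weighting scheme used by the recognition model is described in the paper and repository.

```python
# Hedged sketch of an importance-weighted trajectory loss.
# `weighted_token_loss` is an illustrative name, not the authors' API.
import math


def weighted_token_loss(token_logprobs, weights):
    """Average negative log-likelihood, with each token's loss scaled by
    an importance weight (e.g. a score from a recognition model)."""
    assert len(token_logprobs) == len(weights)
    total = sum(-lp * w for lp, w in zip(token_logprobs, weights))
    return total / len(token_logprobs)
```

For example, two tokens each assigned probability 0.5 with unit weights yield an average loss of ln 2 per token.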
Training Hyperparameters
- Training regime: bfloat16 mixed precision
Evaluation
Testing Data, Factors & Metrics
Testing Data
AURORA-Bench
Factors
Real-world and synthetic subsets of AURORA-Bench
Metrics
GPT-4o-as-judge, human evaluation
Results
The model is competitive with state-of-the-art image editing models, outperforming them by a 15% margin on the real-world subsets of AURORA-Bench according to GPT-4o-as-judge evaluation.
Environmental Impact
- Hardware Type: A100
- Hours used: Unknown
- Cloud Provider: Unknown
- Compute Region: Unknown
- Carbon Emitted: Unknown
Technical Specifications
Model Architecture and Objective
The model is a vision-and-language foundation model (Anole) fine-tuned with LoRA: a dynamics model is first acquired through supervised fine-tuning and then used to bootstrap a world model, as described in the paper.
Compute Infrastructure
Hardware
A100 GPUs
Citation
BibTeX:
@misc{qiu2025bootstrapping,
  title={Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models},
  author={Yifu Qiu and Yftah Ziser and Anna Korhonen and Shay B. Cohen and Edoardo M. Ponti},
  year={2025},
  eprint={2506.06006},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
Framework versions
- PEFT 0.13.0