Model Card for yfqiu-nlp/chameleon-world-model-aurora-bootstrap

This model is a LoRA adapter for image editing, as presented in Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models. It's designed to be used with the base model leloy/Anole-7b-v0.1-hf.

Model Details

Model Description

  • Developed by: Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, and Edoardo M. Ponti
  • Shared by: Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, and Edoardo M. Ponti
  • Model type: LoRA adapter for image-to-image generation
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Finetuned from model: leloy/Anole-7b-v0.1-hf

Model Sources

  • Repository: https://github.com/dmis-lab/Monet
  • Paper: https://arxiv.org/abs/2506.06006

Uses

Direct Use

Image editing.

Out-of-Scope Use

The model is not intended for use cases that involve generating malicious content.

Bias, Risks, and Limitations

The model may exhibit biases present in the training data.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

How to Get Started with the Model

Use the code below to get started with the model.

Please see https://github.com/dmis-lab/Monet for sample usage.

Training Details

Training Data

The model was trained on a combination of synthetic data generated from a dynamics model and a small amount of real-world data.

Training Procedure

Preprocessing

The training data was preprocessed by tokenizing the trajectories and computing weights based on importance scores from a recognition model.
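The card does not spell out how the importance weights enter training. One common pattern, shown here as an assumption rather than the paper's exact recipe, is to scale each trajectory's token-level cross-entropy loss by its importance score:

```python
import math
import torch
import torch.nn.functional as F


def weighted_lm_loss(logits: torch.Tensor,
                     labels: torch.Tensor,
                     weights: torch.Tensor) -> torch.Tensor:
    """Importance-weighted language-modeling loss (illustrative sketch).

    logits:  (batch, seq_len, vocab) model outputs over tokenized trajectories
    labels:  (batch, seq_len) target token ids
    weights: (batch,) importance scores, e.g. from a recognition model
    """
    # Per-token loss, kept unreduced so whole trajectories can be weighted.
    per_token = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    per_example = per_token.mean(dim=1)  # (batch,)
    # Normalized importance-weighted average over the batch.
    return (weights * per_example).sum() / weights.sum()
```

With uniform logits over a vocabulary of size V, the loss reduces to log V regardless of the weights, which makes the function easy to sanity-check.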

Training Hyperparameters

  • Training regime: bfloat16 mixed precision

Evaluation

Testing Data, Factors & Metrics

Testing Data

AURORA-Bench

Factors

Real-world and synthetic subsets of AURORA-Bench

Metrics

GPT-4o-as-judge, human evaluation

Results

The model achieves performance competitive with state-of-the-art image editing models, outperforming them by a 15% margin on the real-world subsets of AURORA-Bench according to GPT-4o-as-judge.

Environmental Impact

  • Hardware Type: A100
  • Hours used: Unknown
  • Cloud Provider: Unknown
  • Compute Region: Unknown
  • Carbon Emitted: Unknown

Technical Specifications

Model Architecture and Objective

The model is based on a vision-and-language foundation model that is fine-tuned with supervision to acquire a dynamics model, from which a world model is bootstrapped.

Compute Infrastructure

Hardware

A100 GPUs

Citation

BibTeX:

@misc{qiu2025bootstrapping,
      title={Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models}, 
      author={Yifu Qiu and Yftah Ziser and Anna Korhonen and Shay B. Cohen and Edoardo M. Ponti},
      year={2025},
      eprint={2506.06006},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Framework versions

  • PEFT 0.13.0