Instella-T2I✨: Pushing the Limits of 1D Discrete Latent Space Image Generation

Figure 1. Text-to-image diffusion and auto-regressive generation samples by Instella-T2I.

Instella-T2I v0.1 is the first text-to-image model in the Instella model family. The model is trained on AMD Instinct™ MI300X GPUs. Thanks to the large VRAM of the AMD Instinct™ MI300X accelerator and the compact 1D binary latent space adopted in Instella-T2I v0.1, we can fit a batch of 4096 images on a single compute node with 8 AMD Instinct™ MI300X accelerators, achieving a training throughput of over 220 images per second per GPU. This allows us to complete the full model training within 200 MI300X GPU-days.

Training Instella-T2I from scratch on AMD Instinct MI300X GPUs demonstrates the platform’s capability and scalability for a broad range of AI workloads, including computationally intensive text-to-image diffusion models.

Model Architecture

Instella-T2I introduces a 1D binary latent space: a fully Transformer-based tokenizer encodes images into a discrete 1D binary latent representation for fast, high-fidelity image generation. Unlike traditional 2D latent grids, which allocate capacity uniformly across spatial locations and suffer from redundancy, the 1D tokenizer removes spatial constraints and compresses information more efficiently. A 512×512 image is represented by 128 latent tokens, each consisting of 128 binary elements, an 8× reduction in token count compared to standard VQ-VAEs.
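
To make the latent shape concrete, below is a minimal sketch of a Bernoulli-sampled binary bottleneck with a straight-through gradient; this is an illustrative assumption, and the released tokenizer may use a different quantization scheme.

import torch
import torch.nn as nn

class BinaryBottleneck(nn.Module):
    """Sketch of a 1D binary latent bottleneck (illustrative only; not the
    released implementation)."""

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, 128 tokens, 128 channels) from the Transformer encoder
        probs = torch.sigmoid(feats)        # per-bit probabilities
        hard = torch.bernoulli(probs)       # sampled {0, 1} codes
        # straight-through estimator: the forward pass uses the hard bits,
        # the backward pass flows gradients through the probabilities
        return hard + probs - probs.detach()

# a single 512x512 image maps to a (1, 128, 128) binary latent
codes = BinaryBottleneck()(torch.randn(1, 128, 128))
print(codes.shape)  # torch.Size([1, 128, 128]), values in {0, 1}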

Figure 2. Architecture of the 1D binary tokenizer.

Instella-T2I image generation models employ a dual-stream Transformer architecture: a frozen AMD OLMo-1B decoder-only language model extracts text features, while a separate Transformer stream generates images by conditioning on aligned text features via joint self-attention. We release both diffusion and auto-regressive generative models, using bidirectional and causal attention, respectively, with learnable position embeddings to account for the non-spatial nature of 1D tokens.
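
The sketch below illustrates how joint self-attention over concatenated text and image tokens can be written in PyTorch, with a bidirectional mode for the diffusion model and a causal mode for the AR model. The module name, dimensions, and masking details are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    """Illustrative joint self-attention over concatenated text and image tokens."""

    def __init__(self, dim: int = 1024, heads: int = 16, num_img_tokens: int = 128):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # learnable position embeddings for the non-spatial 1D image tokens
        self.img_pos = nn.Parameter(torch.zeros(1, num_img_tokens, dim))

    def forward(self, text_feats, img_tokens, causal: bool = False):
        x = torch.cat([text_feats, img_tokens + self.img_pos], dim=1)
        mask = None
        if causal:
            # AR variant: causal mask over the concatenated sequence;
            # the diffusion variant attends bidirectionally (mask=None)
            L = x.size(1)
            mask = torch.ones(L, L, dtype=torch.bool, device=x.device).triu(1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out[:, text_feats.size(1):]  # return the updated image tokens

# bidirectional (diffusion-style) vs. causal (AR-style) usage
block = JointAttentionBlock()
txt, img = torch.randn(2, 77, 1024), torch.randn(2, 128, 1024)
print(block(txt, img).shape, block(txt, img, causal=True).shape)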

Figure 3. Architecture of the image generation models.

Model Training

The 1D binary latent image tokenizer is trained exclusively using the LAION-COCO dataset.

The training of the image generation models adopts a two-stage recipe. In stage one, the model is pretrained on the LAION-COCO dataset. In stage two, the data is augmented with synthetic image–text pairs, with a 3:1 ratio of LAION to synthetic data. The synthetic data consists of Dalle-1M and images generated with public open-source models, using each model's default hyperparameters and captions from DiffusionDB as prompts (one image per prompt per model). In summary, our training data is built from LAION-COCO, Dalle-1M, DiffusionDB captions, and images generated by open models.
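
As a rough illustration of the stage-two mixture described above, a batch sampler keeping the 3:1 ratio might look like the following; the function and pool names are hypothetical, not taken from the released training scripts.

import random

def sample_stage2_batch(laion_pool, synthetic_pool, batch_size=256, laion_fraction=0.75):
    """Draw a mixed batch with a 3:1 ratio of LAION-COCO to synthetic pairs
    (illustrative only)."""
    n_laion = round(batch_size * laion_fraction)   # 3 parts web data
    n_synth = batch_size - n_laion                 # 1 part synthetic data
    return random.sample(laion_pool, n_laion) + random.sample(synthetic_pool, n_synth)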

▶️ Running the Models

First, install PyTorch following the instructions for your operating system. For AMD GPUs, you can also start from a rocm/pytorch Docker image.

To install the recommended packages, run:

git clone https://github.com/AMD-AIG-AIMA/Instella-T2I.git
cd Instella-T2I
# install Flash-Attention on MI300X
GPU_ARCH=gfx942 MAX_JOBS=$(nproc) pip install git+https://github.com/Dao-AILab/flash-attention.git -v
# install other dependencies
pip install -r requirements.txt
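
Optionally, verify that PyTorch can see the accelerators before running the models; on ROCm builds of PyTorch, AMD GPUs are exposed through the torch.cuda API.

import torch

# On ROCm builds of PyTorch, AMD GPUs are reported through the CUDA API
print(torch.cuda.is_available())   # True if the MI300X devices are visible
print(torch.cuda.device_count())   # e.g. 8 on a single MI300X node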

Use the test_diff.py and test_ar.py scripts provided in the GitHub repository to run image generation in interactive mode with the diffusion and AR models, respectively.

The inference scripts will automatically download the checkpoints to the path specified by --ckpt_path.

python test_diff.py --ckpt_path DESIRED_PATH_TO_MODELS
python test_ar.py --ckpt_path DESIRED_PATH_TO_MODELS

Specifying hyperparameters

To specify hyperparameters, run:

python test_diff.py \
    --ckpt_path DESIRED_PATH_TO_MODELS \
    --cfg_scale 9.0 \
    --temp 0.8 \
    --num_steps 50
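
Judging from the flag names, --cfg_scale controls the classifier-free guidance strength, --temp the sampling temperature, and --num_steps the number of sampling steps. For reference, classifier-free guidance is conventionally applied per step as sketched below; this is a generic formulation, not the exact code in test_diff.py.

import torch

def apply_cfg(pred_uncond: torch.Tensor, pred_cond: torch.Tensor, cfg_scale: float = 9.0) -> torch.Tensor:
    """Standard classifier-free guidance combination (illustrative):
    extrapolate from the unconditional prediction toward the text-conditional one."""
    return pred_uncond + cfg_scale * (pred_cond - pred_uncond)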

Evaluation

We evaluate the generated images using GenEval to assess compositionality, and CLIP and ImageReward (IR) scores to evaluate text-image alignment and quality; a minimal CLIP-score sketch follows the table below. Our 1B models achieve performance competitive with modern text-to-image models, most of which are trained on massive private datasets.

| Model | Size | Reso. | Single Obj. | Two Obj. | Counting | Colors | Color Attr. | Position | Overall ↑ | CLIP ↑ | IR ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SDv1.5 | 0.9B | 512 | 0.97 | 0.38 | 0.35 | 0.76 | 0.06 | 0.04 | 0.43 | 0.318 | 0.201 |
| SDv2.1 | 0.9B | 512 | 0.98 | 0.51 | 0.44 | 0.85 | 0.17 | 0.07 | 0.50 | 0.338 | 0.372 |
| PixArt‑α | 0.6B | 1024 | 0.98 | 0.50 | 0.44 | 0.80 | 0.07 | 0.08 | 0.48 | 0.321 | 0.871 |
| PixArt‑σ | 0.6B | 1024 | 0.98 | 0.59 | 0.50 | 0.80 | 0.15 | 0.10 | 0.52 | 0.325 | 0.872 |
| SDXL | 2.6B | 1024 | 0.98 | 0.74 | 0.39 | 0.85 | 0.23 | 0.15 | 0.55 | 0.335 | 0.600 |
| SD3‑Medium | 8.0B | 1024 | 0.97 | 0.89 | 0.69 | 0.82 | 0.47 | 0.34 | 0.69 | 0.334 | 0.871 |
| Chameleon | 7.0B | 512 | – | – | – | – | – | – | 0.39 | – | – |
| Emu3 | 8.0B | 1024 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 0.333 | 0.872 |
| Instella AR | 1.0B | 512 | 0.97 | 0.45 | 0.43 | 0.72 | 0.15 | 0.07 | 0.46 | 0.318 | 0.602 |
| Instella Diff. | 1.0B | 512 | 0.99 | 0.78 | 0.55 | 0.85 | 0.32 | 0.23 | 0.62 | 0.334 | 0.840 |
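
CLIP score is typically computed as the cosine similarity between CLIP text and image embeddings. Below is a minimal sketch using the Hugging Face transformers CLIP implementation; the checkpoint shown and the exact evaluation pipeline used for the table are assumptions.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative CLIP-score computation (checkpoint choice is an assumption)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()   # cosine similarity between image and text

# example usage: clip_score(Image.open("sample.png"), "a red cube on a blue sphere")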

Summary

Instella-T2I v0.1 is our first 1-billion-parameter text-to-image model and the newest member of the open-source Instella family. Trained end-to-end on AMD Instinct MI300X GPUs without any in-house private data, Instella-T2I v0.1 delivers performance competitive with today’s leading image generation models.

Every component of the release is fully open: model weights, training scripts, configuration files, and data recipes. By sharing the entire stack, we aim to accelerate community research, reproducibility, and creative experimentation. We invite researchers, developers, and practitioners to examine, extend, and build upon Instella-T2I, advancing the model and the broader field of open image generation models together.

Looking ahead, we are expanding in several directions: higher resolution output, stronger compositional reasoning, longer and multimodal prompts, and larger model scales paired with richer data mixtures. Follow our upcoming posts to see how the Instella-T2I series evolves—and help shape where it goes next.

Bias, Risks, and Limitations

  • The models are being released for research purposes only and are not intended for use cases that require high levels of visual fidelity or factual accuracy, safety-critical situations, health or medical applications, generating misleading images, or facilitating toxic or harmful imagery.
  • Model checkpoints are made accessible without any safety promises. It is crucial for users to conduct comprehensive evaluations and implement appropriate safety filtering mechanisms as per their respective use cases.
  • It may be possible to prompt the model to generate images that are factually inaccurate, harmful, violent, toxic, biased, or otherwise objectionable. Such content may also be produced by prompts that did not intend to generate such output. Users are therefore requested to be aware of this and exercise caution and responsible judgment when using the model.
  • The model’s multi-lingual abilities have not been tested and thus it may misunderstand prompts in different languages and generate erroneous or unintended images.

License

See Files for license and any notices.

Contributors

Core contributors: Ze Wang, Hao Chen, Benran Hu, Zicheng Liu

Contributors: Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Yusheng Su, Emad Barsoum

📖 Citation

If you find this project helpful for your research, please consider citing us:

@article{instella-t2i,
  title={Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation},
  author={Wang, Ze and Chen, Hao and Hu, Benran and Liu, Jiang and Sun, Ximeng and Wu, Jialian and Su, Yusheng and Yu, Xiaodong and Barsoum, Emad and Liu, Zicheng},
  journal={arXiv preprint arXiv:2506.21022},
  year={2025}
}