<a href="https://colab.research.google.com/github/merveenoyan/smol-vision/blob/main/inference_gists/Aria_Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Aria in Transformers

Aria, a powerful vision language model by Rhymes AI, is now available in transformers! It's a 25.6B-parameter model that takes around 32 GB of VRAM as is and around 35 GB of VRAM in total for inference in bfloat16 precision, so an A100 with 40 GB on Colab is enough to run it. You can also opt for lower precision with 4-bit or 8-bit quantization; this notebook shows the 4-bit option.
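
As a rough back-of-the-envelope check for the quantized options, weight memory scales with the number of bits stored per parameter. The sketch below assumes the 25.6B parameter count quoted above and counts weights only; activations, the KV cache, and quantization metadata come on top, so treat the results as lower bounds rather than measurements. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 25.6e9  # parameter count quoted in the text above

# Compare the two quantized precisions mentioned in the text.
for bits in (8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```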

As of Dec 11, the model is merged into transformers main but not yet released, so we install transformers from source.

In [None]:
!pip install -q git+https://github.com/huggingface/transformers.git bitsandbytes accelerate



We can load the model with the `AriaForConditionalGeneration` class and the processor with the `AriaProcessor` class. The `LOAD_4BIT` flag below lets you load the model in 4-bit to save memory.

In [None]:
import requests
import torch
from PIL import Image
from transformers import AriaForConditionalGeneration, AriaProcessor, BitsAndBytesConfig

# Set to True to load the weights in 4-bit and save memory.
LOAD_4BIT = False

# 4-bit quantization settings; only passed to the model when LOAD_4BIT is True.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AriaForConditionalGeneration.from_pretrained(
    "rhymes-ai/Aria",
    device_map="auto",
    quantization_config=bnb_config if LOAD_4BIT else None,
    torch_dtype=torch.bfloat16,
)





We can now run inference. Aria comes with its own chat template.
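
To give a feel for what a chat template does, here is a minimal pure-Python sketch that flattens a list of messages into a single prompt string. It is not Aria's actual template: the marker strings and the `render_chat` function are hypothetical placeholders chosen for illustration, and the real template is the one shipped with the checkpoint and applied by `processor.apply_chat_template`.

```python
# Hypothetical marker strings, for illustration only. The real ones are
# defined by the chat template that ships with the checkpoint.
TURN_START = "<|im_start|>"
TURN_END = "<|im_end|>"
IMAGE_PLACEHOLDER = "<image>"


def render_chat(messages, add_generation_prompt=False):
    """Flatten chat messages into one prompt string."""
    parts = []
    for message in messages:
        parts.append(f"{TURN_START}{message['role']}\n")
        for item in message["content"]:
            if item["type"] == "image":
                # Marks where the image features go in the sequence.
                parts.append(IMAGE_PLACEHOLDER)
            elif item["type"] == "text":
                parts.append(item["text"])
        parts.append(f"{TURN_END}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{TURN_START}assistant\n")
    return "".join(parts)


example_messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": "Describe the image in detail.", "type": "text"},
        ],
    }
]

prompt = render_chat(example_messages, add_generation_prompt=True)
print(prompt)
```

The point of the sketch is the shape of the transformation: each content item becomes either text or an image placeholder, and `add_generation_prompt=True` leaves the prompt ending at the start of the assistant's turn.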

In [None]:
processor = AriaProcessor.from_pretrained("rhymes-ai/Aria")


In [None]:
from PIL import Image
import requests

image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cat.png"

image = Image.open(requests.get(image_path, stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": "Describe the image in detail.", "type": "text"},
        ],
    }
]

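text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
# Cast the floating-point inputs (the image tensors) to bfloat16 to match the
# model weights, then move all inputs to the model's device.
inputs = inputs.to(torch.bfloat16)
inputs = inputs.to(model.device)

output = model.generate(**inputs, max_new_tokens=80)

# Drop the prompt tokens so that only the newly generated tokens are decoded.
output_ids = output[0][inputs["input_ids"].shape[1]:]
response = processor.decode(output_ids, skip_special_tokens=True)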
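
The slicing step in the cell above relies on the generated sequence starting with the prompt tokens, so skipping the first `input_ids.shape[1]` positions leaves only the continuation. Here is a small sketch of the same idea with plain Python lists; the token ids and the `strip_prompt` helper are made up for illustration.

```python
def strip_prompt(sequence, prompt_length):
    """Return only the tokens that come after the prompt."""
    return sequence[prompt_length:]


# Made-up token ids standing in for a tokenized prompt and its continuation.
prompt_ids = [101, 7, 8, 9]
generated_ids = [42, 43, 44]

# A generate-style output: the prompt followed by the new tokens.
full_sequence = prompt_ids + generated_ids

new_tokens = strip_prompt(full_sequence, len(prompt_ids))
print(new_tokens)
```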

In [None]:
response

'The image shows a small, multicolored kitten with white, orange, and black fur lying on a glass surface. The kitten appears to be sleeping peacefully. Behind the kitten, there is an old-fashioned gramophone with a large brass horn and a wooden base. The gramophone is placed on a table or a similar surface. The background features a colorful urban setting'

You can find an interactive demo of Aria [here](https://huggingface.co/spaces/huggingface-projects/Aria).