Error with vision

by olafgeibig - opened Jun 30
ValueError: [conv] Invalid input array with type uint32. Convolution currently only supports floating point types
mlx_vlm.generate --model mlx-community/gemma-3n-E4B-it-5bit --max-tokens 100 --temperature 0.0 --prompt "Describe this image." --image ~/Desktop/Screenshot\ 2025-06-30\ at\ 11.04.00.png
chat_template.jinja: 1.63kB [00:00, 3.30MB/s]                                                         | 0/12 [00:00<?, ?it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 41.25it/s]
Using `use_fast=True` but `torchvision` is not available. Falling back to the slow image processor.
==========
Files: ['/Users/GEO5BE4/Desktop/Screenshot 2025-06-30 at 11.04.00.png']

Prompt: <bos><start_of_turn>user
<image_soft_token>Describe this image.<end_of_turn>
<start_of_turn>model

Traceback (most recent call last):
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/bin/mlx_vlm.generate", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/generate.py", line 181, in main
    output = generate(
             ^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/utils.py", line 1381, in generate
    for response in stream_generate(model, processor, prompt, image, audio, **kwargs):
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/utils.py", line 1273, in stream_generate
    for n, (token, logprobs) in enumerate(
                                ^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/utils.py", line 1117, in generate_step
    outputs = model(input_ids, pixel_values, cache=cache, mask=mask, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/gemma3n.py", line 276, in __call__
    inputs_embeds = self.get_input_embeddings(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/gemma3n.py", line 168, in get_input_embeddings
    image_features = self.get_image_features(pixel_values)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/gemma3n.py", line 216, in get_image_features
    vision_outputs = self.vision_tower(
                     ^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/vision.py", line 988, in __call__
    return self.timm_model(x, output_hidden_states)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/vision.py", line 955, in __call__
    x = self.conv_stem(x)
        ^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/vision.py", line 279, in __call__
    c = self.conv(x)
        ^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx/nn/layers/convolution.py", line 157, in __call__
    y = mx.conv2d(
        ^^^^^^^^^^
ValueError: [conv] Invalid input array with type uint32. Convolution currently only supports floating point types
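The final error says the array reaching `mx.conv2d` has type uint32, while convolution accepts only floating point input. The check below is a minimal sketch of how that dtype mismatch could be detected and cast away before a convolution. It uses NumPy as a stand-in for the array library, and `ensure_float` is a hypothetical helper, not part of mlx or mlx-vlm. Whether a cast is the right fix for this model, rather than a change in the image preprocessing, is an assumption not confirmed by the log.

```python
import numpy as np


def ensure_float(pixel_values, dtype=np.float32):
    """Return pixel_values as a floating point array, casting only if needed.

    Hypothetical helper for illustration; not an mlx or mlx-vlm function.
    """
    arr = np.asarray(pixel_values)
    if np.issubdtype(arr.dtype, np.floating):
        # Already floating point: leave the array untouched.
        return arr
    # Integer (or other non-float) input: cast to the requested float dtype.
    return arr.astype(dtype)


# Stand-in for an image batch that arrived as uint32, as in the error message.
pixels = np.zeros((1, 8, 8, 3), dtype=np.uint32)
print(pixels.dtype, "->", ensure_float(pixels).dtype)
```

Printing the dtype of `pixel_values` just before the vision tower is called would confirm whether the input is already uint32 at that point or is converted somewhere later in the stack.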