Error with vision

#1
by olafgeibig - opened

ValueError: [conv] Invalid input array with type uint32. Convolution currently only supports floating point types

mlx_vlm.generate --model mlx-community/gemma-3n-E4B-it-5bit --max-tokens 100 --temperature 0.0 --prompt "Describe this image." --image ~/Desktop/Screenshot\ 2025-06-30\ at\ 11.04.00.png
chat_template.jinja: 1.63kB [00:00, 3.30MB/s]                                                         | 0/12 [00:00<?, ?it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 41.25it/s]
Using `use_fast=True` but `torchvision` is not available. Falling back to the slow image processor.
==========
Files: ['/Users/GEO5BE4/Desktop/Screenshot 2025-06-30 at 11.04.00.png']

Prompt: <bos><start_of_turn>user
<image_soft_token>Describe this image.<end_of_turn>
<start_of_turn>model

Traceback (most recent call last):
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/bin/mlx_vlm.generate", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/generate.py", line 181, in main
    output = generate(
             ^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/utils.py", line 1381, in generate
    for response in stream_generate(model, processor, prompt, image, audio, **kwargs):
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/utils.py", line 1273, in stream_generate
    for n, (token, logprobs) in enumerate(
                                ^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/utils.py", line 1117, in generate_step
    outputs = model(input_ids, pixel_values, cache=cache, mask=mask, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/gemma3n.py", line 276, in __call__
    inputs_embeds = self.get_input_embeddings(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/gemma3n.py", line 168, in get_input_embeddings
    image_features = self.get_image_features(pixel_values)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/gemma3n.py", line 216, in get_image_features
    vision_outputs = self.vision_tower(
                     ^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/vision.py", line 988, in __call__
    return self.timm_model(x, output_hidden_states)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/vision.py", line 955, in __call__
    x = self.conv_stem(x)
        ^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/vision.py", line 279, in __call__
    c = self.conv(x)
        ^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx/nn/layers/convolution.py", line 157, in __call__
    y = mx.conv2d(
        ^^^^^^^^^^
ValueError: [conv] Invalid input array with type uint32. Convolution currently only supports floating point types
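
The message does not say which array passed to the convolution is `uint32`: it could be the image input or the layer's weight. A minimal sketch for narrowing that down is below. It uses NumPy arrays as stand-ins for the MLX arrays so it runs anywhere, and the helper name `non_float_arrays` and the array shapes are hypothetical, not part of mlx-vlm.

```python
import numpy as np


def non_float_arrays(**arrays):
    """Return {name: dtype string} for every array whose dtype is not floating point."""
    return {
        name: str(arr.dtype)
        for name, arr in arrays.items()
        if not np.issubdtype(arr.dtype, np.floating)
    }


# Stand-ins (assumed shapes): an image batch left as integers and a float conv weight.
pixel_values = np.zeros((1, 8, 8, 3), dtype=np.uint32)
weight = np.zeros((4, 3, 3, 3), dtype=np.float32)

bad = non_float_arrays(pixel_values=pixel_values, weight=weight)
print(bad)  # {'pixel_values': 'uint32'}

# If the offender is the image input, casting it to a floating type clears the check.
pixel_values = pixel_values.astype(np.float32)
print(non_float_arrays(pixel_values=pixel_values, weight=weight))  # {}
```

A cast like this only makes sense if the input is the offender. MLX stores quantized weights packed into `uint32`, so if the weight turns out to be the `uint32` array, a possible (unconfirmed) explanation is that the vision conv layer was quantized in this checkpoint, and casting would not be a valid fix.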
