Error with vision
#1, opened by olafgeibig
Running generation with an image input fails with this error:
ValueError: [conv] Invalid input array with type uint32. Convolution currently only supports floating point types
Command and full output:
mlx_vlm.generate --model mlx-community/gemma-3n-E4B-it-5bit --max-tokens 100 --temperature 0.0 --prompt "Describe this image." --image ~/Desktop/Screenshot\ 2025-06-30\ at\ 11.04.00.png
chat_template.jinja: 1.63kB [00:00, 3.30MB/s] | 0/12 [00:00<?, ?it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 41.25it/s]
Using `use_fast=True` but `torchvision` is not available. Falling back to the slow image processor.
==========
Files: ['/Users/GEO5BE4/Desktop/Screenshot 2025-06-30 at 11.04.00.png']
Prompt: <bos><start_of_turn>user
<image_soft_token>Describe this image.<end_of_turn>
<start_of_turn>model
Traceback (most recent call last):
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/bin/mlx_vlm.generate", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/generate.py", line 181, in main
    output = generate(
             ^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/utils.py", line 1381, in generate
    for response in stream_generate(model, processor, prompt, image, audio, **kwargs):
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/utils.py", line 1273, in stream_generate
    for n, (token, logprobs) in enumerate(
                                ^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/utils.py", line 1117, in generate_step
    outputs = model(input_ids, pixel_values, cache=cache, mask=mask, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/gemma3n.py", line 276, in __call__
    inputs_embeds = self.get_input_embeddings(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/gemma3n.py", line 168, in get_input_embeddings
    image_features = self.get_image_features(pixel_values)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/gemma3n.py", line 216, in get_image_features
    vision_outputs = self.vision_tower(
                     ^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/vision.py", line 988, in __call__
    return self.timm_model(x, output_hidden_states)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/vision.py", line 955, in __call__
    x = self.conv_stem(x)
        ^^^^^^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx_vlm/models/gemma3n/vision.py", line 279, in __call__
    c = self.conv(x)
        ^^^^^^^^^^^^
  File "/Users/GEO5BE4/opt/mlx-vlm/.venv/lib/python3.12/site-packages/mlx/nn/layers/convolution.py", line 157, in __call__
    y = mx.conv2d(
        ^^^^^^^^^^
ValueError: [conv] Invalid input array with type uint32. Convolution currently only supports floating point types
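The message says that one of the arrays reaching mx.conv2d has dtype uint32. The traceback does not show whether that array is the image tensor or the convolution weight. One possible cause, not confirmed by this report, is that the 5-bit quantization also packed the vision tower's convolution weights into uint32. A minimal diagnostic sketch for narrowing this down is below. It walks a nested dict/list of arrays and lists every leaf whose dtype is not floating point. The helper names (iter_arrays, non_float_arrays) and the parameter names in the stand-in data are hypothetical and only mirror the module names visible in the traceback.

```python
from types import SimpleNamespace


def iter_arrays(tree, prefix=""):
    """Yield (dotted_name, leaf) for every leaf of a nested dict/list tree."""
    if isinstance(tree, dict):
        for key, value in tree.items():
            name = f"{prefix}.{key}" if prefix else str(key)
            yield from iter_arrays(value, name)
    elif isinstance(tree, (list, tuple)):
        for index, value in enumerate(tree):
            name = f"{prefix}.{index}" if prefix else str(index)
            yield from iter_arrays(value, name)
    else:
        yield prefix, tree


def non_float_arrays(tree):
    """Return {name: dtype_name} for leaves whose dtype is not floating point."""
    found = {}
    for name, leaf in iter_arrays(tree):
        dtype = getattr(leaf, "dtype", None)
        if dtype is None:
            continue  # not an array-like leaf
        # Keep only the last component, e.g. "mlx.core.uint32" -> "uint32".
        dtype_name = str(dtype).split(".")[-1]
        if "float" not in dtype_name:  # float16/float32/bfloat16 all pass
            found[name] = dtype_name
    return found


# Stand-in data so the sketch runs without MLX installed.
# The names and dtypes here are illustrative, not read from the real model.
fake_params = {
    "vision_tower": {
        "conv_stem": {
            "conv": {"weight": SimpleNamespace(dtype="mlx.core.uint32")},
            "bn": {"weight": SimpleNamespace(dtype="mlx.core.float16")},
        }
    },
    "pixel_values": SimpleNamespace(dtype="mlx.core.float32"),
}

suspects = non_float_arrays(fake_params)
print(suspects)
```

On a real setup, the same helper could presumably be pointed at the parameter tree of the loaded model (for an mlx.nn.Module, the nested dict returned by model.parameters()) and at the preprocessed image tensor. If a convolution weight under the vision tower shows up as uint32, the quantized checkpoint would be the likely culprit rather than the input image.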