granite-vision-3.2-2b failing on sglang with LlavaNextForConditionalGeneration not supported

#7
by didduran - opened

Hi,

I have successfully run the 3.1 versions of the granite models on the SGLang project (https://github.com/sgl-project/sglang).

I am now trying to run granite-vision-3.2-2b.

But it fails with the messages below, in particular: Model architectures ['LlavaNextForConditionalGeneration'] are not supported for now.

Will IBM work with the SGLang project to allow this model to run on SGLang as well, so that its inference acceleration can be leveraged? It seems the collaboration has already been working for v3.1; see https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/granite.py

Note: this seems to be specific to granite-vision-3.2-2b, because granite-3.2-2b-instruct works fine.
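For reference, the difference between the two models can be seen in the architectures field SGLang resolves from each repo's config.json. A minimal sketch, assuming transformers is installed and the Hub is reachable (the commented values are my expectation, not verified output):

from transformers import AutoConfig

# Compare the `architectures` entry SGLang reads from each model's config.json.
for repo in ("ibm-granite/granite-3.2-2b-instruct", "ibm-granite/granite-vision-3.2-2b"):
    cfg = AutoConfig.from_pretrained(repo)
    print(repo, "->", getattr(cfg, "architectures", None))

# Expected (roughly):
#   ibm-granite/granite-3.2-2b-instruct -> ['GraniteForCausalLM']                      (in SGLang's registry)
#   ibm-granite/granite-vision-3.2-2b   -> ['LlavaNextForConditionalGeneration']       (not in the registry)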

Thanks,
Didier

bash-5.2# python3.12 -m sglang.launch_server --model ibm-granite/granite-vision-3.2-2b --model-path ibm-granite/granite-vision-3.2-2b --port 30000 --host 0.0.0.0 --log-level debug --trust-remote-code --tensor-parallel-size 4 --enable-p2p-check --disable-cuda-graph
INFO 03-04 09:02:43 __init__.py:190] Automatically detected platform cuda.
[2025-03-04 09:02:46] Setting Triton cache manager to: sglang.srt.utils:CustomCacheManager
[2025-03-04 09:02:46] server_args=ServerArgs(model_path='ibm-granite/granite-vision-3.2-2b', tokenizer_path='ibm-granite/granite-vision-3.2-2b', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='ibm-granite/granite-vision-3.2-2b', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30000, mem_fraction_static=0.85, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=4, stream_interval=1, stream_output=False, random_seed=108653913, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='debug', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=80, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=True, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, return_hidden_states=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False)
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
[2025-03-04 09:02:53 TP0] Init torch distributed begin.
[2025-03-04 09:02:53 TP0] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:30779 backend=nccl
[2025-03-04 09:02:53 TP1] Init torch distributed begin.
[2025-03-04 09:02:53 TP3] Init torch distributed begin.
[2025-03-04 09:02:53 TP2] Init torch distributed begin.
[2025-03-04 09:02:53 TP1] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:30779 backend=nccl
[2025-03-04 09:02:53 TP2] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:30779 backend=nccl
[2025-03-04 09:02:53 TP3] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:30779 backend=nccl
[2025-03-04 09:02:53 TP0] Found nccl from library libnccl.so.2
[2025-03-04 09:02:53 TP2] Found nccl from library libnccl.so.2
[2025-03-04 09:02:53 TP0] sglang is using nccl==2.21.5
[2025-03-04 09:02:53 TP2] sglang is using nccl==2.21.5
[2025-03-04 09:02:53 TP1] Found nccl from library libnccl.so.2
[2025-03-04 09:02:53 TP3] Found nccl from library libnccl.so.2
[2025-03-04 09:02:53 TP3] sglang is using nccl==2.21.5
[2025-03-04 09:02:53 TP1] sglang is using nccl==2.21.5
[2025-03-04 09:02:53 TP3] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-03-04 09:02:53 TP2] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-03-04 09:02:53 TP0] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-03-04 09:02:53 TP1] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-03-04 09:02:53 TP0] Binding to tcp://127.0.0.1:51275
[2025-03-04 09:02:53 TP0] Message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<sglang.srt.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f72a6c752e0>, local_subscribe_port=51275, remote_subscribe_port=None)
[2025-03-04 09:02:53 TP3] Connecting to tcp://127.0.0.1:51275
[2025-03-04 09:02:53 TP2] Connecting to tcp://127.0.0.1:51275
[2025-03-04 09:02:53 TP1] Connecting to tcp://127.0.0.1:51275
[2025-03-04 09:02:54 TP2] Load weight begin. avail mem=21.63 GB
[2025-03-04 09:02:54 TP0] Load weight begin. avail mem=21.63 GB
[2025-03-04 09:02:54 TP1] Load weight begin. avail mem=21.63 GB
[2025-03-04 09:02:54 TP3] Load weight begin. avail mem=21.63 GB
[2025-03-04 09:02:54 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 240, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 68, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 194, in __init__
    self.load_model()
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 317, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 357, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 136, in _initialize_model
    model_class, _ = get_model_architecture(model_config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_loader/utils.py", line 37, in get_model_architecture
    return ModelRegistry.resolve_model_cls(architectures)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/models/registry.py", line 65, in resolve_model_cls
    return self._raise_for_unsupported(architectures)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/models/registry.py", line 32, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['LlavaNextForConditionalGeneration'] are not supported for now. Supported architectures: dict_keys(['BaichuanForCausalLM', 'ChatGLMModel', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'DbrxForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'DeepseekV3ForCausalLM', 'ExaoneForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'Gemma2ForSequenceClassification', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GraniteForCausalLM', 'Grok1ForCausalLM', 'Grok1ModelForCausalLM', 'InternLM2ForCausalLM', 'InternLM2ForRewardModel', 'LlamaForCausalLM', 'Phi3ForCausalLM', 'InternLM3ForCausalLM', 'LlamaForClassification', 'LlamaForCausalLMEagle', 'LlamaEmbeddingModel', 'MistralModel', 'LlamaForSequenceClassification', 'LlamaForSequenceClassificationWithNormal_Weights', 'LlavaLlamaForCausalLM', 'LlavaQwenForCausalLM', 'LlavaMistralForCausalLM', 'LlavaVidForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'MiniCPMV', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MllamaForConditionalGeneration', 'OlmoForCausalLM', 'Olmo2ForCausalLM', 'OlmoeForCausalLM', 'Phi3SmallForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2ForCausalLMEagle', 'Qwen2MoeForCausalLM', 'Qwen2VLForConditionalGeneration', 'StableLmForCausalLM', 'TorchNativeLlamaForCausalLM', 'TorchNativePhi3ForCausalLM', 'XverseForCausalLM', 'XverseMoeForCausalLM', 'YiVLForCausalLM'])

Additional question: if I want to try to extend https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/granite.py for vision 3.2 in SGLang, is this code from Transformers a good starting point?

https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/configuration_llava_next.py
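For context, here is what the composite config looks like when loaded with transformers (a quick sketch; the commented notes are my assumptions about what the repo's config contains, not verified output):

from transformers import AutoConfig

# granite-vision-3.2-2b reports model_type "llava_next", so AutoConfig should
# resolve it to a LlavaNextConfig, which nests a vision config and a text config.
cfg = AutoConfig.from_pretrained("ibm-granite/granite-vision-3.2-2b")
print(type(cfg).__name__)            # LlavaNextConfig
print(cfg.vision_config.model_type)  # the vision tower (a CLIP/SigLIP-style encoder)
print(cfg.text_config.model_type)    # the underlying Granite LLM
print(cfg.image_token_index)         # placeholder token id spliced into the prompt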

Thanks,
Didier

IBM Granite org

Thanks for investigating! It looks like the gap here is overall support for the LlavaNext architecture, which is what granite-vision is based on. In broad strokes, this architecture merges a visual encoder with some other LLM architecture (granite in the case of granite-vision). It appears that llava is already supported (here), so I think you're on the right track with looking at the transformers implementation of LlavaNext and going from there.
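To make "merges a visual encoder with an LLM" concrete, here is a rough PyTorch-style sketch of the LlavaNext data flow. All names, shapes, and module boundaries are illustrative assumptions, not SGLang's or transformers' actual API:

import torch
import torch.nn as nn

class LlavaNextStyleSketch(nn.Module):
    """Illustration only: vision tower -> multimodal projector -> splice the
    projected features into the LLM's token embeddings at the positions of
    the image-placeholder token, then run the decoder as usual."""

    def __init__(self, vision_tower: nn.Module, projector: nn.Module,
                 language_model: nn.Module, image_token_id: int):
        super().__init__()
        self.vision_tower = vision_tower      # e.g. a CLIP/SigLIP-style patch encoder
        self.projector = projector            # small MLP: vision dim -> LLM hidden dim
        self.language_model = language_model  # the Granite LLM in granite-vision's case
        self.image_token_id = image_token_id

    def forward(self, input_ids: torch.LongTensor, pixel_values: torch.Tensor):
        # 1) Encode image patches, then project into the LLM embedding space.
        #    (Single image assumed; LlavaNext also tiles the image into multiple
        #    resolutions, which is omitted here for brevity.)
        patch_feats = self.vision_tower(pixel_values)       # (num_patches, vision_dim)
        image_embeds = self.projector(patch_feats)          # (num_patches, hidden_dim)

        # 2) Embed the text tokens as usual.
        text_embeds = self.language_model.get_input_embeddings()(input_ids)

        # 3) Replace embeddings at image-placeholder positions with the projected
        #    image features (assumes one placeholder token per image patch).
        mask = input_ids == self.image_token_id
        text_embeds[mask] = image_embeds.to(text_embeds.dtype)

        # 4) Run the decoder on the merged embedding sequence.
        return self.language_model(inputs_embeds=text_embeds)

In SGLang terms, the text side could presumably reuse the existing GraniteForCausalLM from granite.py, similar to how the existing llava* model files wrap their language backbones.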
