granite-vision-3.2-2b failing on sglang with LlavaNextForConditionalGeneration not supported

#7
by didduran - opened

Hi,

I have successfully run the 3.1 versions of the granite models on the SGLang project (https://github.com/sgl-project/sglang).

I am now trying to run granite-vision-3.2-2b.

But it fails with the messages below, in particular: Model architectures ['LlavaNextForConditionalGeneration'] are not supported for now.

Will IBM work with the SGLang project to allow this model to run on SGLang as well, so that its inference acceleration can be leveraged? It seems the collaboration has already been working for v3.1; see https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/granite.py

Note: this seems to be specific to granite-vision-3.2-2b, because granite-3.2-2b-instruct works fine.
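For reference, the difference between the two models can be seen in the architectures field SGLang resolves from each repo's config.json. A minimal sketch, assuming transformers is installed and the Hub is reachable (the commented values are my expectation, not verified output):

from transformers import AutoConfig

# Compare the `architectures` entry SGLang reads from each model's config.json.
for repo in ("ibm-granite/granite-3.2-2b-instruct", "ibm-granite/granite-vision-3.2-2b"):
    cfg = AutoConfig.from_pretrained(repo)
    print(repo, "->", getattr(cfg, "architectures", None))

# Expected (roughly):
#   ibm-granite/granite-3.2-2b-instruct -> ['GraniteForCausalLM']                      (in SGLang's registry)
#   ibm-granite/granite-vision-3.2-2b   -> ['LlavaNextForConditionalGeneration']       (not in the registry)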

Thanks,
Didier

bash-5.2# python3.12 -m sglang.launch_server --model ibm-granite/granite-vision-3.2-2b --model-path ibm-granite/granite-vision-3.2-2b --port 30000 --host 0.0.0.0 --log-level debug --trust-remote-code --tensor-parallel-size 4 --enable-p2p-check --disable-cuda-graph
INFO 03-04 09:02:43 __init__.py:190] Automatically detected platform cuda.
[2025-03-04 09:02:46] Setting Triton cache manager to: sglang.srt.utils:CustomCacheManager
[2025-03-04 09:02:46] server_args=ServerArgs(model_path='ibm-granite/granite-vision-3.2-2b', tokenizer_path='ibm-granite/granite-vision-3.2-2b', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='ibm-granite/granite-vision-3.2-2b', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30000, mem_fraction_static=0.85, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=4, stream_interval=1, stream_output=False, random_seed=108653913, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='debug', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=80, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=True, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, return_hidden_states=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False)
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
INFO 03-04 09:02:50 __init__.py:190] Automatically detected platform cuda.
[2025-03-04 09:02:53 TP0] Init torch distributed begin.
[2025-03-04 09:02:53 TP0] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:30779 backend=nccl
[2025-03-04 09:02:53 TP1] Init torch distributed begin.
[2025-03-04 09:02:53 TP3] Init torch distributed begin.
[2025-03-04 09:02:53 TP2] Init torch distributed begin.
[2025-03-04 09:02:53 TP1] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:30779 backend=nccl
[2025-03-04 09:02:53 TP2] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:30779 backend=nccl
[2025-03-04 09:02:53 TP3] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:30779 backend=nccl
[2025-03-04 09:02:53 TP0] Found nccl from library libnccl.so.2
[2025-03-04 09:02:53 TP2] Found nccl from library libnccl.so.2
[2025-03-04 09:02:53 TP0] sglang is using nccl==2.21.5
[2025-03-04 09:02:53 TP2] sglang is using nccl==2.21.5
[2025-03-04 09:02:53 TP1] Found nccl from library libnccl.so.2
[2025-03-04 09:02:53 TP3] Found nccl from library libnccl.so.2
[2025-03-04 09:02:53 TP3] sglang is using nccl==2.21.5
[2025-03-04 09:02:53 TP1] sglang is using nccl==2.21.5
[2025-03-04 09:02:53 TP3] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-03-04 09:02:53 TP2] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-03-04 09:02:53 TP0] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-03-04 09:02:53 TP1] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-03-04 09:02:53 TP0] Binding to tcp://127.0.0.1:51275
[2025-03-04 09:02:53 TP0] Message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<sglang.srt.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f72a6c752e0>, local_subscribe_port=51275, remote_subscribe_port=None)
[2025-03-04 09:02:53 TP3] Connecting to tcp://127.0.0.1:51275
[2025-03-04 09:02:53 TP2] Connecting to tcp://127.0.0.1:51275
[2025-03-04 09:02:53 TP1] Connecting to tcp://127.0.0.1:51275
[2025-03-04 09:02:54 TP2] Load weight begin. avail mem=21.63 GB
[2025-03-04 09:02:54 TP0] Load weight begin. avail mem=21.63 GB
[2025-03-04 09:02:54 TP1] Load weight begin. avail mem=21.63 GB
[2025-03-04 09:02:54 TP3] Load weight begin. avail mem=21.63 GB
[2025-03-04 09:02:54 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 240, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 68, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 194, in __init__
    self.load_model()
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 317, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 357, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 136, in _initialize_model
    model_class, _ = get_model_architecture(model_config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_loader/utils.py", line 37, in get_model_architecture
    return ModelRegistry.resolve_model_cls(architectures)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/models/registry.py", line 65, in resolve_model_cls
    return self._raise_for_unsupported(architectures)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/models/registry.py", line 32, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['LlavaNextForConditionalGeneration'] are not supported for now. Supported architectures: dict_keys(['BaichuanForCausalLM', 'ChatGLMModel', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'DbrxForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'DeepseekV3ForCausalLM', 'ExaoneForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'Gemma2ForSequenceClassification', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GraniteForCausalLM', 'Grok1ForCausalLM', 'Grok1ModelForCausalLM', 'InternLM2ForCausalLM', 'InternLM2ForRewardModel', 'LlamaForCausalLM', 'Phi3ForCausalLM', 'InternLM3ForCausalLM', 'LlamaForClassification', 'LlamaForCausalLMEagle', 'LlamaEmbeddingModel', 'MistralModel', 'LlamaForSequenceClassification', 'LlamaForSequenceClassificationWithNormal_Weights', 'LlavaLlamaForCausalLM', 'LlavaQwenForCausalLM', 'LlavaMistralForCausalLM', 'LlavaVidForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'MiniCPMV', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MllamaForConditionalGeneration', 'OlmoForCausalLM', 'Olmo2ForCausalLM', 'OlmoeForCausalLM', 'Phi3SmallForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2ForCausalLMEagle', 'Qwen2MoeForCausalLM', 'Qwen2VLForConditionalGeneration', 'StableLmForCausalLM', 'TorchNativeLlamaForCausalLM', 'TorchNativePhi3ForCausalLM', 'XverseForCausalLM', 'XverseMoeForCausalLM', 'YiVLForCausalLM'])

Additional question: if I want to try to extend https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/granite.py for vision 3.2 in SGLang, is this code from Transformers a good starting point?

https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/configuration_llava_next.py
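For context, here is what the composite config looks like when loaded with transformers (a quick sketch; the commented notes are my assumptions about what the repo's config contains, not verified output):

from transformers import AutoConfig

# granite-vision-3.2-2b reports model_type "llava_next", so AutoConfig should
# resolve it to a LlavaNextConfig, which nests a vision config and a text config.
cfg = AutoConfig.from_pretrained("ibm-granite/granite-vision-3.2-2b")
print(type(cfg).__name__)            # LlavaNextConfig
print(cfg.vision_config.model_type)  # the vision tower (a CLIP/SigLIP-style encoder)
print(cfg.text_config.model_type)    # the underlying Granite LLM
print(cfg.image_token_index)         # placeholder token id spliced into the prompt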

Thanks,
Didier

IBM Granite org

Thanks for investigating! It looks like the gap here is overall support for the LlavaNext architecture, which is what granite-vision is based on. In broad strokes, this architecture merges a visual encoder with some other LLM architecture (granite in the case of granite-vision). It appears that llava is already supported (here), so I think you're on the right track with looking at the transformers implementation of LlavaNext and going from there.
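To make "merges a visual encoder with an LLM" concrete, here is a rough PyTorch-style sketch of the LlavaNext data flow. All names, shapes, and module boundaries are illustrative assumptions, not SGLang's or transformers' actual API:

import torch
import torch.nn as nn

class LlavaNextStyleSketch(nn.Module):
    """Illustration only: vision tower -> multimodal projector -> splice the
    projected features into the LLM's token embeddings at the positions of
    the image-placeholder token, then run the decoder as usual."""

    def __init__(self, vision_tower: nn.Module, projector: nn.Module,
                 language_model: nn.Module, image_token_id: int):
        super().__init__()
        self.vision_tower = vision_tower      # e.g. a CLIP/SigLIP-style patch encoder
        self.projector = projector            # small MLP: vision dim -> LLM hidden dim
        self.language_model = language_model  # the Granite LLM in granite-vision's case
        self.image_token_id = image_token_id

    def forward(self, input_ids: torch.LongTensor, pixel_values: torch.Tensor):
        # 1) Encode image patches, then project into the LLM embedding space.
        #    (Single image assumed; LlavaNext also tiles the image into multiple
        #    resolutions, which is omitted here for brevity.)
        patch_feats = self.vision_tower(pixel_values)       # (num_patches, vision_dim)
        image_embeds = self.projector(patch_feats)          # (num_patches, hidden_dim)

        # 2) Embed the text tokens as usual.
        text_embeds = self.language_model.get_input_embeddings()(input_ids)

        # 3) Replace embeddings at image-placeholder positions with the projected
        #    image features (assumes one placeholder token per image patch).
        mask = input_ids == self.image_token_id
        text_embeds[mask] = image_embeds.to(text_embeds.dtype)

        # 4) Run the decoder on the merged embedding sequence.
        return self.language_model(inputs_embeds=text_embeds)

In SGLang terms, the text side could presumably reuse the existing GraniteForCausalLM from granite.py, similar to how the existing llava* model files wrap their language backbones.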
