vLLM Deployment Problems
I am trying to run the vllm command with the --lora-modules argument, but I am encountering issues with the lora_extra_vocab_size value and with loading the LoRA modules. Please help with loading the speech and vision LoRA modules.
Steps to Reproduce
I am using the following commands to download the model and get the path to the model directory, so that I can use it when providing the --lora-modules argument to the vllm command.
huggingface-cli download microsoft/Phi-4-multimodal-instruct --include="*.safetensors"
pushd /root/.cache/huggingface/hub/models--microsoft--Phi-4-multimodal-instruct/snapshots/*/ && export MODEL_DIR=$(pwd) && popd
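As an alternative to the pushd/popd globbing, the snapshot path can also be resolved from Python with huggingface_hub; a minimal sketch, assuming huggingface_hub is available (it is pulled in by vLLM) and reusing the already-downloaded files:

# Sketch: resolve MODEL_DIR programmatically instead of globbing the cache path.
# snapshot_download returns the cached snapshot directory without re-downloading
# when the matching files are already present.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    "microsoft/Phi-4-multimodal-instruct",
    allow_patterns=["*.safetensors"],  # same filter as the CLI download above
)
print(model_dir)  # .../models--microsoft--Phi-4-multimodal-instruct/snapshots/<revision>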
First Issue: ValueError: lora_extra_vocab_size (0) must be one of (256, 512).
This suggests that the value --lora-extra-vocab-size 0 given in the vLLM usage instructions is incorrect.
$ python -m vllm.entrypoints.openai.api_server \
--model 'microsoft/Phi-4-multimodal-instruct' \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--distributed-executor-backend mp \
--dtype auto \
--trust-remote-code \
--max-model-len 131072 \
--enable-lora \
--max-lora-rank 320 \
--lora-extra-vocab-size 0 \
--limit-mm-per-prompt audio=3,image=3 \
--max-loras 2 \
--lora-modules speech=${MODEL_DIR}/speech-lora vision=${MODEL_DIR}/vision-lora
INFO 03-31 00:52:49 [__init__.py:256] Automatically detected platform cuda.
INFO 03-31 00:52:50 [api_server.py:977] vLLM API server version 0.8.1
INFO 03-31 00:52:50 [api_server.py:978] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='speech', path='/root/.cache/huggingface/hub/models--microsoft--Phi-4-multimodal-instruct/snapshots/0ae13bd0f508a906f8b8288fc5e36b01b903c132/speech-lora', base_model_name=None), LoRAModulePath(name='vision', path='/root/.cache/huggingface/hub/models--microsoft--Phi-4-multimodal-instruct/snapshots/0ae13bd0f508a906f8b8288fc5e36b01b903c132/vision-lora', base_model_name=None)], prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='microsoft/Phi-4-multimodal-instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=131072, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend='mp', pipeline_parallel_size=1, tensor_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'audio': 3, 'image': 3}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=True, enable_lora_bias=False, max_loras=2, max_lora_rank=320, lora_extra_vocab_size=0, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, 
show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 03-31 00:53:00 [config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 03-31 00:53:07 [config.py:583] This model supports multiple tasks: {'reward', 'classify', 'score', 'generate', 'embed'}. Defaulting to 'generate'.
WARNING 03-31 00:53:07 [arg_utils.py:1765] ['Phi4MMForCausalLM'] is not supported by the V1 Engine. Falling back to V0.
WARNING 03-31 00:53:07 [arg_utils.py:1652] The model has a long context length (131072). This may causeOOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1059, in <module>
uvloop.run(run_server(args))
File "/opt/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/opt/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1012, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 141, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 161, in build_async_engine_client_from_engine_args
vllm_config = engine_args.create_engine_config(usage_context=usage_context)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1332, in create_engine_config
lora_config = LoRAConfig(
^^^^^^^^^^^
File "<string>", line 11, in __init__
File "/opt/venv/lib/python3.12/site-packages/vllm/config.py", line 2333, in __post_init__
raise ValueError(
ValueError: lora_extra_vocab_size (0) must be one of (256, 512).
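To see where this constraint comes from in the installed vLLM (0.8.1 here), the validation raised in the traceback can be printed directly; this is just a diagnostic sketch, not a fix:

# Diagnostic sketch: print the LoRAConfig validation that raises the error above,
# to see which lora_extra_vocab_size values this vLLM build accepts.
import inspect
from vllm.config import LoRAConfig  # same class as in the traceback

print(inspect.getsource(LoRAConfig.__post_init__))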
Second Issue: Loading lora speech failed: No adapter found
So I changed the value to 256 and ran the command again. This time, I got a different error: ValueError: Loading lora speech failed: No adapter found for /root/.cache/huggingface/hub/models--microsoft--Phi-4-multimodal-instruct/snapshots/0ae13bd0f508a906f8b8288fc5e36b01b903c132/speech-lora.
$ python -m vllm.entrypoints.openai.api_server \
--model 'microsoft/Phi-4-multimodal-instruct' \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--distributed-executor-backend mp \
--dtype auto \
--trust-remote-code \
--max-model-len 131072 \
--enable-lora \
--max-lora-rank 320 \
--lora-extra-vocab-size 256 \
--limit-mm-per-prompt audio=3,image=3 \
--max-loras 2 \
--lora-modules speech=${MODEL_DIR}/speech-lora vision=${MODEL_DIR}/vision-lora
INFO 03-31 00:57:27 [__init__.py:256] Automatically detected platform cuda.
INFO 03-31 00:57:29 [api_server.py:977] vLLM API server version 0.8.1
INFO 03-31 00:57:29 [api_server.py:978] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='speech', path='/root/.cache/huggingface/hub/models--microsoft--Phi-4-multimodal-instruct/snapshots/0ae13bd0f508a906f8b8288fc5e36b01b903c132/speech-lora', base_model_name=None), LoRAModulePath(name='vision', path='/root/.cache/huggingface/hub/models--microsoft--Phi-4-multimodal-instruct/snapshots/0ae13bd0f508a906f8b8288fc5e36b01b903c132/vision-lora', base_model_name=None)], prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='microsoft/Phi-4-multimodal-instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=131072, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend='mp', pipeline_parallel_size=1, tensor_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'audio': 3, 'image': 3}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=True, enable_lora_bias=False, max_loras=2, max_lora_rank=320, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, 
show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 03-31 00:57:29 [config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 03-31 00:57:36 [config.py:583] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
WARNING 03-31 00:57:36 [arg_utils.py:1765] ['Phi4MMForCausalLM'] is not supported by the V1 Engine. Falling back to V0.
WARNING 03-31 00:57:36 [arg_utils.py:1652] The model has a long context length (131072). This may causeOOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
INFO 03-31 00:57:36 [api_server.py:241] Started engine process with PID 1273
INFO 03-31 00:57:40 [__init__.py:256] Automatically detected platform cuda.
INFO 03-31 00:57:41 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.1) with config: model='microsoft/Phi-4-multimodal-instruct', speculative_config=None, tokenizer='microsoft/Phi-4-multimodal-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=microsoft/Phi-4-multimodal-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 03-31 00:57:42 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-31 00:57:43 [cuda.py:285] Using Flash Attention backend.
INFO 03-31 00:57:44 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-31 00:57:44 [model_runner.py:1110] Starting to load model microsoft/Phi-4-multimodal-instruct...
INFO 03-31 00:57:44 [cuda.py:269] Cannot use FlashAttention-2 backend for head size 72.
INFO 03-31 00:57:44 [cuda.py:282] Using XFormers backend.
INFO 03-31 00:57:45 [weight_utils.py:257] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.44it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:01<00:00, 1.52it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 2.11it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 1.90it/s]
INFO 03-31 00:57:47 [loader.py:429] Loading weights took 1.62 seconds
WARNING 03-31 00:57:47 [model_runner.py:1120] Regarding multimodal models, vLLM currently only supports adding LoRA to language model.
INFO 03-31 00:57:47 [punica_selector.py:18] Using PunicaWrapperGPU.
WARNING 03-31 00:57:47 [models.py:478] Regarding multimodal models, vLLM currently only supports adding LoRA to language model, vision_encoder.img_processor.encoder.layers.0.self_attn.qkv_proj will be ignored.
WARNING 03-31 00:57:47 [models.py:478] Regarding multimodal models, vLLM currently only supports adding LoRA to language model, vision_encoder.img_processor.encoder.layers.0.self_attn.out_proj will be ignored.
... REDACTED ...
WARNING 03-31 00:57:47 [models.py:478] Regarding multimodal models, vLLM currently only supports adding LoRA to language model, vision_encoder.img_processor.encoder.layers.25.self_attn.out_proj will be ignored.
WARNING 03-31 00:57:47 [models.py:478] Regarding multimodal models, vLLM currently only supports adding LoRA to language model, vision_encoder.img_processor.encoder.layers.25.mlp.fc1 will be ignored.
WARNING 03-31 00:57:47 [models.py:478] Regarding multimodal models, vLLM currently only supports adding LoRA to language model, vision_encoder.img_processor.encoder.layers.25.mlp.fc2 will be ignored.
INFO 03-31 00:57:47 [model_runner.py:1146] Model loading took 10.5605 GB and 2.751561 seconds
INFO 03-31 00:58:06 [worker.py:267] Memory profiling takes 19.20 seconds
INFO 03-31 00:58:06 [worker.py:267] the current vLLM instance can use total_gpu_memory (79.25GiB) x gpu_memory_utilization (0.90) = 71.33GiB
INFO 03-31 00:58:06 [worker.py:267] model weights take 10.56GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 8.76GiB; the rest of the memory reserved for KV Cache is 51.91GiB.
INFO 03-31 00:58:06 [executor_base.py:111] # cuda blocks: 26577, # CPU blocks: 2048
INFO 03-31 00:58:06 [executor_base.py:116] Maximum concurrency for 131072 tokens per request: 3.24x
INFO 03-31 00:58:09 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:20<00:00, 1.68it/s]
INFO 03-31 00:58:30 [model_runner.py:1570] Graph capturing finished in 21 secs, took 0.51 GiB
INFO 03-31 00:58:30 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 42.97 seconds
[rank0]:[W331 00:58:31.243234746 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1059, in <module>
uvloop.run(run_server(args))
File "/opt/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/opt/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1016, in run_server
await init_app_state(engine_client, model_config, app.state, args)
File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 892, in init_app_state
await state.openai_serving_models.init_static_loras()
File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_models.py", line 96, in init_static_loras
raise ValueError(load_result.message)
ValueError: Loading lora speech failed: No adapter found for /root/.cache/huggingface/hub/models--microsoft--Phi-4-multimodal-instruct/snapshots/0ae13bd0f508a906f8b8288fc5e36b01b903c132/speech-lora
But if you look at the files:
$ tree /root/.cache/huggingface/hub/models--microsoft--Phi-4-multimodal-instruct/snapshots/0ae13bd0f508a906f8b8288fc5e36b01b903c132
/root/.cache/huggingface/hub/models--microsoft--Phi-4-multimodal-instruct/snapshots/0ae13bd0f508a906f8b8288fc5e36b01b903c132
|-- added_tokens.json -> ../../blobs/af52cde61cc39aca58f7b7f04116b317116e75dd
|-- config.json -> ../../blobs/6f294fb015412e07443c36ecc01233c1211f8d0d
|-- configuration_phi4mm.py -> ../../blobs/0abf4196bcf52128567ad4008062fcb2c0c45e33
|-- generation_config.json -> ../../blobs/98769448b636893c5e5780bd4beb070e95a70b48
|-- merges.txt -> ../../blobs/dcecc4524288b351bbd0da8028e74e9b5bcdb9b5
|-- model-00001-of-00003.safetensors -> ../../blobs/c46bb03332d82f6a3eaf85bd20af388dd4d4d68b198c2203c965c7381a466094
|-- model-00002-of-00003.safetensors -> ../../blobs/b3e812c0c8acef4e7f5e34d6c9f77a7640ee4a2b93ea351921365ac62f19918d
|-- model-00003-of-00003.safetensors -> ../../blobs/7be96b7339303752634b202d3f377bcf312a03046586eca6cea23347ace1e65a
|-- model.safetensors.index.json -> ../../blobs/dde52c7dc181beab5d91b97ba009ef4fc6626594
|-- preprocessor_config.json -> ../../blobs/6dd46225a7f8996f811adb74f920995ef206260a
|-- special_tokens_map.json -> ../../blobs/330140f0678dc92ae683e1c1cccffc6a001251ae
|-- speech-lora
| `-- adapter_model.safetensors -> ../../../blobs/1c2237461a4d1f9292cd128147bd3f0f70326a48d5d79c8e0f7583b26c095b30
|-- tokenizer.json -> ../../blobs/4c1b9f641d4f8b7247b8d5007dd3b6a9f6a87cb5123134fe0d326f14d10c0585
|-- tokenizer_config.json -> ../../blobs/eb04aec9ccf2d3bc9f8d28bc8d4e2d4c3a144460
|-- vision-lora
| `-- adapter_model.safetensors -> ../../../blobs/1620b16722edf701038bf66e3cd46412c7cc5458e58df89e9f92cedb71fcbde8
`-- vocab.json -> ../../blobs/ea953a43348cdb3776cb7fd9ea02e3784febde34
2 directories, 16 files
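If I understand vLLM's LoRA loading correctly, it expects a PEFT-style adapter directory (an adapter_config.json alongside the adapter weights), while the tree above shows only adapter_model.safetensors inside speech-lora and vision-lora. A minimal check along those lines, assuming the same snapshot path as above:

# Minimal sketch: check whether each LoRA directory contains the PEFT
# adapter_config.json that (as far as I can tell) vLLM looks for when
# resolving an adapter. Paths assume the snapshot shown in the tree above.
import os

snapshot = ("/root/.cache/huggingface/hub/"
            "models--microsoft--Phi-4-multimodal-instruct/snapshots/"
            "0ae13bd0f508a906f8b8288fc5e36b01b903c132")
for adapter in ("speech-lora", "vision-lora"):
    adapter_dir = os.path.join(snapshot, adapter)
    print(adapter, "->", sorted(os.listdir(adapter_dir)))
    print("  adapter_config.json present:",
          os.path.isfile(os.path.join(adapter_dir, "adapter_config.json")))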
What works?
This command works for me:
$ python -m vllm.entrypoints.openai.api_server \
--model 'microsoft/Phi-4-multimodal-instruct' \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--distributed-executor-backend mp \
--dtype auto \
--trust-remote-code \
--max-model-len 131072 \
--enable-lora \
--max-lora-rank 320 \
--lora-extra-vocab-size 256 \
--limit-mm-per-prompt audio=3,image=3 \
--max-loras 2
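Once the server is up with this command, it responds as a normal OpenAI-compatible endpoint. As a quick sanity check, I can list the registered models (assuming the default localhost:8000 and that requests is installed):

# Sanity check against the running server: list the models it exposes.
# With the working command above (no --lora-modules), only the base model
# should be listed; 'speech' and 'vision' would only appear if the LoRA
# modules had been registered successfully.
import requests

resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])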