Does FlashMLA support the fp8 KV cache dtype, and how do I enable FlashMLA?

#6
by CharlesLincoln - opened

Thank you for your work! I built vLLM myself following the README.md, but when I tried to use --kv-cache-dtype, the log shows that vLLM still uses the Triton MLA backend.
Below are the command I used and the vLLM log.

CUDA_DEVICE_ORDER=PCI_BUS_ID NCCL_P2P_DISABLE=1 VLLM_USE_V1=0 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_MARLIN_USE_ATOMIC_ADD=1 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --max-model-len 30000 --max-seq-len-to-capture 30000 --enable-chunked-prefill --enable-prefix-caching --trust-remote-code --tensor-parallel-size 8 --kv-cache-dtype fp8_e4m3 --gpu-memory-utilization 0.95 --served-model-name deepseek-v3-0324 --model /workspace/huggingface_models/cognitivecomputations/DeepSeek-V3-0324-AWQ --quantization awq_marlin --max_num_seqs 6 --max_num_batched_tokens 150000

INFO 04-04 22:34:57 [api_server.py:1034] vLLM API server version 0.1.dev1+ga7a3fd1.d20250403
INFO 04-04 22:34:57 [api_server.py:1035] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/workspace/huggingface_models/cognitivecomputations/DeepSeek-V3-0324-AWQ', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='fp8_e4m3', max_model_len=30000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=8, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=True, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=150000, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=6, max_logprobs=20, disable_log_stats=False, quantization='awq_marlin', rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=30000, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['deepseek-v3-0324'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 04-04 22:34:57 [config.py:209] Replacing legacy 'type' key with 'rope_type'
INFO 04-04 22:35:05 [config.py:598] This model supports multiple tasks: {'generate', 'embed', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 04-04 22:35:06 [awq_marlin.py:113] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 04-04 22:35:06 [config.py:1213] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 04-04 22:35:06 [config.py:1591] Defaulting to use mp for distributed inference
INFO 04-04 22:35:06 [config.py:1772] Chunked prefill is enabled with max_num_batched_tokens=150000.
INFO 04-04 22:35:06 [api_server.py:246] Started engine process with PID 4033746
INFO 04-04 22:35:10 [__init__.py:239] Automatically detected platform cuda.
INFO 04-04 22:35:13 [llm_engine.py:242] Initializing a V0 LLM engine (v0.1.dev1+ga7a3fd1.d20250403) with config: model='/workspace/huggingface_models/cognitivecomputations/DeepSeek-V3-0324-AWQ', speculative_config=None, tokenizer='/workspace/huggingface_models/cognitivecomputations/DeepSeek-V3-0324-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=30000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=deepseek-v3-0324, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[8,4,2,1],"max_capture_size":8}, use_cached_outputs=True,
WARNING 04-04 22:35:13 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 20 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py:110: UserWarning: Failed to get the IP address, using 0.0.0.0 by default. The value can be set by the environment variable VLLM_HOST_IP or HOST_IP.
get_ip(), get_open_port())
INFO 04-04 22:35:14 [cuda.py:193] Using Triton MLA backend.
...
...
ERROR 04-04 22:35:25 [engine.py:448] TritonMLA with FP8 KV cache not yet supported
ERROR 04-04 22:35:25 [engine.py:448] Traceback (most recent call last):
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 04-04 22:35:25 [engine.py:448] engine = MQLLMEngine.from_vllm_config(
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
ERROR 04-04 22:35:25 [engine.py:448] return cls(
ERROR 04-04 22:35:25 [engine.py:448] ^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
ERROR 04-04 22:35:25 [engine.py:448] self.engine = LLMEngine(*args, **kwargs)
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 281, in __init__
ERROR 04-04 22:35:25 [engine.py:448] self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 286, in __init__
ERROR 04-04 22:35:25 [engine.py:448] super().__init__(*args, **kwargs)
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-04 22:35:25 [engine.py:448] self._init_executor()
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 125, in _init_executor
ERROR 04-04 22:35:25 [engine.py:448] self._run_workers("load_model",
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 04-04 22:35:25 [engine.py:448] driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/utils.py", line 2347, in run_method
ERROR 04-04 22:35:25 [engine.py:448] return func(*args, **kwargs)
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/worker/worker.py", line 183, in load_model
ERROR 04-04 22:35:25 [engine.py:448] self.model_runner.load_model()
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1113, in load_model
ERROR 04-04 22:35:25 [engine.py:448] self.model = get_model(vllm_config=self.vllm_config)
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 04-04 22:35:25 [engine.py:448] return loader.load_model(vllm_config=vllm_config)
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 441, in load_model
ERROR 04-04 22:35:25 [engine.py:448] model = _initialize_model(vllm_config=vllm_config)
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 127, in _initialize_model
ERROR 04-04 22:35:25 [engine.py:448] return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 680, in __init__
ERROR 04-04 22:35:25 [engine.py:448] self.model = DeepseekV2Model(vllm_config=vllm_config,
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 151, in __init__
ERROR 04-04 22:35:25 [engine.py:448] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 619, in __init__
ERROR 04-04 22:35:25 [engine.py:448] self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 610, in make_layers
ERROR 04-04 22:35:25 [engine.py:448] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 621, in <lambda>
ERROR 04-04 22:35:25 [engine.py:448] lambda prefix: DeepseekV2DecoderLayer(
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 509, in __init__
ERROR 04-04 22:35:25 [engine.py:448] self.self_attn = attn_cls(
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 441, in __init__
ERROR 04-04 22:35:25 [engine.py:448] self.mla_attn = Attention(
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/attention/layer.py", line 130, in __init__
ERROR 04-04 22:35:25 [engine.py:448] self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
ERROR 04-04 22:35:25 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-04 22:35:25 [engine.py:448] File "/data/anaconda3/envs/vllm_dev/lib/python3.12/site-packages/vllm/attention/backends/triton_mla.py", line 63, in __init__
ERROR 04-04 22:35:25 [engine.py:448] raise NotImplementedError(
ERROR 04-04 22:35:25 [engine.py:448] NotImplementedError: TritonMLA with FP8 KV cache not yet supported
ERROR 04-04 22:35:25 [multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 4034012 died, exit code: -15
INFO 04-04 22:35:25 [multiproc_worker_utils.py:124] Killing local vLLM worker processes

Try using the flashinfer backend (pip install flashinfer-python).

If desired, you can also manually set the backend of your choice by setting the environment variable VLLM_ATTENTION_BACKEND to one of the following options: FLASH_ATTN, FLASHINFER, or XFORMERS.
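
For example, something like this (a rough sketch only; the model path is a placeholder, and whether the chosen backend is actually usable still depends on the model, the hardware, and the vLLM version):

export VLLM_ATTENTION_BACKEND=FLASHINFER   # or FLASH_ATTN / XFORMERS
python -m vllm.entrypoints.openai.api_server --model /path/to/model --port 8000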
Cognitive Computations org

MLA doesn't support FP8 in vLLM currently.

v2ray changed discussion status to closed

Try using the flashinfer backend (pip install flashinfer-python).

If desired, you can also manually set the backend of your choice by setting the environment variable VLLM_ATTENTION_BACKEND to one of the following options: FLASH_ATTN, FLASHINFER, or XFORMERS.

Thanks for your advice. I noticed that FLASHMLA is also listed on the page https://docs.vllm.ai/en/stable/serving/env_vars.html,
so I tried setting the environment variable VLLM_ATTENTION_BACKEND=FLASHMLA, but it did not work. vLLM still uses TritonMLA.
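
To double-check which backend is actually selected, I just grep the startup log for the backend line (a rough check; the exact wording of the log message may differ between vLLM versions), and it still shows the Triton MLA backend:

VLLM_ATTENTION_BACKEND=FLASHMLA python -m vllm.entrypoints.openai.api_server --model /workspace/huggingface_models/cognitivecomputations/DeepSeek-V3-0324-AWQ --trust-remote-code 2>&1 | grep -i "backend"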

MLA doesn't support FP8 in vLLM currently.

Thanks! I would like to know whether FLASHMLA is enabled automatically, or how it should be enabled. I am using the V1 engine and the VLLM_ATTENTION_BACKEND=FLASHMLA environment variable, but the vLLM log still shows that TritonMLA is being used.

Cognitive Computations org

@CharlesLincoln What is your vLLM version, and can you provide your environment details, such as the full command you used, the number of GPUs, and the GPU type?

@CharlesLincoln What is your vLLM version, and can you provide your environment details, such as the full command you used, the number of GPUs, and the GPU type?

I forked the vLLM repository and merged the three pull requests you mentioned in the README. The new repository URL is https://github.com/UnlimitedWand/vllm.git
I commented out the sm90 code in cmake/external_projects/flashmla.cmake, then built it with the following commands, using torch 2.8.0.dev20250401, CUDA 12.8, and 8x NVIDIA A100:

python use_existing_torch.py
pip install -r requirements/build.txt
pip install . --no-build-isolation
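
As an extra sanity check (just a diagnostic, not something from the README): as far as I know the upstream FlashMLA kernels target SM90 (Hopper), while the A100 is SM80, so I also printed the compute capability of the cards. The compute_cap query needs a reasonably recent NVIDIA driver:

nvidia-smi --query-gpu=name,compute_cap --format=csv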

run command:

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_P2P_DISABLE=1 
export VLLM_USE_V1=1 
export VLLM_WORKER_MULTIPROC_METHOD=spawn 
export VLLM_MARLIN_USE_ATOMIC_ADD=1 
export VLLM_ATTENTION_BACKEND=FLASHMLA
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --max-model-len 120000 --max-seq-len-to-capture 120000 --enable-chunked-prefill --enable-prefix-caching --trust-remote-code --tensor-parallel-size 8 --gpu-memory-utilization 0.95 --served-model-name deepseek-v3-0324 --model /workspace/huggingface_models/cognitivecomputations/DeepSeek-V3-0324-AWQ --quantization awq_marlin --max_num_seqs 5 --max_num_batched_tokens 120000
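
Another check that might help, assuming the layout of recent vLLM where vllm.attention.ops.flashmla exposes an is_flashmla_supported() helper (adjust the import if this fork lays things out differently), is to verify that the FlashMLA ops were actually compiled into the wheel:

python -c "from vllm.attention.ops.flashmla import is_flashmla_supported; print(is_flashmla_supported())"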
Cognitive Computations org

I suspect it's your env var settings. Can you try removing the NCCL P2P disable? What's the reason you enabled it?

I suspect it's your env var settings. Can you try removing the NCCL P2P disable? What's the reason you enabled it?

When P2P is enabled, vLLM gets stuck and never loads the model files, and the Python process drives one CPU core to 100% utilization. After searching online, I found someone saying that disabling P2P lets the server start; I tried that, and it did indeed start normally.
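
In case it helps, the GPU interconnect topology, and whether PyTorch reports peer access between two of the cards, can be inspected like this (purely a diagnostic, not a fix; the second line assumes at least two visible GPUs):

nvidia-smi topo -m
python -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"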

Cognitive Computations org

Can you try the wheel I provided?
