FlashInfer requires sm75+
Has anyone faced this error? I installed flash-attn and everything else as described in the Model Card.
The exact error:
Parse safetensors files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 8.11it/s]
[W806 06:55:23.247131909 ProcessGroupNCCL.cpp:915] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0   (repeated 5×)
(EngineCore_0 pid=653876) DEBUG Attempting to acquire lock 140734698624528 on
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:01<00:02, 1.33s/it]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:02<00:01, 1.16s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:03<00:00, 1.15s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:03<00:00, 1.17s/it]
(EngineCore_0 pid=653876)
(EngineCore_0 pid=653876) ERROR EngineCore failed to start.
(EngineCore_0 pid=653876) Traceback (most recent call last):
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=653876) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=653876) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=653876) super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_0 pid=653876) self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=653876) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=653876) self._init_executor()
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_0 pid=653876) self.collective_rpc("load_model")
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=653876) answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=653876) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_0 pid=653876) return func(*args, **kwargs)
(EngineCore_0 pid=653876) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 211, in load_model
(EngineCore_0 pid=653876) self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1946, in load_model
(EngineCore_0 pid=653876) self.model = model_loader.load_model(
(EngineCore_0 pid=653876) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(EngineCore_0 pid=653876) process_weights_after_loading(model, model_config, target_device)
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 112, in process_weights_after_loading
(EngineCore_0 pid=653876) quant_method.process_weights_after_loading(module)
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/mxfp4.py", line 257, in process_weights_after_loading
(EngineCore_0 pid=653876) shuffle_matrix_sf_a(w13_weight_scale[i].view(torch.uint8),
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/flashinfer/fp4_quantization.py", line 380, in shuffle_matrix_sf_a
(EngineCore_0 pid=653876) return nvfp4_block_scale_interleave(w_shuffled)
(EngineCore_0 pid=653876) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/flashinfer/fp4_quantization.py", line 309, in nvfp4_block_scale_interleave
(EngineCore_0 pid=653876) return get_fp4_quantization_sm100_module().nvfp4_block_scale_interleave_sm100(
(EngineCore_0 pid=653876) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/flashinfer/fp4_quantization.py", line 96, in get_fp4_quantization_sm100_module
(EngineCore_0 pid=653876) module = gen_fp4_quantization_sm100_module().build_and_load()
(EngineCore_0 pid=653876) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/flashinfer/fp4_quantization.py", line 66, in gen_fp4_quantization_sm100_module
(EngineCore_0 pid=653876) return gen_jit_spec(
(EngineCore_0 pid=653876) ^^^^^^^^^^^^^
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/flashinfer/jit/core.py", line 142, in gen_jit_spec
(EngineCore_0 pid=653876) check_cuda_arch()
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/flashinfer/jit/core.py", line 52, in check_cuda_arch
(EngineCore_0 pid=653876) raise RuntimeError("FlashInfer requires sm75+")
(EngineCore_0 pid=653876) RuntimeError: FlashInfer requires sm75+
(EngineCore_0 pid=653876) Process EngineCore_0:
(EngineCore_0 pid=653876) Traceback (most recent call last):
(EngineCore_0 pid=653876) File "home/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_0 pid=653876) self.run()
(EngineCore_0 pid=653876) File "home/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_0 pid=653876) self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=653876) File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_0 pid=653876) raise e
(EngineCore_0 pid=653876)   ... (same traceback as above) ...
(EngineCore_0 pid=653876) RuntimeError: FlashInfer requires sm75+
Traceback (most recent call last):
File "home//SynthQA/qa_generation/main.py", line 24, in <module>
generator = QAGenerator(CFG_PATH)
^^^^^^^^^^^^^^^^^^^^^
File "home//SynthQA/qa_generation/src/qa_generator.py", line 162, in __init__
self.runners[m["id"]] = ModelRunner(m)
^^^^^^^^^^^^^^
File "home//SynthQA/qa_generation/src/model_runner.py", line 184, in __init__
raise RuntimeError(f"vLLM worker process failed to initialize:\n{error_trace}")
RuntimeError: vLLM worker process failed to initialize:
Traceback (most recent call last):
File "home//SynthQA/qa_generation/src/model_runner.py", line 81, in _worker
llm = LLM(**llm_kwargs)
^^^^^^^^^^^^^^^^^
File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 277, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 494, in from_engine_args
return engine_cls.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py", line 127, in from_vllm_config
return cls(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py", line 104, in __init__
self.engine_core = EngineCoreClient.make_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 79, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 566, in __init__
super().__init__(
File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 421, in __init__
with launch_core_engines(vllm_config, executor_class,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "home/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
next(self.gen)
File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 697, in launch_core_engines
wait_for_engine_startup(
File "home//.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
Set VLLM_USE_FLASHINFER_SAMPLER=0 and it's done.

It worked for online inference, thanks @Meteonis!
I did this:

```
VLLM_USE_FLASHINFER_SAMPLER=0 vllm serve openai/gpt-oss-20b --async-scheduling
```

The command is from the vLLM guide: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#h100-h200
I also tried offline inference:

```
VLLM_USE_FLASHINFER_SAMPLER=0 \
VLLM_USE_TRTLLM_ATTENTION=1 \
VLLM_USE_TRTLLM_DECODE_ATTENTION=1 \
VLLM_DISABLE_FLASHINFER=1 \
python generation/main.py
```

but I am getting the same error:
(EngineCore_0 pid=679714) raise RuntimeError("FlashInfer requires sm75+")
(EngineCore_0 pid=679714) RuntimeError: FlashInfer requires sm75+
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
There are two ways to get past it:
1. Disable FlashInfer altogether
By forcing vLLM to use another attention implementation (e.g. flash-attn or the built-in Torch SDPA kernels), you never invoke FlashInfer’s JIT path. You can do this either via environment variables or CLI flags:
```
# environment-variable approach
export VLLM_DISABLE_FLASHINFER=1
export VLLM_ATTENTION_BACKEND=FLASH_ATTN

# then start your server normally:
vllm serve openai/gpt-oss-20b --async-scheduling
```
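For the offline path, here is a minimal sketch of the same idea in Python, assuming these vLLM env vars are honored at engine start-up (set them before constructing the `LLM`):

```python
import os

# Same flags as above: skip FlashInfer and fall back to flash-attn.
# Assumption: vLLM reads these env vars when the engine starts.
os.environ["VLLM_DISABLE_FLASHINFER"] = "1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"
os.environ["VLLM_USE_FLASHINFER_SAMPLER"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b", max_model_len=32768)
outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=50))
print(outputs[0].outputs[0].text)
```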
2. Rebuild FlashInfer for your GPU’s compute capability
If you do want to keep FlashInfer (it can be faster on supported hardware), you need to make sure nvcc knows your GPU’s architecture so that it will generate the right kernels. For example, on a 3090 (CC 8.6) you’d do:
```
# uninstall any existing prebuilt flashinfer wheel
pip uninstall -y flashinfer-python

# tell PyTorch/NVCC which SM architectures to target:
export TORCH_CUDA_ARCH_LIST="8.6"

# reinstall flashinfer so it JIT-compiles for sm86
pip install flashinfer-python --extra-index-url https://download.pytorch.org/whl/cu121
```
Then verify with a tiny script (note: per the traceback above, `check_cuda_arch` lives in `flashinfer.jit.core`, not under vLLM):

```python
from flashinfer.jit.core import check_cuda_arch

check_cuda_arch()  # should now pass without raising
```
Once that succeeds, re-run `python generation/main.py` and the “FlashInfer requires sm75+” error should disappear (maybe!).
FlashInfer’s JIT routines insist on sm75+ because they rely on tensor cores and block-interleaving schemes only available on newer GPUs. If CUDA can’t see your GPU’s actual compute capability, or if it’s genuinely older than 7.5, you must either disable FlashInfer (Option 1) or rebuild for your arch (Option 2).
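To see which compute capability CUDA actually reports, a quick sanity check with PyTorch’s standard `torch.cuda.get_device_capability`:

```python
import torch

# FlashInfer's JIT check requires compute capability >= (7, 5).
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU reports sm{major}{minor}")  # e.g. sm90 on an H100
assert (major, minor) >= (7, 5), "GPU older than sm75; FlashInfer JIT will refuse to build"
```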
I have installed flashinfer-python, but I am still getting the error.
Also, when I use this code:
```python
import os
from vllm import LLM, SamplingParams

# Key env vars (adjusted for compatibility)
os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"
os.environ["VLLM_CONFIGURE_LOGGING"] = "0"
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"

# For openai oss - enable necessary features
os.environ["VLLM_USE_FLASHINFER_SAMPLER"] = "0"  # Keep as is
os.environ["VLLM_DISABLE_FLASHINFER"] = "0"  # Keep as is (enables flashinfer)
os.environ["VLLM_USE_TRTLLM_ATTENTION"] = "0"
os.environ["VLLM_USE_TRTLLM_DECODE_ATTENTION"] = "0"
os.environ["VLLM_USE_TRTLLM_CONTEXT_ATTENTION"] = "0"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # Changed to FLASHINFER for better MXFP4/MoE support
os.environ["TORCH_CUDA_ARCH_LIST"] = "9.0"
os.environ["VLLM_DTYPE"] = "bfloat16"
os.environ["VLLM_USE_FLASHINFER_MXFP4_BF16_MOE"] = "1"  # Enable this (was disabled) for MXFP4 bf16 MoE handling

# Model config
model_name = "openai/gpt-oss-20b"
max_model_len = 32768
tensor_parallel_size = 1
dtype = "bfloat16"  # Explicitly bf16, as required

# Simple prompt
prompt = "Hello, world! This is a test prompt."

try:
    # Initialize vLLM LLM
    llm = LLM(
        model=model_name,
        max_model_len=max_model_len,
        tensor_parallel_size=tensor_parallel_size,
        trust_remote_code=True,
        dtype=dtype,
        enable_prefix_caching=True,
        enable_chunked_prefill=True,
    )
    print("Model loaded successfully!")

    # Sampling params
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.85,
        top_k=40,
        max_tokens=50,
    )

    # Generate
    outputs = llm.generate([prompt], sampling_params)
    generated_text = outputs[0].outputs[0].text
    print(f"Generated output: {generated_text}")
except Exception as e:
    import traceback
    print("Error during execution:")
    traceback.print_exc()
```
The error is not making sense to me; the dtype is bf16 too, right?
Error:
INFO 08-06 17:36:22 [utils.py:326] non-default args: {'model': 'openai/gpt-oss-20b', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 32768, 'enable_prefix_caching': True, 'disable_log_stats': True, 'enable_chunked_prefill': True}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
INFO 08-06 17:36:23 [config.py:726] Resolved architecture: GptOssForCausalLM
Parse safetensors files: 100%|██████████| 3/3 [00:00<00:00, 4.27it/s]
INFO 08-06 17:36:25 [config.py:1759] Using max model len 32768
WARNING 08-06 17:36:25 [config.py:1198] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-06 17:36:26 [config.py:2588] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 08-06 17:36:26 [config.py:244] Overriding cuda graph sizes to [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024]
(EngineCore_0 pid=1083852) INFO 08-06 17:36:27 [core.py:654] Waiting for init message from front-end.
(EngineCore_0 pid=1083852) INFO 08-06 17:36:27 [core.py:73] Initializing a V1 LLM engine (v0.10.2.dev2+gf5635d62e.d20250806) with config: model='openai/gpt-oss-20b', speculative_config=None, tokenizer='openai/gpt-oss-20b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='openai'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=openai/gpt-oss-20b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[1024,1008,992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":1024,"local_cache_dir":null}
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[W806 17:36:30.163554518 ProcessGroupNCCL.cpp:915] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0   (repeated 5×)
(EngineCore_0 pid=1083852) INFO 08-06 17:36:30 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=1083852) WARNING 08-06 17:36:31 [topk_topp_sampler.py:53] FlashInfer is available, but it is not enabled. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please set VLLM_USE_FLASHINFER_SAMPLER=1.
(EngineCore_0 pid=1083852) INFO 08-06 17:36:31 [gpu_model_runner.py:1913] Starting to load model openai/gpt-oss-20b...
(EngineCore_0 pid=1083852) INFO 08-06 17:36:31 [gpu_model_runner.py:1945] Loading model from scratch...
(EngineCore_0 pid=1083852) INFO 08-06 17:36:31 [cuda.py:276] Using FlashInfer backend on V1 engine.
(EngineCore_0 pid=1083852) INFO 08-06 17:36:32 [weight_utils.py:296] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.06it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:01<00:00, 1.15it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00, 1.16it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00, 1.15it/s]
(EngineCore_0 pid=1083852)
(EngineCore_0 pid=1083852) INFO 08-06 17:36:35 [default_loader.py:262] Loading weights took 2.74 seconds
(EngineCore_0 pid=1083852) INFO 08-06 17:36:35 [mxfp4.py:176] Shuffling MoE weights, it might take a while...
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/jit/cpp_ext.py", line 199, in run_ninja
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] subprocess.run(
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/home/hrithik_sagar/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/subprocess.py", line 571, in run
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] raise CalledProcessError(retcode, process.args,
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] subprocess.CalledProcessError: Command '['ninja', '-v', '-C', '/home/hrithik_sagar/.cache/flashinfer/90/cached_ops', '-f', '/home/hrithik_sagar/.cache/flashinfer/90/cached_ops/fp4_quantization_sm100/build.ninja']' returned non-zero exit status 1.
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718]
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] The above exception was the direct cause of the following exception:
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718]
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] self._init_executor()
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] self.collective_rpc("load_model")
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] return func(*args, **kwargs)
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 211, in load_model
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1946, in load_model
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] self.model = model_loader.load_model(
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] process_weights_after_loading(model, model_config, target_device)
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 112, in process_weights_after_loading
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] quant_method.process_weights_after_loading(module)
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/mxfp4.py", line 257, in process_weights_after_loading
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] shuffle_matrix_sf_a(w13_weight_scale[i].view(torch.uint8),
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/fp4_quantization.py", line 380, in shuffle_matrix_sf_a
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] return nvfp4_block_scale_interleave(w_shuffled)
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/fp4_quantization.py", line 309, in nvfp4_block_scale_interleave
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] return get_fp4_quantization_sm100_module().nvfp4_block_scale_interleave_sm100(
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/fp4_quantization.py", line 96, in get_fp4_quantization_sm100_module
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] module = gen_fp4_quantization_sm100_module().build_and_load()
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/jit/core.py", line 123, in build_and_load
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] self.build(verbose)
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/jit/core.py", line 115, in build
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] run_ninja(jit_env.FLASHINFER_JIT_DIR, self.ninja_path, verbose)
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/jit/cpp_ext.py", line 211, in run_ninja
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] raise RuntimeError(msg) from e
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] RuntimeError: Ninja build failed. Ninja output:
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] ninja: Entering directory `/home/hrithik_sagar/.cache/flashinfer/90/cached_ops'
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output fp4_quantization_sm100/quantization.cuda.o.d -DTORCH_EXTENSION_NAME=fp4_quantization_sm100 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1018\" -D_GLIBCXX_USE_CXX11_ABI=1 -I/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc/nv_internal -I/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc/nv_internal/include -isystem /home/hrithik_sagar/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/include/python3.12 -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/torch/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -O3 -std=c++17 --threads=32 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -gencode=arch=compute_100a,code=sm_100a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -DENABLE_BF16 -DENABLE_FP8 -c /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc/nv_internal/cpp/kernels/quantization.cu -o fp4_quantization_sm100/quantization.cuda.o
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] FAILED: fp4_quantization_sm100/quantization.cuda.o
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output fp4_quantization_sm100/quantization.cuda.o.d -DTORCH_EXTENSION_NAME=fp4_quantization_sm100 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1018\" -D_GLIBCXX_USE_CXX11_ABI=1 -I/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc/nv_internal -I/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc/nv_internal/include -isystem /home/hrithik_sagar/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/include/python3.12 -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/torch/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -O3 -std=c++17 --threads=32 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -gencode=arch=compute_100a,code=sm_100a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -DENABLE_BF16 -DENABLE_FP8 -c /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc/nv_internal/cpp/kernels/quantization.cu -o fp4_quantization_sm100/quantization.cuda.o
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] nvcc fatal : Unsupported gpu architecture 'compute_100a'
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] [2/3] c++ -MMD -MF fp4_quantization_sm100/fp4Op.o.d -DTORCH_EXTENSION_NAME=fp4_quantization_sm100 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1018\" -D_GLIBCXX_USE_CXX11_ABI=1 -I/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc/nv_internal -I/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc/nv_internal/include -isystem /home/hrithik_sagar/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/include/python3.12 -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/torch/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/spdlog/include -fPIC -O3 -std=c++17 -Wno-switch-bool -DENABLE_BF16 -DENABLE_FP8 -c /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc/nv_internal/tensorrt_llm/thop/fp4Op.cpp -o fp4_quantization_sm100/fp4Op.o
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] FAILED: fp4_quantization_sm100/fp4Op.o
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] c++ -MMD -MF fp4_quantization_sm100/fp4Op.o.d -DTORCH_EXTENSION_NAME=fp4_quantization_sm100 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1018\" -D_GLIBCXX_USE_CXX11_ABI=1 -I/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc/nv_internal -I/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc/nv_internal/include -isystem /home/hrithik_sagar/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/include/python3.12 -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/torch/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/spdlog/include -fPIC -O3 -std=c++17 -Wno-switch-bool -DENABLE_BF16 -DENABLE_FP8 -c /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc/nv_internal/tensorrt_llm/thop/fp4Op.cpp -o fp4_quantization_sm100/fp4Op.o
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] /projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/flashinfer/data/csrc/nv_internal/tensorrt_llm/thop/fp4Op.cpp:26:10: fatal error: cuda_fp4.h: No such file or directory
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] 26 | #include <cuda_fp4.h>
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] | ^~~~~~~~~~~~
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] compilation terminated.
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718] ninja: build stopped: subcommand failed.
(EngineCore_0 pid=1083852) ERROR 08-06 17:36:39 [core.py:718]
(EngineCore_0 pid=1083852) Process EngineCore_0:
(EngineCore_0 pid=1083852)   ... (same traceback and Ninja build output as the ERROR log above) ...
Error during execution:
Traceback (most recent call last):
File "/tmp/ipykernel_1073149/2326995983.py", line 32, in <module>
llm = LLM(
^^^^
File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 277, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 494, in from_engine_args
return engine_cls.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py", line 127, in from_vllm_config
return cls(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py", line 104, in __init__
self.engine_core = EngineCoreClient.make_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 79, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 566, in __init__
super().__init__(
File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 421, in __init__
with launch_core_engines(vllm_config, executor_class,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hrithik_sagar/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
next(self.gen)
File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 697, in launch_core_engines
wait_for_engine_startup(
File "/projects/data/vision-team/hrithik_sagar/synQA_patram_2/.gpt-oss-env/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
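The Ninja output above points at the CUDA toolchain rather than the dtype: nvcc rejects `compute_100a`, and the compiler can’t find `cuda_fp4.h`, both signs that the installed CUDA toolkit is older than what FlashInfer’s sm100 FP4 kernels need (both arrived in newer CUDA 12.x releases). A quick diagnostic sketch, assuming `nvcc` is on PATH and CUDA lives under /usr/local/cuda as in the log (the header path is the usual toolkit include dir, not something vLLM guarantees):

```python
import shutil
import subprocess
from pathlib import Path

# Print the toolkit version nvcc reports; compute_100a needs a recent CUDA 12.x.
nvcc = shutil.which("nvcc") or "/usr/local/cuda/bin/nvcc"
print(subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout)

# The failing file includes <cuda_fp4.h>; check whether this toolkit ships it.
header = Path("/usr/local/cuda/include/cuda_fp4.h")
print("cuda_fp4.h present:", header.exists())
```

If either check fails, the options are the same as before: disable FlashInfer for this run, or upgrade the CUDA toolkit before letting it JIT-compile.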