Issues with running on vLLM

#10
by daksh-ifad - opened

Hi guys! I'm running into some errors with the vLLM deployment. I tried a couple of weeks ago and got a different error (something like: "The checkpoint has model type apertus but Transformers does not recognize this architecture."), but then noticed that Apertus support hadn't yet been merged into vLLM. So I'm trying again with the latest vLLM image, but I still run into errors. The logs are pasted below:

/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO 09-25 07:10:27 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=1) INFO 09-25 07:10:28 [api_server.py:1896] vLLM API server version 0.10.2
(APIServer pid=1) INFO 09-25 07:10:28 [utils.py:328] non-default args: {'chat_template': '/chat-templates/too_chat_template_apertus_json.jinja', 'model': 'swiss-ai/Apertus-70B-Instruct-2509', 'trust_remote_code': True, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.95, 'max_num_seqs': 20}
(APIServer pid=1) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) INFO 09-25 07:10:34 [__init__.py:742] Resolved architecture: ApertusForCausalLM
(APIServer pid=1) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=1) INFO 09-25 07:10:34 [__init__.py:1815] Using max model len 65536
(APIServer pid=1) INFO 09-25 07:10:36 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=1)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 2011, in <module>
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 1589, in inner
(APIServer pid=1)     return fn(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 212, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 114, in __init__
(APIServer pid=1)     self.tokenizer = init_tokenizer_from_configs(
(APIServer pid=1)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/tokenizer_group.py", line 123, in init_tokenizer_from_configs
(APIServer pid=1)     return TokenizerGroup(
(APIServer pid=1)            ^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/tokenizer_group.py", line 28, in __init__
(APIServer pid=1)     self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
(APIServer pid=1)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/tokenizer.py", line 217, in get_tokenizer
(APIServer pid=1)     tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 1141, in from_pretrained
(APIServer pid=1)     tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
(APIServer pid=1)                                                ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/auto_factory.py", line 815, in __getitem__
(APIServer pid=1)     raise KeyError(key)
(APIServer pid=1) KeyError: <class 'transformers.models.apertus.configuration_apertus.ApertusConfig'>

If anyone has any ideas on how this could be resolved, that would be a big help. Thanks!
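In case it helps narrow things down, here's a minimal diagnostic sketch (my own, assuming it's run inside the same vLLM image): the traceback above ends in `AutoTokenizer.from_pretrained`, so reproducing that call standalone should tell us whether the transformers version bundled in the image simply lacks a tokenizer mapping for `ApertusConfig`, independent of vLLM:

```python
# Diagnostic sketch (assumption: run inside the same vLLM container).
# If this raises the same KeyError, the problem is the bundled
# transformers version rather than vLLM itself.
import transformers
from transformers import AutoTokenizer

print(transformers.__version__)
tok = AutoTokenizer.from_pretrained("swiss-ai/Apertus-70B-Instruct-2509")
print(type(tok))  # should print a tokenizer class, not raise KeyError
```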

Interestingly, I noticed that I can run RedHat's quantized version of Apertus (RedHatAI/Apertus-70B-Instruct-2509-FP8-dynamic) without running into any of the issues above, but the official release keeps erroring out.
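One guess as to why (purely speculative on my part): `AutoTokenizer` only falls back to the config-class mapping, which is where the KeyError is raised, when a repo's tokenizer_config.json doesn't name a `tokenizer_class` explicitly. Comparing the two repos would confirm or rule that out:

```python
# Speculative check: if the RedHatAI repo pins tokenizer_class in its
# tokenizer_config.json and the official repo doesn't, that would explain
# why only the official one hits the TOKENIZER_MAPPING lookup.
import json
from huggingface_hub import hf_hub_download

for repo in ("swiss-ai/Apertus-70B-Instruct-2509",
             "RedHatAI/Apertus-70B-Instruct-2509-FP8-dynamic"):
    path = hf_hub_download(repo_id=repo, filename="tokenizer_config.json")
    with open(path) as f:
        print(repo, "->", json.load(f).get("tokenizer_class"))
```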

mjaggi changed discussion status to closed
