Issues with running on vLLM
#10 by daksh-ifad - opened
Hi guys! I'm running into some errors with the vLLM deployment. I tried a couple of weeks ago and got different errors (something like: "The checkpoint has model type apertus but Transformers does not recognize this architecture."), but then noticed that Apertus support hadn't been merged into vLLM yet. So I'm trying again with the latest vLLM image, but I still run into errors. The logs are pasted below:
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
INFO 09-25 07:10:27 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=1) INFO 09-25 07:10:28 [api_server.py:1896] vLLM API server version 0.10.2
(APIServer pid=1) INFO 09-25 07:10:28 [utils.py:328] non-default args: {'chat_template': '/chat-templates/too_chat_template_apertus_json.jinja', 'model': 'swiss-ai/Apertus-70B-Instruct-2509', 'trust_remote_code': True, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.95, 'max_num_seqs': 20}
(APIServer pid=1) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) INFO 09-25 07:10:34 [__init__.py:742] Resolved architecture: ApertusForCausalLM
(APIServer pid=1) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=1) INFO 09-25 07:10:34 [__init__.py:1815] Using max model len 65536
(APIServer pid=1) INFO 09-25 07:10:36 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=1) File "<frozen runpy>", line 88, in _run_code
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 2011, in <module>
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 1589, in inner
(APIServer pid=1) return fn(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 212, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 114, in __init__
(APIServer pid=1) self.tokenizer = init_tokenizer_from_configs(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/tokenizer_group.py", line 123, in init_tokenizer_from_configs
(APIServer pid=1) return TokenizerGroup(
(APIServer pid=1) ^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/tokenizer_group.py", line 28, in __init__
(APIServer pid=1) self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/tokenizer.py", line 217, in get_tokenizer
(APIServer pid=1) tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 1141, in from_pretrained
(APIServer pid=1) tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
(APIServer pid=1) ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/auto_factory.py", line 815, in __getitem__
(APIServer pid=1) raise KeyError(key)
(APIServer pid=1) KeyError: <class 'transformers.models.apertus.configuration_apertus.ApertusConfig'>
If anyone has any ideas on how this could be resolved, that would be a big help! Thanks again!
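A quick way to narrow this down is to check whether the transformers build inside the vLLM image can resolve a tokenizer for the apertus model type at all, since that is exactly the lookup that raises the KeyError above. Below is a minimal sketch, assuming the container has Hub access and uses the same transformers install as vLLM:

```python
# Minimal check, run inside the same container/image as vLLM.
# Assumption: the Hub is reachable and no local model path is involved.
import transformers
from transformers import AutoConfig, AutoTokenizer

print("transformers version:", transformers.__version__)

repo = "swiss-ai/Apertus-70B-Instruct-2509"

# Resolving the config should give ApertusConfig if the model type is known.
config = AutoConfig.from_pretrained(repo)
print("config class:", type(config).__name__)

# This mirrors the AutoTokenizer.from_pretrained() call that fails inside
# vLLM's get_tokenizer(). If it raises the same KeyError here, the
# transformers version in the image does not yet map ApertusConfig to a
# tokenizer class, and upgrading transformers (or the vLLM image) is the
# likely fix.
tokenizer = AutoTokenizer.from_pretrained(repo)
print("tokenizer class:", type(tokenizer).__name__)
```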
Interestingly, I noticed that I can run RedHat's quantized version of Apertus (RedHatAI/Apertus-70B-Instruct-2509-FP8-dynamic) without hitting any of the issues above, but the official release keeps erroring out.
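For what it's worth, the same check can be run against both repos to confirm that the difference really is in tokenizer resolution rather than in vLLM itself; this is a hedged sketch, not a confirmed root cause:

```python
# Hedged comparison: run the same AutoTokenizer call against the official
# release and the FP8 repo that reportedly works. Repo names are taken
# from this thread; which one succeeds depends on the transformers version
# installed in the image.
from transformers import AutoTokenizer

for repo in (
    "swiss-ai/Apertus-70B-Instruct-2509",
    "RedHatAI/Apertus-70B-Instruct-2509-FP8-dynamic",
):
    try:
        tok = AutoTokenizer.from_pretrained(repo)
        print(f"{repo} -> {type(tok).__name__}")
    except Exception as exc:
        print(f"{repo} -> failed: {exc!r}")
```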
mjaggi changed discussion status to closed