How Long Prompts Block Other Requests - Optimizing LLM Performance

In the previous part of our series on LLM performance, we looked into the differences between the prefill and decode phases of token generation. In short: for the first output token (the prefill step), the entire prompt needs to be processed, which can be parallelized efficiently and can saturate GPU utilization. For all later output tokens (the decode steps), only a single additional token needs to be processed, which is less compute-intensive but must be done sequentially. When many requests are processed concurrently, any strategy that aims for low latency needs to run prefill steps for newly arriving requests while the decode steps of previously scheduled requests are still ongoing. Concurrent processing of new and running requests therefore requires careful balancing between the prefill and decode stages, which presents two major challenges that we discuss in the following. One is a readily solvable issue, while the other constitutes a more fundamental flaw.
The Simpler Challenge: Long Prompts Block the Queue
Since individual decode steps are not compute-intensive, one can increase throughput by batching decodes of multiple requests. For prefill, however, this approach does not work: because all prompt tokens are processed in parallel, a single prefill step can already saturate GPU utilization. Consequently, in the default chunked-prefill strategy of vLLM, each prefill chunk contains only prompt tokens of a single request. The next request in line has to wait until the previous prefill phase has finished before its own prefill phase can start.
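To make the per-step token budget concrete, here is a minimal sketch of a vLLM configuration with chunked prefill enabled. The model name and numbers are illustrative assumptions, not taken from our measurements.

```python
# Minimal sketch: chunked prefill with a per-step token budget in vLLM.
# Model name and all numbers are illustrative, not benchmark settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,    # split long prefills into chunks
    max_num_batched_tokens=2048,    # token budget per scheduler step
    max_num_seqs=64,                # cap on concurrently scheduled requests
)

# Decode steps of many requests are batched into one step; a prefill chunk
# of a single request can consume most of the 2048-token budget on its own.
outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=32))
```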
This sequential scheduling of prefill chunks for different requests poses a challenge: whenever a request with a very long prompt is scheduled for prefill, any subsequent request has to wait for the duration of the long prefill before its own processing starts; a long prompt blocks the prefill queue. (Note that this sequential processing of prefills is the default characteristic of chunked prefill and only appears when there is already a concurrent request in its decode phase; hence the name "partial prefill".)

Unfortunately, this challenge can neither be solved with vLLM-side priority scheduling (see the first article of this series) nor with a more sophisticated upstream scheduler. The reason is that the long prompt can be scheduled before any subsequent requests exist, so there is nothing the scheduler could wait for.
Request-Parallel Prefills
A straightforward solution would be to process prefill chunks of different requests in parallel. This might not be resource-optimal, as a single-request prefill chunk can already saturate the available compute. Any additional prefill executed in parallel would likely prolong the prefill duration a bit and slow down concurrent decode requests even further. This would be acceptable if it reduced the latency of short requests and made the system appear more responsive. The approach fails, however, when the next request in line also has a long prompt: two compute-intensive prefills would then be batched together, resulting in a severe slowdown.
In one of the latest vLLM updates, an improved strategy has been implemented: it allows parallel prefills of different requests, but with a limit on the number of concurrently processed long-prompt requests. An example configuration could enable batching of prefills for four requests, of which only one may be longer than 10,000 prompt tokens. With such a configuration, the behavior for long requests is still the same as before: long prompts are processed sequentially. Short requests, however, no longer need to wait for the long prefill of a previous request to finish; short prompts can take a fast lane. These requests no longer suffer from long waiting times and show much lower time-to-first-token metrics.
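A hedged sketch of the example configuration described above is shown below. It assumes the scheduler options that vLLM introduced for concurrent partial prefills (`max_num_partial_prefills`, `max_long_partial_prefills`, `long_prefill_token_threshold`); exact option names and defaults may differ between vLLM versions, and the model and GPU count are illustrative.

```python
# Hedged sketch of the configuration from the text: prefill chunks of up to
# four requests per step, at most one of which may be a "long" prompt
# (more than 10,000 prompt tokens). Option names follow vLLM's scheduler
# settings for concurrent partial prefills and may vary between versions.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,
    enable_chunked_prefill=True,
    max_num_partial_prefills=4,           # batch prefill chunks of up to 4 requests
    max_long_partial_prefills=1,          # at most 1 of them may be a long prompt
    long_prefill_token_threshold=10_000,  # "long" means > 10k prompt tokens
)
```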
Of course, parallel prefills can only reduce waiting times; the time-per-output-token remains elevated during the concurrent long-prefill operation. In this regard, request-parallel prefills show the same behavior and performance as standard chunked prefill, just with a shorter time-to-first-token.

The Fundamental Flaw: Token Generation Slowed Down by Parallel Prefills
Whenever prefill and decode steps of different requests are executed in the same GPU operation, that operation takes longer than an isolated decode step. The user experiences an interruption or slowdown of token generation caused by a subsequently scheduled request. In particular, a single request with a long prompt is sufficient to slow down all previously scheduled requests that are already in their decode phase.
This is a fundamental flaw in the concurrent processing of prefill and decode on the same GPUs, because there is little you can do:
- (a) You could penalize long prompts and let them wait (e.g. until all short, high-priority requests have been finished). This comes at the price of increased latency for those requests, and it does not fix the root cause: in particular, when request-parallel prefills are enabled, the slowdown also affects short-prompt requests that are scheduled after the long-prompt one. Additionally, under high load, long-prompt requests may have very little chance of being scheduled within a reasonable time. At TNG, we implemented a similar strategy in an API for batch requests, which are scheduled with very low priority.
- (b) You could run a separate inference server for long-prompt requests and a router that forwards requests depending on load and prompt length (a minimal router sketch follows this list). This approach requires more GPU resources, but the inference server for short-context requests has lower GPU-memory requirements (for example, Llama-3.3-70B needs four H100 GPUs for a context length of 130k tokens, while a second deployment with two H100 GPUs could already serve requests with context lengths below 10k tokens). However, a sophisticated router design is required in order to optimize resource utilization; for example, when there are no long-prompt requests, the larger inference server should still be utilized.
- (c) You could use separate inference engines for prefill and decode. This architecture of disaggregated prefill combines multiple vLLM deployments, each of which runs only prefill or only decode. After the prefill phase has finished, the KV cache is transferred to the decode worker, which causes a small communication overhead. But since prefill and decode run isolated on different GPUs, concurrent prefills no longer directly disrupt decodes.
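As a concrete illustration of option (b), here is a hypothetical sketch of a prompt-length-based router in front of two OpenAI-compatible vLLM servers. The server URLs, the 10,000-token threshold, and the rough four-characters-per-token estimate are illustrative assumptions; a production router would use the actual tokenizer and live load metrics.

```python
# Hypothetical sketch of option (b): route requests to one of two
# OpenAI-compatible vLLM servers based on an estimated prompt length.
# URLs, threshold, and the chars-per-token heuristic are assumptions.
import json
import urllib.request

SHORT_CONTEXT_SERVER = "http://vllm-short:8000/v1/chat/completions"
LONG_CONTEXT_SERVER = "http://vllm-long:8000/v1/chat/completions"
LONG_PROMPT_THRESHOLD_TOKENS = 10_000


def estimate_prompt_tokens(payload: dict) -> int:
    """Rough token estimate (~4 characters per token) without a tokenizer."""
    text = "".join(m.get("content", "") for m in payload.get("messages", []))
    return len(text) // 4


def route(payload: dict) -> str:
    """Pick the backend: long prompts go to the large deployment."""
    if estimate_prompt_tokens(payload) > LONG_PROMPT_THRESHOLD_TOKENS:
        return LONG_CONTEXT_SERVER
    # A production router would also consider current load, e.g. fall back
    # to the large deployment when the short-context one is busy.
    return SHORT_CONTEXT_SERVER


def forward(payload: dict) -> dict:
    """Forward a chat-completion request to the chosen backend."""
    request = urllib.request.Request(
        route(payload),
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```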
The difference between ideal concurrent processing (which would be no different from isolated requests), actual concurrent processing, and a disaggregated prefill strategy is shown by the following measurements:

Disaggregated Prefill - Optimized for Latency
Separating prefill and decode largely eliminates the slowdown of token generation in the presence of other requests, which makes it a very attractive strategy. It comes at the price of a second full-size vLLM deployment (e.g. for Llama-3.3-70B, you would need four H100 GPUs for a prefill worker and another four H100 GPUs for a decode worker if you wanted to support a maximum context length of 130k tokens). Another disadvantage is uneven GPU utilization: because prefill is compute-intensive but decode is not, the prefill worker will likely saturate GPU utilization before the decode worker does. On the other hand, large clusters could consist of different numbers of prefill and decode workers (depending on load patterns) in order to optimize resource utilization.
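The sketch below roughly follows vLLM's experimental disaggregated-prefill examples at the time of writing: one deployment acts as KV-cache producer (prefill worker), the other as consumer (decode worker). The `KVTransferConfig` fields and the `PyNcclConnector` name are version-dependent assumptions, and in practice the two workers run as separate processes on separate GPUs.

```python
# Hedged sketch of a disaggregated-prefill setup, loosely following vLLM's
# experimental examples; config fields and connector names may change
# between versions. Prefill and decode workers run in separate processes.
from vllm import LLM
from vllm.config import KVTransferConfig

# Prefill worker: produces KV caches and ships them to the decode worker.
prefill_llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,
    kv_transfer_config=KVTransferConfig(
        kv_connector="PyNcclConnector",
        kv_role="kv_producer",
        kv_rank=0,
        kv_parallel_size=2,
    ),
)

# Decode worker (separate process, separate GPUs): consumes the transferred
# KV caches and runs only decode steps.
decode_llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,
    kv_transfer_config=KVTransferConfig(
        kv_connector="PyNcclConnector",
        kv_role="kv_consumer",
        kv_rank=1,
        kv_parallel_size=2,
    ),
)
```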
Disaggregated prefill is not intended to increase total throughput, but rather total "goodput" (i.e. the rate of requests that satisfy latency targets). Consequently, it is not the best use of GPU resources if your application is not sensitive to the latency of individual requests.
Another caveat: the disaggregated prefill feature in vLLM is still experimental, and some optimizations and features are not accessible yet. For example, the supported context length is currently more limited, and the decode worker does not use CUDA graphs consistently, which causes the slower decode of the long-prompt request in the figure above. Fortunately, these are not fundamental obstacles and are likely to be resolved in future versions of vLLM.