Networking for Distributed Inference in llm-d
Networking: The Critical Path in P/D Disaggregation
llm-d's prefill-decode (P/D) disaggregation unlocks significant efficiency gains by separating compute-heavy prefill from memory-bandwidth-heavy decode onto dedicated GPU pools. But it introduces a hard dependency on the network: the KV cache must be transferred from prefill to decode before the first token can be generated. This transfer time adds directly to the Time to First Token (TTFT), making networking a first-order concern for end-to-end inference latency.
This post dives into llm-d's networking stack — how it works today and how it's evolving in collaboration with NVIDIA.
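To make the KV cache transfer cost described above concrete, here is a rough back-of-the-envelope sketch. It is not taken from llm-d: the function names are hypothetical, the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is an illustrative assumption for a small grouped-query-attention model, and the link speeds are example values. It also assumes the link is fully utilized and ignores protocol overhead, so real transfers will be slower.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size for one request.

    The leading factor of 2 covers keys and values, stored per layer,
    per KV head, per token. All defaults are illustrative assumptions.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def transfer_seconds(num_bytes, link_gbps, efficiency=1.0):
    """Idealized time to move num_bytes over a link rated in gigabits/s.

    efficiency (0..1) is a hypothetical knob for achievable throughput;
    1.0 means the full line rate, which is an optimistic assumption.
    """
    bytes_per_second = link_gbps * 1e9 / 8 * efficiency
    return num_bytes / bytes_per_second


if __name__ == "__main__":
    size = kv_cache_bytes(num_tokens=8192)
    for gbps in (100, 400):
        ms = transfer_seconds(size, gbps) * 1e3
        print(f"{size / 2**30:.2f} GiB over {gbps} Gb/s: {ms:.1f} ms")
```

Under these assumptions, an 8192-token prompt produces about 1 GiB of KV cache, which takes roughly 86 ms at 100 Gb/s and roughly 21 ms at 400 Gb/s. Every millisecond of that is added to TTFT unless the transfer is overlapped with other work.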


