Networking for Distributed Inference in llm-d

· 18 min read
Pravein Govindan Kannan
Staff Research Scientist, IBM
Liran Schour
Senior Research Scientist, IBM Research
Aleksander Slominski
Senior Research Scientist, IBM Research
Raj Joshi
Senior Machine Learning Engineer, Red Hat
Nicolò Lucchesi
Senior Machine Learning Engineer, Red Hat
Carlos Costa
Distinguished Engineer, IBM
Moein Khazraee
Senior Architect, NVIDIA
Omri Kahalon
Senior Manager, NVIDIA

Networking: The Critical Path in P/D Disaggregation

llm-d's prefill-decode (P/D) disaggregation unlocks significant efficiency gains by separating compute-heavy prefill from memory-bandwidth-heavy decode onto dedicated GPU pools. But it introduces a hard dependency on the network: the KV cache must be transferred from the prefill pool to the decode pool before decode can generate the first token. This transfer time adds directly to the Time to First Token (TTFT), which makes networking a first-order concern for end-to-end inference latency.
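To get a feel for the magnitude, here is a rough back-of-the-envelope sketch of how KV cache size translates into transfer time. It is not llm-d code: the function names, the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements), and the link speeds are all illustrative assumptions, and the estimate ignores protocol overhead, pipelining, and any overlap between transfer and compute.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for one request (hypothetical helper)."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def transfer_seconds(num_bytes, link_gbps, efficiency=1.0):
    """Idealized time to move num_bytes over a link rated in gigabits per second."""
    # efficiency < 1.0 models a link that achieves only part of its line rate.
    return (num_bytes * 8) / (link_gbps * 1e9 * efficiency)


if __name__ == "__main__":
    # Assumed model shape and prompt length, chosen only for illustration.
    size = kv_cache_bytes(num_tokens=8192, num_layers=32, num_kv_heads=8, head_dim=128)
    print(f"KV cache size: {size / 2**30:.2f} GiB")
    for gbps in (100, 400):
        ms = transfer_seconds(size, link_gbps=gbps) * 1e3
        print(f"{gbps} Gb/s link: {ms:.1f} ms added before the first token")
```

Under these assumptions, an 8K-token prompt produces about 1 GiB of KV cache, and even at full line rate the transfer costs tens of milliseconds, all of which sits on the TTFT path unless it is hidden by overlap.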

This post dives into llm-d's networking stack: how it works today and how it's evolving in collaboration with NVIDIA.