Networking for Distributed Inference in llm-d

June 8, 2026 · 18 min read

Pravein Govindan Kannan

Staff Research Scientist, IBM

Liran Schour

Senior Research Scientist, IBM Research

Aleksander Slominski

Senior Research Scientist, IBM Research

Raj Joshi

Senior Machine Learning Engineer, Red Hat

Nicolò Lucchesi

Senior Machine Learning Engineer, Red Hat

Carlos Costa

Distinguished Engineer, IBM

Moein Khazraee

Senior Architect, NVIDIA

Omri Kahalon

Senior Manager, NVIDIA

Networking: The Critical Path in P/D Disaggregation

llm-d's prefill-decode (P/D) disaggregation unlocks significant efficiency gains by separating compute-heavy prefill from memory-bandwidth-heavy decode onto dedicated GPU pools. But it introduces a hard dependency on the network: the KV cache must be transferred from the prefill worker to the decode worker before the first token can be generated. This transfer time adds directly to the Time to First Token (TTFT), which makes networking a first-order concern for end-to-end inference latency.
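To get a feel for why the transfer matters, here is a rough back-of-envelope sketch. It is not taken from llm-d; the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements), the prompt length, and the link speeds are all hypothetical values chosen for illustration, and the estimate ignores protocol overhead, pipelining, and any overlap with compute.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for one request.

    The leading factor of 2 accounts for storing both keys and values.
    All model dimensions passed in are illustrative assumptions.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def transfer_seconds(size_bytes, link_gbps, efficiency=1.0):
    """Idealized time to move size_bytes over a link of link_gbps gigabits/s.

    efficiency is a hypothetical fudge factor (0 < efficiency <= 1) for the
    fraction of line rate actually achieved.
    """
    return size_bytes * 8 / (link_gbps * 1e9 * efficiency)


if __name__ == "__main__":
    # Hypothetical model and prompt: 32 layers, 8 KV heads, head_dim 128, 8192 tokens.
    size = kv_cache_bytes(num_tokens=8192, num_layers=32, num_kv_heads=8, head_dim=128)
    print(f"KV cache size: {size / 2**30:.2f} GiB")
    for gbps in (100, 400):
        ms = transfer_seconds(size, gbps) * 1e3
        print(f"{gbps} Gb/s link: ~{ms:.1f} ms added before the first token")
```

Under these assumptions the cache comes to about 1 GiB, which takes tens of milliseconds to move even at full line rate, so any inefficiency in the transfer path shows up directly in TTFT.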

This post dives into llm-d's networking stack — how it works today and how it's evolving in collaboration with NVIDIA.
