While exploring the llm-d project¹, I ran into something interesting: the LeaderWorkerSet APIs² in Kubernetes. They open up some fascinating possibilities for disaggregated inference and for scaling prefill vs. decode workers independently.
LeaderWorkerSet (LWS) is an API for deploying a group of pods as a unit of replication. It aims to address common deployment patterns of AI/ML inference workloads, especially multi-host inference workloads where the LLM is sharded and run across multiple devices on multiple nodes².
🔑 Key Features
- Unified Pod Grouping: Deploy a leader pod alongside multiple worker pods as a cohesive unit.
- Dual Templates: Specify separate templates for the leader and worker pods.
- Gang Scheduling: Schedule all pods in a group simultaneously, ensuring consistency.
- Rolling Updates: Perform updates at the group level, maintaining application stability.
- Topology-Aware Placement: Control pod placement across nodes to optimize resource utilization.
- Failure Handling: Implement all-or-nothing restarts for pods within a group to maintain integrity.
Basic Example
```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: leaderworkerset-sample
spec:
  replicas: 3
  leaderWorkerTemplate:
    size: 4
    workerTemplate:
      spec:
        containers:
          - name: nginx
            image: nginxinc/nginx-unprivileged:1.27
            resources:
              limits:
                cpu: "100m"
              requests:
                cpu: "50m"
            ports:
              - containerPort: 8080
```

After applying the manifest, the pods belonging to the group can be listed by label:

```
kubectl get pods --selector=leaderworkerset.sigs.k8s.io/name=leaderworkerset-sample
NAME                         READY   STATUS    RESTARTS   AGE
leaderworkerset-sample-0     1/1     Running   0          6m10s
leaderworkerset-sample-0-1   1/1     Running   0          6m10s
leaderworkerset-sample-0-2   1/1     Running   0          6m10s
leaderworkerset-sample-0-3   1/1     Running   0          6m10s
leaderworkerset-sample-1     1/1     Running   0          6m10s
leaderworkerset-sample-1-1   1/1     Running   0          6m10s
leaderworkerset-sample-1-2   1/1     Running   0          6m10s
leaderworkerset-sample-1-3   1/1     Running   0          6m10s
leaderworkerset-sample-2     1/1     Running   0          6m10s
leaderworkerset-sample-2-1   1/1     Running   0          6m10s
leaderworkerset-sample-2-2   1/1     Running   0          6m10s
leaderworkerset-sample-2-3   1/1     Running   0          6m10s
```

With replicas: 3 and size: 4, each of the three groups consists of one leader pod (leaderworkerset-sample-N) and three worker pods (leaderworkerset-sample-N-1 through leaderworkerset-sample-N-3).

📊 Mermaid Diagram: LWS Architecture
To visualize the architecture of an LWS deployment, consider the following Mermaid diagram:
```mermaid
graph TD
    A[Leader Pod] --> B[Worker Pod 1]
    A --> C[Worker Pod 2]
    A --> D[Worker Pod 3]
    B --> E[Inference Task]
    C --> E
    D --> E
    E --> F[Model Output]
```
This diagram illustrates a leader pod coordinating multiple worker pods to handle inference tasks, culminating in the generation of model outputs.
Topology Annotations
The LWS annotation leaderworkerset.sigs.k8s.io/exclusive-topology defines a 1:1 mapping between an LWS replica and a topology domain. For example, you may want each LWS replica scheduled within a single rack to maximize cross-node communication bandwidth for distributed inference. This can be done as follows:
```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: leaderworkerset-sample
  annotations:
    leaderworkerset.sigs.k8s.io/exclusive-topology: rack
spec:
  replicas: 3
  leaderWorkerTemplate:
    ...
```

The LWS annotation leaderworkerset.sigs.k8s.io/subgroup-exclusive-topology defines a 1:1 mapping between an LWS subgroup and a topology domain. This is useful for disaggregated serving: the prefill pod group can be placed within one rack, and on a separate rack from the decode pod group, assuming both groups have the same hardware requirements.
```yaml
metadata:
  name: leaderworkerset-sample
  annotations:
    leaderworkerset.sigs.k8s.io/subgroup-exclusive-topology: rack
spec:
  replicas: 3
  leaderWorkerTemplate:
    subGroupPolicy:
      subGroupSize: 2
    size: 4
```

With size: 4 and subGroupSize: 2, each replica is split into two subgroups of two pods, and each subgroup is placed exclusively within its own topology domain.

Deployment and Installation
Prerequisites
Before deploying LWS, ensure your cluster meets these requirements³:
- Kubernetes cluster with version >= 1.26
- At least 1 node with 1+ CPUs and 1G of memory available for the LeaderWorkerSet controller
- For clusters running exactly version 1.26, you need to manually enable the StatefulSetStartOrdinal feature gate (it is enabled by default in versions newer than 1.26); see the sketch after this list
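If you are experimenting locally, the gate can be flipped at cluster creation time. The following is a minimal sketch assuming a kind cluster pinned to a 1.26 node image; on managed clusters, consult your provider's documentation instead.

```yaml
# kind-config.yaml -- sketch for a local test cluster on Kubernetes 1.26.
# The gate is only needed on 1.26; later versions enable it by default.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  StatefulSetStartOrdinal: true
nodes:
  - role: control-plane
  - role: worker
```

Create the cluster with kind create cluster --config kind-config.yaml and then install LWS as described below.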
Installation
Install LWS by following the installation guide². The installation process involves deploying the LWS controller and CRDs to your Kubernetes cluster.
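For reference, installing a released version typically boils down to applying the release manifests and checking that the controller is running. The version tag, manifest URL pattern, and lws-system namespace below reflect the upstream project's conventions as I understand them; defer to the installation guide² if they differ for your release.

```
# Sketch: substitute the current release tag from the LWS releases page.
VERSION=v0.6.1
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml

# Verify that the controller is up and the CRD is registered.
kubectl get pods -n lws-system
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io
```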
LLM Workload Deployment
LeaderWorkerSet is particularly well-suited for Large Language Model (LLM) workloads that require multi-host, multi-node distributed inference. Here are the key use cases and deployment patterns:
vLLM Integration
LWS integrates well with vLLM for distributed model serving⁴. Key requirements include:
- At least two Kubernetes nodes, each with 8 GPUs
- A vLLM deployment managed by LWS, so that a single model server spans the whole pod group
For large models like Llama 3.1 405B in FP16, which requires more than 750 GB of GPU memory, LWS provides an essential deployment pattern⁴. Sharding the model and running it across multiple devices on multiple nodes is effectively the only practical way to serve models of this size⁵.
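To make the pattern concrete, here is a trimmed-down sketch of what such a deployment can look like. It is not the official example manifest: the image tag, model name, parallelism sizes, and the Ray bootstrap commands are illustrative assumptions, and the upstream vLLM-on-LWS example⁴ remains the source of truth (it also handles details such as waiting for all Ray nodes before loading the model, which are omitted here).

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-multinode                  # hypothetical name for this sketch
spec:
  replicas: 1                           # one model server spanning the group
  leaderWorkerTemplate:
    size: 2                             # leader + 1 worker, 8 GPUs each = 16 GPUs
    leaderTemplate:
      spec:
        containers:
          - name: vllm-leader
            image: vllm/vllm-openai:latest   # assumed image; pin a real tag
            command: ["sh", "-c"]
            # Assumption: Ray head runs on the leader; tensor parallelism spans
            # the 8 GPUs within a node, pipeline parallelism spans the 2 nodes.
            args:
              - >
                ray start --head --port=6379 &&
                vllm serve meta-llama/Llama-3.1-405B-Instruct
                --tensor-parallel-size 8
                --pipeline-parallel-size 2
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8000
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: vllm/vllm-openai:latest
            command: ["sh", "-c"]
            # Workers only join the Ray cluster; LWS injects LWS_LEADER_ADDRESS
            # so each worker can resolve the leader's address.
            args:
              - >
                ray start --address=$LWS_LEADER_ADDRESS:6379 --block
            resources:
              limits:
                nvidia.com/gpu: "8"
```

The leader serves the OpenAI-compatible API on port 8000 on behalf of the whole group, while the workers contribute their GPUs through Ray.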
Multi-Node Inference Patterns
LWS addresses several critical challenges in LLM deployment:
- Model Sharding: Distributes large models across multiple nodes and GPUs
- Cross-Node Communication: Optimizes placement for maximum inter-node communication efficiency
- Resource Management: Ensures proper allocation of GPU resources across the cluster
- Fault Tolerance: Maintains service availability through coordinated pod management
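The all-or-nothing failure handling listed above is configured on the group template. The following is a minimal sketch based on my reading of the LWS API; double-check the restartPolicy field and its RecreateGroupOnPodRestart value against the API reference for the version you install.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: leaderworkerset-sample
spec:
  replicas: 3
  leaderWorkerTemplate:
    size: 4
    # Assumption: RecreateGroupOnPodRestart recreates the whole leader/worker
    # group whenever any pod in the group fails, so a sharded model never
    # keeps running with a partial group.
    restartPolicy: RecreateGroupOnPodRestart
    workerTemplate:
      spec:
        containers:
          - name: inference-worker      # placeholder container for the sketch
            image: nginxinc/nginx-unprivileged:1.27
```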
Production Use Cases
LWS has been successfully deployed for serving state-of-the-art models such as⁶:
- DeepSeek-R1 671B
- Llama 3.1 405B
- Other large transformer models requiring multi-host inference
The API is both cloud-agnostic and accelerator-agnostic, capable of running on both GPUs and TPUs across different cloud providers⁵.
Performance Considerations
When deploying LLM workloads with LWS, consider:
- Model Loading Time: Large models (e.g., Llama 3.1 70B uses about 140 GB of disk space) can take significant time to load⁷; see the probe sketch after this list
- Resource Utilization: Proper scheduling prevents GPU idle time during model loading
- Network Topology: Use topology annotations to optimize inter-node communication
- Memory Management: Account for both model weights and KV cache storage requirements⁵
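One practical consequence of long model-loading times is that default probe settings can restart a pod that is still loading weights. The snippet below is a minimal sketch of a container spec fragment, assuming a vLLM-style server with an HTTP health endpoint on port 8000; the path, port, and timings are assumptions to adapt.

```yaml
# Sketch: give a slow-loading model server up to ~30 minutes to start
# before regular health checking takes over.
containers:
  - name: vllm-leader
    image: vllm/vllm-openai:latest
    ports:
      - containerPort: 8000
    startupProbe:
      httpGet:
        path: /health               # assumed health endpoint
        port: 8000
      periodSeconds: 10
      failureThreshold: 180         # 180 x 10s = 30 minutes of grace
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      periodSeconds: 10
```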
Example LLM Deployment Pattern
For disaggregated serving architectures, you can use subgroup topology placement to separate prefill and decode operations while maintaining optimal hardware utilization and network communication patterns.
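Tying the earlier pieces together, the shape of such a deployment might look like the sketch below. It only reuses fields shown above (replicas, size, subGroupPolicy, and the subgroup topology annotation); the container, its image, and the GPU counts are illustrative assumptions rather than an llm-d recipe, and the startup logic that makes one subgroup act as prefill and the other as decode (for example, by branching on the subgroup index that LWS exposes) is deliberately left out.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: disaggregated-serving           # hypothetical name for this sketch
  annotations:
    # Each subgroup (prefill or decode) is placed exclusively in its own rack.
    leaderworkerset.sigs.k8s.io/subgroup-exclusive-topology: rack
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 4                             # 4 pods per replica...
    subGroupPolicy:
      subGroupSize: 2                   # ...split into 2 subgroups of 2 pods
    workerTemplate:
      spec:
        containers:
          - name: inference             # placeholder container for the sketch
            image: vllm/vllm-openai:latest
            resources:
              limits:
                nvidia.com/gpu: "8"
```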