Your workloads always come first. When your cluster needs GPUs that Lilac is currently using, the operator automatically and gracefully reclaims them.
## How Preemption Works
When the operator detects that your workloads need GPUs, it:
- Selects inference pods to evict — using last-in-first-out (LIFO) ordering, the most recently created Lilac pods are evicted first
- Initiates graceful drain — the selected pods receive a shutdown signal and are given the configured gracePeriod to finish in-flight requests
- Force-deletes if needed — pods that haven’t terminated after the grace period are force-deleted
- Reports to control plane — the operator notifies Lilac so traffic is rerouted to other available GPUs across the network
This entire process typically completes in under 60 seconds.
## Preemption Triggers
| Trigger | Description |
|---|---|
| Tenant reclaim | Your pod needs a GPU that Lilac is currently using |
| Schedule inactive | The availability window has closed |
| Inference disabled | You set workloads.inference: false on the pool |
| Disconnected | The operator lost contact with the control plane for over 10 minutes |
| Scale down | The control plane decided to reduce workloads on your cluster |
| Unhealthy | The health tracker detected issues with a workload pod |
## Grace Period
The gracePeriod in your GPU pool config controls how long inference pods have to finish in-flight requests:
```yaml
preemption:
  gracePeriod: 30s   # Default: 30 seconds
  priority: tenant   # Your workloads always win
```
The default of 30 seconds is usually enough for inference requests to complete. Increase it if you serve very long completions (e.g., large max_tokens values).
## What Happens to In-Flight Requests
When an inference pod is preempted:
- Completed requests are returned normally
- Streaming requests receive a clean stream termination
- New requests are automatically routed to other GPUs in the Lilac network — users experience no downtime
## Zero Impact on Your Workloads
The operator never modifies, evicts, or interferes with your pods. It only manages pods it created (labeled as Lilac inference workloads). Your workload scheduling, resource requests, and priority classes are untouched.
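The safety rule above amounts to a strict label filter before any eviction. A sketch of the idea — the label key and value are illustrative, not the operator's actual labels:

```go
package main

import "fmt"

// managedByLilac mirrors the rule above: only pods carrying the operator's
// own label are ever candidates for preemption. The label used here is an
// assumed placeholder.
func managedByLilac(labels map[string]string) bool {
	return labels["app.kubernetes.io/managed-by"] == "lilac-operator"
}

func main() {
	pods := []struct {
		name   string
		labels map[string]string
	}{
		{"your-training-job", map[string]string{"app": "train"}},
		{"lilac-inference-0", map[string]string{"app.kubernetes.io/managed-by": "lilac-operator"}},
	}
	for _, p := range pods {
		if managedByLilac(p.labels) {
			fmt.Println("eligible for preemption:", p.name)
		}
	}
}
```

Your own pods never match the filter, so they are never touched.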