Your workloads always come first. When your cluster needs GPUs that Lilac is currently using, the operator automatically and gracefully reclaims them.
How Preemption Works
When the operator detects that your workloads need GPUs, it:
- Selects inference pods to evict: using last-in-first-out (LIFO) ordering, the most recently created Lilac pods are evicted first
- Initiates graceful drain: the selected pods receive a shutdown signal and are given the configured gracePeriod to finish in-flight requests
- Force-deletes if needed: pods that haven't terminated after the grace period are force-deleted
- Reports to control plane: the operator notifies Lilac so traffic is rerouted to other available GPUs across the network
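The LIFO selection in the first step can be illustrated with a minimal Python sketch. This is not the operator's actual code: the `InferencePod` type, its fields, and the `select_pods_to_evict` function are hypothetical names chosen for illustration, and the sketch assumes each pod holds a known number of GPUs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class InferencePod:
    """Hypothetical stand-in for a Lilac inference pod (not the operator's real type)."""
    name: str
    created_at: datetime
    gpus: int


def select_pods_to_evict(pods, gpus_needed):
    """Pick pods newest-first (LIFO) until enough GPUs would be freed."""
    selected = []
    freed = 0
    for pod in sorted(pods, key=lambda p: p.created_at, reverse=True):
        if freed >= gpus_needed:
            break
        selected.append(pod)
        freed += pod.gpus
    return selected


t0 = datetime(2024, 1, 1, 12, 0, 0)
pods = [
    InferencePod("lilac-a", t0, 1),
    InferencePod("lilac-b", t0 + timedelta(minutes=5), 1),
    InferencePod("lilac-c", t0 + timedelta(minutes=10), 1),
]
evicted = select_pods_to_evict(pods, gpus_needed=2)
print([p.name for p in evicted])  # ['lilac-c', 'lilac-b']
```

Evicting the newest pods first leaves the longest-running pod (`lilac-a` here) in place.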
Preemption Triggers
| Trigger | Description |
|---|---|
| Tenant reclaim | Your pod needs a GPU that Lilac is currently using |
| Schedule inactive | The availability window has closed |
| Inference disabled | You set workloads.inference: false on the pool |
| Disconnected | The operator lost contact with the control plane for over 10 minutes |
| Scale down | The control plane decided to reduce workloads on your cluster |
| Unhealthy | The health tracker detected issues with a workload pod |
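As a rough illustration of how some of these triggers might be evaluated, the Python sketch below checks the four conditions that can be derived from local state. Everything here is an assumption made for the example: the `Trigger` enum, the `PoolState` fields, and the `local_triggers` function are hypothetical, and treating any pending tenant GPU demand as a tenant reclaim is a simplification. Only the 10-minute disconnect threshold comes from the table above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Trigger(Enum):
    TENANT_RECLAIM = "Tenant reclaim"
    SCHEDULE_INACTIVE = "Schedule inactive"
    INFERENCE_DISABLED = "Inference disabled"
    DISCONNECTED = "Disconnected"


# "Over 10 minutes" without control plane contact, per the table.
DISCONNECT_LIMIT = timedelta(minutes=10)


@dataclass
class PoolState:
    """Hypothetical snapshot of a GPU pool as seen by the operator."""
    pending_tenant_gpus: int
    schedule_active: bool
    inference_enabled: bool
    last_control_plane_contact: datetime


def local_triggers(state, now):
    """Return the triggers that currently apply, in table order."""
    triggers = []
    if state.pending_tenant_gpus > 0:
        triggers.append(Trigger.TENANT_RECLAIM)
    if not state.schedule_active:
        triggers.append(Trigger.SCHEDULE_INACTIVE)
    if not state.inference_enabled:
        triggers.append(Trigger.INFERENCE_DISABLED)
    if now - state.last_control_plane_contact > DISCONNECT_LIMIT:
        triggers.append(Trigger.DISCONNECTED)
    return triggers


now = datetime(2024, 1, 1, 12, 0, 0)
state = PoolState(
    pending_tenant_gpus=0,
    schedule_active=True,
    inference_enabled=True,
    last_control_plane_contact=now - timedelta(minutes=11),
)
print(local_triggers(state, now))  # [<Trigger.DISCONNECTED: 'Disconnected'>]
```

The scale-down and unhealthy triggers are left out because they depend on signals from the control plane and the health tracker, which this sketch does not model.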
Grace Period
The gracePeriod in your GPU pool config controls how long inference pods have to finish in-flight requests before they are force-deleted.
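The drain-then-force-delete behavior can be sketched in Python as a loop with a deadline. This is only an illustration under stated assumptions: `drain_pod` and its callback parameters are hypothetical, the grace period is assumed to be expressed in seconds, and the polling interval is invented for the example. The clock and sleep functions are injectable so the demo can run against a simulated clock.

```python
import time


def drain_pod(send_shutdown, is_terminated, force_delete, grace_period_s,
              poll_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Signal shutdown, wait up to grace_period_s, then force-delete if still running."""
    send_shutdown()
    deadline = clock() + grace_period_s
    while clock() < deadline:
        if is_terminated():
            return "drained"
        # Never sleep past the deadline.
        sleep(min(poll_s, deadline - clock()))
    if is_terminated():
        return "drained"
    force_delete()
    return "force_deleted"


class SimClock:
    """Simulated clock so the demo finishes instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


sim = SimClock()
events = []
result = drain_pod(
    send_shutdown=lambda: events.append("shutdown"),
    is_terminated=lambda: sim.now >= 3,  # pod exits 3 s after the signal
    force_delete=lambda: events.append("force_delete"),
    grace_period_s=30,
    clock=sim.time,
    sleep=sim.sleep,
)
print(result, events)  # drained ['shutdown']
```

A pod that finishes its in-flight work within the grace period is never force-deleted. A shorter grace period returns GPUs to your workloads sooner but raises the chance that long-running requests are cut off.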
What Happens to In-Flight Requests
When an inference pod is preempted:
- Completed requests are returned normally
- Streaming requests receive a clean stream termination
- New requests are automatically routed to other GPUs in the Lilac network, so users experience no downtime

