Your workloads always come first. When your cluster needs GPUs that Lilac is currently using, the operator automatically and gracefully reclaims them.

How Preemption Works

When the operator detects that your workloads need GPUs, it:
  1. Selects inference pods to evict — using last-in-first-out (LIFO) ordering, the most recently created Lilac pods are evicted first
  2. Initiates graceful drain — the selected pods receive a shutdown signal and are given the configured gracePeriod to finish in-flight requests
  3. Force-deletes if needed — pods that haven’t terminated after the grace period are force-deleted
  4. Reports to control plane — the operator notifies Lilac so traffic is rerouted to other available GPUs across the network
This entire process typically completes in under 60 seconds.
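The four steps above can be sketched as a single reclaim loop. This is a simplified illustration, not the operator's actual implementation: the pod representation and the `shutdown`/`force_delete`/`notify` callbacks are hypothetical stand-ins for the real Kubernetes API calls.

```python
import time

def select_victims(pods, gpus_needed):
    """Step 1 -- LIFO: the most recently created pods are chosen first.
    `pods` is a list of (name, created_at) tuples, a hypothetical shape
    used only for this sketch."""
    return sorted(pods, key=lambda p: p[1], reverse=True)[:gpus_needed]

def preempt(pods, gpus_needed, shutdown, is_terminated, force_delete,
            notify, grace_period=30.0, poll=1.0):
    """Sketch of the reclaim sequence; callbacks stand in for API calls."""
    victims = select_victims(pods, gpus_needed)
    for pod in victims:
        shutdown(pod)                        # step 2: begin graceful drain
    deadline = time.monotonic() + grace_period
    while time.monotonic() < deadline and not all(map(is_terminated, victims)):
        time.sleep(poll)                     # wait out the grace period
    for pod in victims:
        if not is_terminated(pod):
            force_delete(pod)                # step 3: force-delete stragglers
    notify(victims)                          # step 4: report to control plane
    return victims
```

The LIFO ordering means the longest-running inference pods, which are most likely to hold warm caches and long streams, are the last to go.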

Preemption Triggers

| Trigger | Description |
| --- | --- |
| Tenant reclaim | Your pod needs a GPU that Lilac is currently using |
| Schedule inactive | The availability window has closed |
| Inference disabled | You set `workloads.inference: false` on the pool |
| Disconnected | The operator lost contact with the control plane for over 10 minutes |
| Scale down | The control plane decided to reduce workloads on your cluster |
| Unhealthy | The health tracker detected issues with a workload pod |
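The Disconnected trigger, for example, amounts to a heartbeat timeout: after more than 10 minutes without contact, the operator preempts on its own rather than wait for instructions. A minimal sketch (the function and variable names are illustrative, not the operator's internals):

```python
DISCONNECT_TIMEOUT = 10 * 60  # seconds, per the trigger table above

def should_preempt_on_disconnect(last_heartbeat, now):
    """True once the control plane has been unreachable for over 10 minutes.
    Timestamps are seconds from any shared monotonic reference."""
    return (now - last_heartbeat) > DISCONNECT_TIMEOUT
```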

Grace Period

The `gracePeriod` in your GPU pool config controls how long inference pods have to finish in-flight requests:

```yaml
preemption:
  gracePeriod: 30s   # Default: 30 seconds
  priority: tenant   # Your workloads always win
```
The 30-second default is usually plenty for inference requests to complete. Increase it if you serve very long completions (e.g., large `max_tokens` values).
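As a rough sizing check, the worst case is one just-started completion: its duration is approximately the maximum output length divided by decode speed. The numbers below are illustrative, not measured throughput:

```python
def needed_grace_seconds(max_tokens, tokens_per_second):
    """Rough worst-case time to finish one in-flight completion."""
    return max_tokens / tokens_per_second

# 512 tokens at 40 tok/s finishes well inside the 30 s default...
print(needed_grace_seconds(512, 40))    # 12.8
# ...but 4096 tokens at 40 tok/s would need a larger gracePeriod.
print(needed_grace_seconds(4096, 40))   # 102.4
```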

What Happens to In-Flight Requests

When an inference pod is preempted:
  • Completed requests are returned normally
  • Streaming requests receive a clean stream termination
  • New requests are automatically routed to other GPUs in the Lilac network — users experience no downtime
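The "no downtime" behavior follows from how draining pods are treated by the router: they keep serving their in-flight streams but receive no new requests. A hypothetical helper, not Lilac's actual routing logic:

```python
def route(request_id, backends):
    """Pick the first backend that is not draining. Draining pods are
    skipped for new requests while their in-flight streams finish."""
    for backend in backends:
        if not backend["draining"]:
            return backend["name"]
    raise RuntimeError("no available backend for request %s" % request_id)
```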

Zero Impact on Your Workloads

The operator never modifies, evicts, or interferes with your pods. It only manages pods it created (labeled as Lilac inference workloads). Your workload scheduling, resource requests, and priority classes are untouched.
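"Only manages pods it created" typically means scoping every action by a label selector, so tenant pods can never match. A sketch of that check; the label key and value here are assumptions for illustration, not the operator's documented labels:

```python
# Assumed label the operator stamps on its own pods (hypothetical).
LILAC_LABELS = {"app.kubernetes.io/managed-by": "lilac-operator"}

def is_lilac_pod(pod_labels):
    """True only for pods carrying the operator's own labels; tenant
    pods never match, so they are never evicted or modified."""
    return all(pod_labels.get(k) == v for k, v in LILAC_LABELS.items())
```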