Your workloads always come first. When your cluster needs GPUs that Lilac is currently using, the operator automatically and gracefully reclaims them.

How Preemption Works

When the operator detects that your workloads need GPUs, it:
  1. Selects inference pods to evict — using last-in-first-out (LIFO) ordering, the most recently created Lilac pods are evicted first
  2. Initiates graceful drain — the selected pods receive a shutdown signal and are given the configured gracePeriod to finish in-flight requests
  3. Force-deletes if needed — pods that haven’t terminated after the grace period are force-deleted
  4. Reports to control plane — the operator notifies Lilac so traffic is rerouted to other available GPUs across the network
This entire process typically completes in under 60 seconds.
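The four steps above can be sketched as a single reclaim loop. This is a simplified illustration, not the operator's actual implementation: the pod representation and the `shutdown`/`force_delete`/`notify` callbacks are hypothetical stand-ins for the real Kubernetes API calls.

```python
import time

def select_victims(pods, gpus_needed):
    """Step 1 -- LIFO: the most recently created pods are chosen first.
    `pods` is a list of (name, created_at) tuples, a hypothetical shape
    used only for this sketch."""
    return sorted(pods, key=lambda p: p[1], reverse=True)[:gpus_needed]

def preempt(pods, gpus_needed, shutdown, is_terminated, force_delete,
            notify, grace_period=30.0, poll=1.0):
    """Sketch of the reclaim sequence; callbacks stand in for API calls."""
    victims = select_victims(pods, gpus_needed)
    for pod in victims:
        shutdown(pod)                        # step 2: begin graceful drain
    deadline = time.monotonic() + grace_period
    while time.monotonic() < deadline and not all(map(is_terminated, victims)):
        time.sleep(poll)                     # wait out the grace period
    for pod in victims:
        if not is_terminated(pod):
            force_delete(pod)                # step 3: force-delete stragglers
    notify(victims)                          # step 4: report to control plane
    return victims
```

The LIFO ordering means the longest-running inference pods, which are most likely to hold warm caches and long streams, are the last to go.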

Preemption Triggers

| Trigger | Description |
| --- | --- |
| Tenant reclaim | Your pod needs a GPU that Lilac is currently using |
| Schedule inactive | The availability window has closed |
| Inference disabled | You set `workloads.inference: false` on the pool |
| Disconnected | The operator lost contact with the control plane for over 10 minutes |
| Scale down | The control plane decided to reduce workloads on your cluster |
| Unhealthy | The health tracker detected issues with a workload pod |
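The Disconnected trigger, for example, amounts to a heartbeat timeout: after more than 10 minutes without contact, the operator preempts on its own rather than wait for instructions. A minimal sketch (the function and variable names are illustrative, not the operator's internals):

```python
DISCONNECT_TIMEOUT = 10 * 60  # seconds, per the trigger table above

def should_preempt_on_disconnect(last_heartbeat, now):
    """True once the control plane has been unreachable for over 10 minutes.
    Timestamps are seconds from any shared monotonic reference."""
    return (now - last_heartbeat) > DISCONNECT_TIMEOUT
```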

Grace Period

The `gracePeriod` in your GPU pool config controls how long inference pods have to finish in-flight requests:

```yaml
preemption:
  gracePeriod: 30s   # Default: 30 seconds
  priority: tenant   # Your workloads always win
```
The 30-second default is usually plenty for inference requests to complete. Increase it if you serve very long completions (e.g., large `max_tokens` values).
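As a rough sizing check, the worst case is one just-started completion: its duration is approximately the maximum output length divided by decode speed. The numbers below are illustrative, not measured throughput:

```python
def needed_grace_seconds(max_tokens, tokens_per_second):
    """Rough worst-case time to finish one in-flight completion."""
    return max_tokens / tokens_per_second

# 512 tokens at 40 tok/s finishes well inside the 30 s default...
print(needed_grace_seconds(512, 40))    # 12.8
# ...but 4096 tokens at 40 tok/s would need a larger gracePeriod.
print(needed_grace_seconds(4096, 40))   # 102.4
```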

What Happens to In-Flight Requests

When an inference pod is preempted:
  • Completed requests are returned normally
  • Streaming requests receive a clean stream termination
  • New requests are automatically routed to other GPUs in the Lilac network — users experience no downtime
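The "no downtime" behavior follows from how draining pods are treated by the router: they keep serving their in-flight streams but receive no new requests. A hypothetical helper, not Lilac's actual routing logic:

```python
def route(request_id, backends):
    """Pick the first backend that is not draining. Draining pods are
    skipped for new requests while their in-flight streams finish."""
    for backend in backends:
        if not backend["draining"]:
            return backend["name"]
    raise RuntimeError("no available backend for request %s" % request_id)
```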

Zero Impact on Your Workloads

The operator never modifies, evicts, or interferes with your pods. It only manages pods it created (labeled as Lilac inference workloads). Your workload scheduling, resource requests, and priority classes are untouched.
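"Only manages pods it created" typically means scoping every action by a label selector, so tenant pods can never match. A sketch of that check; the label key and value here are assumptions for illustration, not the operator's documented labels:

```python
# Assumed label the operator stamps on its own pods (hypothetical).
LILAC_LABELS = {"app.kubernetes.io/managed-by": "lilac-operator"}

def is_lilac_pod(pod_labels):
    """True only for pods carrying the operator's own labels; tenant
    pods never match, so they are never evicted or modified."""
    return all(pod_labels.get(k) == v for k, v in LILAC_LABELS.items())
```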