The Lilac GPU operator is a Kubernetes controller that runs inside your cluster. It discovers idle GPUs, communicates with the Lilac control plane, and manages inference workload pods — all without touching your existing workloads.

Architecture

The Sync Loop

The operator runs a reconciliation loop every 30 seconds for each GPU pool:

1. Schedule check. Is the current time within the pool's availability window? If not, the operator skips this pool.
2. GPU discovery. The operator scans nodes matching the pool's nodeSelector and counts available GPUs, distinguishing between your pods and Lilac inference pods.
3. Capacity calculation. The operator applies your configured limits (maxGPUs and maxUtilizationPct) to determine how many GPUs Lilac can use.
4. Control plane sync. The operator sends a full state snapshot (node inventory, running workloads, draining workloads) to the Lilac control plane and receives back a desired state with workload assignments.
5. Reconcile. The operator creates new inference pods for assigned workloads, drains pods that are no longer needed, and cleans up any pods that have drifted from the desired spec.
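The decision-making steps of the loop can be sketched in a few lines. This is a minimal illustration, not the operator's actual code: the function and field names (lilac_capacity, PoolLimits, reconcile) are hypothetical, and it assumes maxUtilizationPct is applied as a percentage of the pool's free GPUs.

```python
import math
from dataclasses import dataclass


@dataclass
class PoolLimits:
    max_gpus: int             # hard cap on GPUs Lilac may use in this pool
    max_utilization_pct: int  # cap as a percentage of currently free GPUs


def lilac_capacity(free_gpus: int, limits: PoolLimits) -> int:
    """Step 3 (capacity calculation): apply both configured limits."""
    by_pct = math.floor(free_gpus * limits.max_utilization_pct / 100)
    return max(0, min(free_gpus, limits.max_gpus, by_pct))


def reconcile(running: set[str], desired: set[str]) -> tuple[set[str], set[str]]:
    """Step 5 (reconcile): diff running inference pods against the
    desired state returned by the control plane."""
    to_create = desired - running  # assigned workloads with no pod yet
    to_drain = running - desired   # pods that are no longer needed
    return to_create, to_drain
```

For example, with 8 free GPUs, maxGPUs: 6, and maxUtilizationPct: 50, the sketch yields 4 usable GPUs: the 50% cap (4) is tighter than both the free count and the hard cap.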

Connection States

The operator maintains a connection state with the control plane:
State      Meaning
Connected  Syncing normally
Degraded   Sync failed; retrying on next cycle
Draining   Disconnected for over 10 minutes; gracefully shutting down inference pods
A single successful sync returns the operator from Degraded to Connected.
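These transitions amount to a small state machine. A minimal sketch, assuming the 10-minute draining threshold from the table above; the class and method names are illustrative, not the operator's actual API:

```python
from datetime import datetime, timedelta

DRAIN_AFTER = timedelta(minutes=10)  # disconnect threshold before draining


class ConnectionTracker:
    """Tracks the operator's connection state with the control plane."""

    def __init__(self, now: datetime):
        self.state = "Connected"
        self.last_success = now

    def record_sync(self, ok: bool, now: datetime) -> str:
        if ok:
            # A single successful sync returns the operator to Connected.
            self.state = "Connected"
            self.last_success = now
        elif now - self.last_success >= DRAIN_AFTER:
            # Disconnected too long: gracefully shut down inference pods.
            self.state = "Draining"
        else:
            # Sync failed; retry on the next 30-second cycle.
            self.state = "Degraded"
        return self.state
```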

Preemption

When your workloads need GPUs back, the operator handles it automatically. See Preemption for details on how this works.

What Gets Deployed

When the control plane assigns a workload, the operator creates a pod running vLLM — a high-performance inference engine. Each pod:
  • Runs a single model
  • Uses one or more GPUs on a single node
  • Is labeled and managed by the operator
  • Is automatically cleaned up when no longer needed
Your existing pods, namespaces, and resources are never modified.
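The pod properties above might translate into a manifest along these lines. This is a hedged sketch only: the pod name, label keys, image tag, and port-free layout are assumptions rather than the operator's actual output, though nvidia.com/gpu is the standard Kubernetes resource name for NVIDIA GPUs and --model is a real vLLM argument.

```python
def build_inference_pod(workload_id: str, model: str, node: str, gpus: int) -> dict:
    """Sketch of a vLLM inference pod the operator might create.
    Names and labels here are illustrative assumptions."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"lilac-inference-{workload_id}",
            # Labels let the operator find and clean up its own pods
            # without ever touching yours.
            "labels": {
                "lilac.ai/managed": "true",
                "lilac.ai/workload": workload_id,
            },
        },
        "spec": {
            "nodeName": node,  # one or more GPUs on a single node
            "containers": [{
                "name": "vllm",
                "image": "vllm/vllm-openai:latest",  # assumed image tag
                "args": ["--model", model],          # one model per pod
                "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
            }],
        },
    }
```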