# Architecture

## The Sync Loop
The operator runs a reconciliation loop every 30 seconds for each GPU pool:

1. **Schedule check.** Is the current time within the pool's availability window? If not, the operator skips this pool.
2. **GPU discovery.** The operator scans nodes matching the pool's `nodeSelector` and counts available GPUs, distinguishing between your pods and Lilac inference pods.
3. **Capacity calculation.** Applies your configured limits (`maxGPUs` and `maxUtilizationPct`) to determine how many GPUs Lilac can use.
4. **Control plane sync.** Sends a full state snapshot (node inventory, running workloads, draining workloads) to the Lilac control plane and receives back a desired state with workload assignments.
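The capacity-calculation step can be sketched as a small function. The exact formula is an assumption for illustration: Lilac's share is capped by both `maxGPUs` and `maxUtilizationPct`, and it only receives GPUs your own pods are not using.

```go
package main

import "fmt"

// Pool holds the configured limits for one GPU pool. The field names
// mirror the documented settings; the combination rule below is an
// assumed interpretation, not the operator's exact implementation.
type Pool struct {
	MaxGPUs           int // hard cap on GPUs Lilac may use
	MaxUtilizationPct int // cap as a percentage of the pool's total GPUs
}

// lilacCapacity returns how many GPUs Lilac inference pods may occupy,
// given the pool's total GPU count and the GPUs already used by your
// own (non-Lilac) pods.
func lilacCapacity(p Pool, totalGPUs, usedByYou int) int {
	// Percentage-based cap, rounded down.
	pctCap := totalGPUs * p.MaxUtilizationPct / 100
	limit := min(p.MaxGPUs, pctCap)
	// Lilac only gets GPUs your workloads are not using.
	free := totalGPUs - usedByYou
	return max(0, min(limit, free))
}

func main() {
	// 8-GPU pool, Lilac capped at 4 GPUs and 50% utilization,
	// 1 GPU already taken by your pods: Lilac may use 4.
	fmt.Println(lilacCapacity(Pool{MaxGPUs: 4, MaxUtilizationPct: 50}, 8, 1))
}
```

Taking the minimum of the two limits means whichever cap is stricter wins, so a small `maxUtilizationPct` protects a small pool even when `maxGPUs` is generous.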
## Connection States
The operator maintains a connection state with the control plane:

| State | Meaning |
|---|---|
| Connected | Syncing normally |
| Degraded | Sync failed, retrying on next cycle |
| Draining | Disconnected for over 10 minutes — gracefully shutting down inference pods |
## Preemption

When your workloads need GPUs back, the operator handles it automatically. See Preemption for details on how this works.

## What Gets Deployed
When the control plane assigns a workload, the operator creates a pod running vLLM, a high-performance inference engine. Each pod:

- Runs a single model
- Uses one or more GPUs on a single node
- Is labeled and managed by the operator
- Is automatically cleaned up when no longer needed
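A pod matching the properties above might look roughly like the following manifest. The name, label keys, and model are hypothetical placeholders; only the single-node placement, GPU resource limit, and operator-managed labeling are taken from the text.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: lilac-vllm-example            # hypothetical name
  labels:
    lilac.example/managed: "true"     # hypothetical label keys, shown to
    lilac.example/workload: example   # illustrate operator-managed labeling
spec:
  nodeName: gpu-node-1                # placed on a single node
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      args: ["--model", "example-org/example-model"]  # one model per pod
      resources:
        limits:
          nvidia.com/gpu: 1           # one or more GPUs on this node
```

Because every such pod carries the operator's labels, cleanup is a label-selector delete once the control plane no longer assigns the workload.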

