A GPU pool is a custom resource that tells the operator which GPUs in your cluster are available for Lilac inference workloads. You control which nodes are included, how many GPUs Lilac can use, during which hours, and how preemption works.
Creating a GPU Pool
Quickstart
Save a basic GPUPool resource as gpu-pool.yaml, then apply it to your cluster:
apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: b200-gpu-pool
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: B200
  cache:
    enabled: true
    capacity: 1000Gi
  workloads:
    inference: true
kubectl apply -f gpu-pool.yaml
Fully featured example
Use a fuller manifest when you want to cap the number of GPUs, define availability windows, or configure preemption behavior:
apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: b200-gpu-pool
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: B200
  capacity:
    maxGPUs: 64
    maxUtilizationPct: 75
  schedule:
    mode: scheduled
    timezone: America/New_York
    windows:
      - days: [mon, tue, wed, thu, fri]
        start: "18:00"
        end: "08:00"
      - days: [sat, sun] # all day
  preemption:
    gracePeriod: 30s
    priority: tenant
  cache:
    enabled: true
    capacity: 1000Gi
  hfTokenSecretRef:
    name: huggingface
    key: token
  workloads:
    inference: true
Model Cache
Setting up a model cache is optional, but highly recommended. The cache keeps downloaded model weights on each node so repeat cold starts do not need to fetch everything from Hugging Face again.
With a warm cache, cold-start time can drop by up to 80%, so idle GPUs start serving workloads and earning money sooner. A 1 TB cache (1000Gi, as in the Quickstart example) is a good baseline. For larger GPUs such as H200s, B200s, and B300s, use 2 TB or more, because those GPUs typically serve larger models.
Starting from scratch
If you are creating a new GPUPool, include cache in the resource:
apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: b200-gpu-pool
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: B200
  cache:
    enabled: true
    capacity: 1000Gi
  workloads:
    inference: true
Modify an existing GPU pool
If you already have a GPUPool, patch it to enable the cache:
kubectl -n lilac-system patch gpupool b200-gpu-pool --type merge -p '{
  "spec": {
    "cache": {
      "enabled": true,
      "capacity": "1000Gi"
    }
  }
}'
Hugging Face Token
Setting up a Hugging Face token is optional, but highly recommended. The token avoids Hugging Face rate limits for unauthenticated downloads and allows models to download at full speed.
Generate a Hugging Face access token, then create a Kubernetes Secret in the same namespace as your GPUPool:
kubectl -n {GPU pool namespace} create secret generic huggingface --from-literal=token={hf_token}
For the examples below, the GPU pool namespace is lilac-system.
Starting from scratch
If you are creating a new GPUPool, include hfTokenSecretRef in the resource:
apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: b200-gpu-pool
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: B200
  hfTokenSecretRef:
    name: huggingface
    key: token
  workloads:
    inference: true
Apply it with:
kubectl apply -f gpu-pool.yaml
Modify an existing GPU pool
If you already have a GPUPool, patch it to attach the Hugging Face token secret:
kubectl -n lilac-system patch gpupool b200-gpu-pool --type merge -p '{
  "spec": {
    "hfTokenSecretRef": {
      "name": "huggingface",
      "key": "token"
    }
  }
}'
Configuration Reference
nodeSelector
Standard Kubernetes label selector. Only nodes matching these labels are included in the pool.
nodeSelector:
  nvidia.com/gpu.product: B200              # GPU model
  topology.kubernetes.io/zone: us-east-1a   # Optional: limit to a zone
capacity
Control how much of your GPU fleet Lilac can use.
| Field | Type | Description |
|---|---|---|
| maxGPUs | integer | Maximum number of GPUs Lilac can use across all nodes |
| maxUtilizationPct | integer (0–100) | Maximum percentage of matching GPUs Lilac can consume. If omitted, no percentage cap is applied |
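
The fragment below shows both capacity fields from the table set together. The values are illustrative only. This page does not state how the two caps interact when both are present; a reasonable assumption is that the stricter one wins, but confirm that against the operator's actual behavior.

```yaml
capacity:
  maxGPUs: 16            # illustrative: hard cap on GPUs across all nodes
  maxUtilizationPct: 50  # illustrative: at most half of the matching GPUs
```

Omitting maxUtilizationPct leaves only the absolute maxGPUs cap in effect, as described in the table.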
schedule
Define when GPUs are available for Lilac workloads.
| Mode | Behavior |
|---|---|
| always | GPUs are always available (respecting capacity limits) |
| scheduled | GPUs are only available during defined time windows |
schedule:
  mode: scheduled
  timezone: America/New_York
  windows:
    - days: [mon, tue, wed, thu, fri]
      start: "18:00"
      end: "08:00"
    - days: [sat, sun] # all day: omit start/end
Use mode: always if you have dedicated GPUs that aren’t used for other workloads. Use mode: scheduled to share GPUs between your workloads (daytime) and Lilac (evenings/weekends).
preemption
Controls what happens when your workloads need GPUs back.
| Field | Type | Description |
|---|---|---|
| gracePeriod | duration | Time given to inference pods to finish in-flight requests before termination |
| priority | string | tenant means your workloads always take priority |
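
As a minimal sketch, a preemption block that uses both fields from the table might look like this. The 120s grace period is an illustrative value, written in the same seconds notation as the 30s used elsewhere on this page; tenant is the only priority value documented here.

```yaml
preemption:
  gracePeriod: 120s  # illustrative: more time for long-running requests to finish
  priority: tenant   # your workloads always take priority
```

A longer grace period lets in-flight inference requests finish, at the cost of your own workloads waiting longer to get their GPUs back.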
cache
Configures a shared Hugging Face model cache on each node in the pool. Omitting this block disables caching, so vLLM pods download model weights from Hugging Face on every cold start.
| Field | Type | Description |
|---|---|---|
| enabled | boolean | Enable the shared model cache and cache pruner. Defaults to true when cache is configured |
| capacity | quantity | Default per-node cache size. Use 1000Gi as a baseline, or more for larger GPUs such as H200s, B200s, and B300s |
| retention.maxAge | duration | Evict cached models older than this duration. Defaults to 720h |
| overrides | array | Per-node cache capacity overrides selected by node labels |
cache:
  enabled: true
  capacity: 1000Gi
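
For larger GPUs, a cache block with a bigger capacity and a shorter retention period could look like the sketch below. It assumes that the dotted field name retention.maxAge corresponds to a nested retention block, which this page does not show explicitly. The overrides field is left out because its item schema is not described here.

```yaml
cache:
  enabled: true
  capacity: 2000Gi   # roughly the 2 TB suggested for H200s, B200s, and B300s
  retention:
    maxAge: 336h     # illustrative: evict models older than 14 days (default is 720h)
```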
hfTokenSecretRef
References the Kubernetes Secret key that stores your Hugging Face access token.
| Field | Type | Description |
|---|---|---|
| name | string | Secret name in the same namespace as the GPUPool |
| key | string | Secret key containing the Hugging Face token |
hfTokenSecretRef:
  name: huggingface
  key: token
workloads
Toggle which workload types this pool accepts.
| Field | Type | Description |
|---|---|---|
| inference | boolean | Allow inference workloads on this pool |
Multiple Pools
You can create multiple GPU pools for different hardware or schedules:
# Pool for A100 GPUs, always available
apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: dedicated-a100s
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100
  capacity:
    maxGPUs: 4
  schedule:
    mode: always
  preemption:
    gracePeriod: 30s
    priority: tenant
  workloads:
    inference: true
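
To illustrate a pool with different hardware and a different schedule, here is a sketch of a second manifest that could sit alongside the A100 pool. The pool name weekend-h100s and the H100 label value are hypothetical, so adjust both to your cluster. Every field is taken from the Configuration Reference above.

```yaml
# Hypothetical second pool: H100 GPUs, available on weekends only
apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: weekend-h100s              # hypothetical name
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: H100   # assumed label value; match your nodes' labels
  capacity:
    maxGPUs: 8                     # illustrative cap
  schedule:
    mode: scheduled
    timezone: America/New_York
    windows:
      - days: [sat, sun]           # all day: start/end omitted
  preemption:
    gracePeriod: 30s
    priority: tenant
  workloads:
    inference: true
```

Giving each pool a distinct nodeSelector keeps it clear which nodes belong to which pool.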
Checking Pool Status
kubectl get gpupool -n lilac-system
NAME              PHASE    GPUS   IDLE   WORKLOADS   AGE
b200-gpu-pool     Active   8      6      3           2d
dedicated-a100s   Active   4      4      2           1d
For detailed status:
kubectl describe gpupool b200-gpu-pool -n lilac-system