A GPU pool is a custom resource that tells the operator which GPUs in your cluster are available for Lilac inference workloads. You control which nodes are included, how many GPUs Lilac can use, during what hours, and how preemption works.

Creating a GPU Pool

Quickstart

Apply a basic GPUPool resource to your cluster:
apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: b200-gpu-pool
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: B200
  cache:
    enabled: true
    capacity: 1000Gi
  workloads:
    inference: true
Save the manifest as gpu-pool.yaml and apply it:
kubectl apply -f gpu-pool.yaml
Use a fuller manifest when you want to cap the number of GPUs, define availability windows, or configure preemption behavior:
apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: b200-gpu-pool
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: B200
  capacity:
    maxGPUs: 64
    maxUtilizationPct: 75
  schedule:
    mode: scheduled
    timezone: America/New_York
    windows:
      - days: [mon, tue, wed, thu, fri]
        start: "18:00"
        end: "08:00"
      - days: [sat, sun]    # all day
  preemption:
    gracePeriod: 30s
    priority: tenant
  cache:
    enabled: true
    capacity: 1000Gi
  hfTokenSecretRef:
    name: huggingface
    key: token
  workloads:
    inference: true

Model Cache

Setting up a model cache is optional but highly recommended. The cache keeps downloaded model weights on each node, so repeat cold starts do not need to fetch everything from Hugging Face again. With a warm cache, cold-start time can be reduced by up to 80%, which lets idle GPUs start serving workloads and earning money sooner.

A 1 TB cache is a good baseline and is what the Quickstart example above uses. For larger GPUs such as H200s, B200s, and B300s, use 2 TB or more for the fastest cold starts, because those GPUs typically serve larger models.

Starting from scratch

If you are creating a new GPUPool, include cache in the resource:
apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: b200-gpu-pool
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: B200
  cache:
    enabled: true
    capacity: 1000Gi
  workloads:
    inference: true

Modify an existing GPU pool

If you already have a GPUPool, patch it to enable the cache:
kubectl -n lilac-system patch gpupool b200-gpu-pool --type merge -p '{
  "spec": {
    "cache": {
      "enabled": true,
      "capacity": "1000Gi"
    }
  }
}'

Hugging Face Token

Setting up a Hugging Face token is optional, but highly recommended. The token avoids Hugging Face rate limits for unauthenticated downloads and allows models to download at full speed. Generate a Hugging Face access token, then create a Kubernetes Secret in the same namespace as your GPUPool:
kubectl -n {GPU pool namespace} create secret generic huggingface --from-literal=token={hf_token}
For the examples below, the GPU pool namespace is lilac-system.
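
If you manage Secrets declaratively rather than with kubectl create, the same Secret can be written as a manifest. This is a minimal sketch using the standard Kubernetes Secret schema; the name huggingface and the key token match the command above, and the token value is a placeholder you must replace. Avoid committing a real token to version control.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: huggingface            # must match hfTokenSecretRef.name
  namespace: lilac-system      # same namespace as the GPUPool
type: Opaque
stringData:
  token: "<your Hugging Face token>"   # placeholder, not a real value
```

With stringData, Kubernetes base64-encodes the value for you when the Secret is stored.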

Starting from scratch

If you are creating a new GPUPool, include hfTokenSecretRef in the resource:
apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: b200-gpu-pool
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: B200
  hfTokenSecretRef:
    name: huggingface
    key: token
  workloads:
    inference: true
Apply it with:
kubectl apply -f gpu-pool.yaml

Modify an existing GPU pool

If you already have a GPUPool, patch it to attach the Hugging Face token secret:
kubectl -n lilac-system patch gpupool b200-gpu-pool --type merge -p '{
  "spec": {
    "hfTokenSecretRef": {
      "name": "huggingface",
      "key": "token"
    }
  }
}'

Configuration Reference

nodeSelector

Standard Kubernetes label selector. Only nodes matching these labels are included in the pool.
nodeSelector:
  nvidia.com/gpu.product: B200    # GPU model
  topology.kubernetes.io/zone: us-east-1a  # Optional: limit to a zone

capacity

Control how much of your GPU fleet Lilac can use.
| Field | Type | Description |
| --- | --- | --- |
| maxGPUs | integer | Maximum number of GPUs Lilac can use across all nodes |
| maxUtilizationPct | integer (0–100) | Maximum percentage of matching GPUs Lilac can consume. If omitted, no percentage cap is applied |
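
For reference, this is the capacity block as it appears in the fuller manifest under Quickstart. The values are illustrative. This page does not state how the two limits combine when both are set, so check the pool status after applying rather than assuming which cap takes effect.

```yaml
capacity:
  maxGPUs: 64              # absolute cap across all matching nodes
  maxUtilizationPct: 75    # percentage cap on matching GPUs; omit for no percentage cap
```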

schedule

Define when GPUs are available for Lilac workloads.
| Mode | Behavior |
| --- | --- |
| always | GPUs are always available (respecting capacity limits) |
| scheduled | GPUs are only available during defined time windows |
schedule:
  mode: scheduled
  timezone: America/New_York
  windows:
    - days: [mon, tue, wed, thu, fri]
      start: "18:00"
      end: "08:00"
    - days: [sat, sun]    # all day — omit start/end
Use mode: always if you have dedicated GPUs that aren’t used for other workloads. Use mode: scheduled to share GPUs between your workloads (daytime) and Lilac (evenings/weekends).
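
For dedicated GPUs, the schedule block reduces to a single field, as in the dedicated-a100s example under Multiple Pools. A minimal sketch, assuming timezone and windows are not needed in this mode:

```yaml
schedule:
  mode: always    # no timezone or windows shown; assumed unnecessary here
```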

preemption

Controls what happens when your workloads need GPUs back.
| Field | Type | Description |
| --- | --- | --- |
| gracePeriod | duration | Time given to inference pods to finish in-flight requests before termination |
| priority | string | tenant means your workloads always take priority |
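
The preemption block from the manifests on this page, annotated with the descriptions above. The 30s value is the one used in the examples, not a documented default.

```yaml
preemption:
  gracePeriod: 30s    # time for inference pods to finish in-flight requests
  priority: tenant    # your workloads always take priority
```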

cache

Configures a shared Hugging Face model cache on each node in the pool. Omitting this block disables caching, so vLLM pods download model weights from Hugging Face on every cold start.
| Field | Type | Description |
| --- | --- | --- |
| enabled | boolean | Enable the shared model cache and cache pruner. Defaults to true when cache is configured |
| capacity | quantity | Default per-node cache size. Use 1000Gi as a baseline, or more for larger GPUs such as H200s, B200s, and B300s |
| retention.maxAge | duration | Evict cached models older than this duration. Defaults to 720h |
| overrides | array | Per-node cache capacity overrides selected by node labels |
cache:
  enabled: true
  capacity: 1000Gi
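
A larger cache with an explicit retention period might look like the sketch below. The nesting of retention under cache is inferred from the dotted field name retention.maxAge in the table, and 2000Gi stands in for the 2 TB recommendation in the same way 1000Gi stands in for 1 TB. The overrides field is left out because its item schema is not described on this page.

```yaml
cache:
  enabled: true
  capacity: 2000Gi     # larger cache for H200/B200/B300-class nodes
  retention:
    maxAge: 720h       # the documented default, written out explicitly
```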

hfTokenSecretRef

References the Kubernetes Secret key that stores your Hugging Face access token.
| Field | Type | Description |
| --- | --- | --- |
| name | string | Secret name in the same namespace as the GPUPool |
| key | string | Secret key containing the Hugging Face token |
hfTokenSecretRef:
  name: huggingface
  key: token

workloads

Toggle which workload types this pool accepts.
| Field | Type | Description |
| --- | --- | --- |
| inference | boolean | Allow inference workloads on this pool |

Multiple Pools

You can create multiple GPU pools for different hardware or schedules:
# Pool for A100 GPUs — always available
apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: dedicated-a100s
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100
  capacity:
    maxGPUs: 4
  schedule:
    mode: always
  preemption:
    gracePeriod: 30s
    priority: tenant
  workloads:
    inference: true
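
A second pool with a different schedule can sit alongside it, either in its own file or in the same file after a YAML document separator. The sketch below is hypothetical: the pool name weekend-h100s and the H100 label value are assumptions for illustration, so substitute the labels your nodes actually carry.

```yaml
---
# Hypothetical pool for H100 GPUs, available on weekends only
apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: weekend-h100s                  # hypothetical name
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: H100       # assumed label value
  schedule:
    mode: scheduled
    timezone: America/New_York
    windows:
      - days: [sat, sun]               # all day: start/end omitted
  preemption:
    gracePeriod: 30s
    priority: tenant
  workloads:
    inference: true
```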

Checking Pool Status

List the pools in the namespace to see each pool's phase and GPU usage:
kubectl get gpupool -n lilac-system
NAME              PHASE     GPUS   IDLE   WORKLOADS   AGE
b200-gpu-pool     Active    8      6      3           2d
dedicated-a100s   Active    4      4      2           1d
For detailed status:
kubectl describe gpupool b200-gpu-pool -n lilac-system