GPU Pool Configuration

A GPU pool is a custom resource that tells the operator which GPUs in your cluster are available for Lilac inference workloads. You control which nodes are included, how many GPUs Lilac can use, during which hours, and how preemption works.

Creating a GPU Pool

Quickstart

Save a basic GPUPool manifest as gpu-pool.yaml, then apply it to your cluster:

apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: b200-gpu-pool
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: B200
  cache:
    enabled: true
    capacity: 1000Gi
  workloads:
    inference: true

kubectl apply -f gpu-pool.yaml

Fully featured example

Use a fuller manifest when you want to cap the number of GPUs, define availability windows, or configure preemption behavior:

apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: b200-gpu-pool
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: B200
  capacity:
    maxGPUs: 64
    maxUtilizationPct: 75
  schedule:
    mode: scheduled
    timezone: America/New_York
    windows:
      - days: [mon, tue, wed, thu, fri]
        start: "18:00"
        end: "08:00"
      - days: [sat, sun]    # all day
  preemption:
    gracePeriod: 30s
    priority: tenant
  cache:
    enabled: true
    capacity: 1000Gi
  hfTokenSecretRef:
    name: huggingface
    key: token
  workloads:
    inference: true

Model Cache

Setting up a model cache is optional but highly recommended. The cache keeps downloaded model weights on each node, so repeat cold starts do not need to fetch everything from Hugging Face again. With a warm cache, cold-start time can be reduced by up to 80%, which lets idle GPUs start serving workloads and earning money sooner.

A cache of about 1 TB (`capacity: 1000Gi`) is a good baseline and is what the Quickstart example above uses. For larger GPUs such as H200s, B200s, and B300s, use 2 TB or more for the fastest cold starts, because those GPUs typically serve larger models.
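Kubernetes quantities with the `Gi` suffix are binary (1 Gi = 2^30 bytes), so `1000Gi` is slightly more than one decimal terabyte. The Python sketch below is only a convenience for converting between the two units when choosing a `capacity` value. The helper names `gi_to_tb` and `tb_to_gi` are made up for illustration and are not part of Lilac or Kubernetes.

```python
GIB = 2**30   # bytes in one Gi (binary unit used by Kubernetes quantities)
TB = 10**12   # bytes in one decimal terabyte

def gi_to_tb(gi: int) -> float:
    """Convert a Gi amount (the 1000 in `1000Gi`) to decimal terabytes."""
    return gi * GIB / TB

def tb_to_gi(tb: float) -> int:
    """Smallest whole Gi amount that holds at least `tb` decimal terabytes."""
    return -(-int(tb * TB) // GIB)  # integer ceiling division

print(f"1000Gi is about {gi_to_tb(1000):.2f} TB")
print(f"2 TB needs at least {tb_to_gi(2)}Gi")
```

In practice a round value such as `2000Gi` is easier to read than an exact conversion and sits comfortably above 2 TB.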

Starting from scratch

If you are creating a new GPUPool, include cache in the resource:

apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: b200-gpu-pool
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: B200
  cache:
    enabled: true
    capacity: 1000Gi
  workloads:
    inference: true

Modify an existing GPU pool

If you already have a GPUPool, patch it to enable the cache:

kubectl -n lilac-system patch gpupool b200-gpu-pool --type merge -p '{
  "spec": {
    "cache": {
      "enabled": true,
      "capacity": "1000Gi"
    }
  }
}'
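With `--type merge`, kubectl sends a JSON merge patch (RFC 7396): objects are merged key by key, so fields you leave out of the patch, such as `nodeSelector` and `workloads`, are kept as they are. The Python sketch below models that merge rule locally so you can preview the result. `merge_patch` is an illustrative helper, not a Lilac or kubectl API, and the `existing` object is a trimmed-down stand-in for a real GPUPool.

```python
def merge_patch(target, patch):
    """Apply a JSON merge patch (RFC 7396 rules) and return the merged value."""
    if not isinstance(patch, dict):
        return patch                      # scalars and lists replace the target
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)         # null in the patch deletes the key
        else:
            result[key] = merge_patch(result.get(key), value)
    return result

# Stand-in for the pool as it exists before patching (assumed for the demo).
existing = {
    "spec": {
        "nodeSelector": {"nvidia.com/gpu.product": "B200"},
        "workloads": {"inference": True},
    }
}
patch = {"spec": {"cache": {"enabled": True, "capacity": "1000Gi"}}}

patched = merge_patch(existing, patch)
print(sorted(patched["spec"]))
```

One consequence of these rules is that lists are replaced wholesale rather than merged, which matters if you later patch a list-valued field.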

Hugging Face Token

Setting up a Hugging Face token is optional, but highly recommended. The token avoids Hugging Face rate limits for unauthenticated downloads and allows models to download at full speed. Generate a Hugging Face access token, then create a Kubernetes Secret in the same namespace as your GPUPool:

kubectl -n {GPU pool namespace} create secret generic huggingface --from-literal=token={hf_token}

For the examples below, the GPU pool namespace is lilac-system.

Starting from scratch

If you are creating a new GPUPool, include hfTokenSecretRef in the resource:

apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: b200-gpu-pool
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: B200
  hfTokenSecretRef:
    name: huggingface
    key: token
  workloads:
    inference: true

Apply it with:

kubectl apply -f gpu-pool.yaml

Modify an existing GPU pool

If you already have a GPUPool, patch it to attach the Hugging Face token secret:

kubectl -n lilac-system patch gpupool b200-gpu-pool --type merge -p '{
  "spec": {
    "hfTokenSecretRef": {
      "name": "huggingface",
      "key": "token"
    }
  }
}'

Configuration Reference

`nodeSelector`

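Standard Kubernetes node label selector. Only nodes whose labels match every entry are included in the pool. Label values are matched exactly, so use the values your nodes actually report (you can list them with kubectl get nodes --show-labels).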

nodeSelector:
  nvidia.com/gpu.product: B200    # GPU model
  topology.kubernetes.io/zone: us-east-1a  # Optional: limit to a zone

`capacity`

Control how much of your GPU fleet Lilac can use.

| Field | Type | Description |
| --- | --- | --- |
| `maxGPUs` | integer | Maximum number of GPUs Lilac can use across all nodes |
| `maxUtilizationPct` | integer (0–100) | Maximum percentage of matching GPUs Lilac can consume. If omitted, no percentage cap is applied |
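How the two limits combine is not spelled out here. One natural reading is that both apply and the stricter one wins. The Python sketch below encodes that reading, and also rounds the percentage down and treats an omitted `maxGPUs` as "no absolute cap". All three points are assumptions, not documented operator behavior, and `effective_gpu_cap` is a hypothetical helper name.

```python
from typing import Optional

def effective_gpu_cap(matching_gpus: int,
                      max_gpus: Optional[int] = None,
                      max_utilization_pct: Optional[int] = None) -> int:
    """Estimate how many GPUs a pool could hand to Lilac.

    Assumption: the lower of the two limits wins, and the percentage
    limit is rounded down to a whole GPU.
    """
    cap = matching_gpus
    if max_utilization_pct is not None:
        if not 0 <= max_utilization_pct <= 100:
            raise ValueError("max_utilization_pct must be between 0 and 100")
        cap = min(cap, matching_gpus * max_utilization_pct // 100)
    if max_gpus is not None:
        cap = min(cap, max_gpus)
    return cap

# Values from the fully featured example (maxGPUs: 64, maxUtilizationPct: 75)
# applied to two hypothetical fleet sizes.
print(effective_gpu_cap(80, max_gpus=64, max_utilization_pct=75))
print(effective_gpu_cap(128, max_gpus=64, max_utilization_pct=75))
```

Under this reading, the percentage cap is the binding limit on the 80-GPU fleet, while `maxGPUs` is the binding limit on the 128-GPU fleet.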

`schedule`

Define when GPUs are available for Lilac workloads.

| Mode | Behavior |
| --- | --- |
| `always` | GPUs are always available (respecting capacity limits) |
| `scheduled` | GPUs are only available during defined time windows |

schedule:
  mode: scheduled
  timezone: America/New_York
  windows:
    - days: [mon, tue, wed, thu, fri]
      start: "18:00"
      end: "08:00"
    - days: [sat, sun]    # all day — omit start/end

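Use `mode: always` if you have dedicated GPUs that aren't used for other workloads. Use `mode: scheduled` to share GPUs between your own workloads (daytime) and Lilac (evenings and weekends). In the example above, the weekday window ends at an earlier clock time than it starts (18:00 to 08:00), so it runs overnight.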
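To reason about which hours a set of windows covers, it can help to model them in a few lines of Python. The sketch below assumes that a window whose `end` is not later than its `start` wraps past midnight and belongs to the day it starts on, and that a window without `start`/`end` covers the whole day. These semantics are assumptions made for illustration; the operator's actual evaluation may differ. `in_window` and `is_available` are hypothetical names.

```python
from datetime import datetime, time

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]  # datetime.weekday() order

# The windows from the schedule example above.
WINDOWS = [
    {"days": ["mon", "tue", "wed", "thu", "fri"], "start": "18:00", "end": "08:00"},
    {"days": ["sat", "sun"]},  # no start/end: all day
]

def _parse(hhmm: str) -> time:
    hours, minutes = hhmm.split(":")
    return time(int(hours), int(minutes))

def in_window(local_now: datetime, window: dict) -> bool:
    """True if local_now (already in the pool's timezone) falls in the window."""
    today = DAYS[local_now.weekday()]
    yesterday = DAYS[(local_now.weekday() - 1) % 7]
    days = window["days"]
    if "start" not in window or "end" not in window:
        return today in days                      # all-day window
    start, end = _parse(window["start"]), _parse(window["end"])
    now = local_now.time()
    if start < end:                               # same-day window
        return today in days and start <= now < end
    # Overnight window: evening part today, morning part carried from yesterday.
    return (today in days and now >= start) or (yesterday in days and now < end)

def is_available(local_now: datetime, windows=WINDOWS) -> bool:
    return any(in_window(local_now, w) for w in windows)

print(is_available(datetime(2024, 1, 1, 19, 0)))  # a Monday evening
print(is_available(datetime(2024, 1, 1, 12, 0)))  # a Monday at noon
```

Under these assumptions, Monday between 00:00 and 08:00 is not covered, because Sunday is not in the weekday window and the weekend window ends with the day. If you want that gap covered, check how the operator treats it before relying on this model. To get the current time in the pool's timezone, `datetime.now(ZoneInfo("America/New_York"))` from the standard `zoneinfo` module is one option.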

`preemption`

Controls what happens when your workloads need GPUs back.

| Field | Type | Description |
| --- | --- | --- |
| `gracePeriod` | duration | Time given to inference pods to finish in-flight requests before termination |
| `priority` | string | `tenant` means your workloads always take priority |

`cache`

Configures a shared Hugging Face model cache on each node in the pool. Omitting this block disables caching, so vLLM pods download model weights from Hugging Face on every cold start.

| Field | Type | Description |
| --- | --- | --- |
| `enabled` | boolean | Enable the shared model cache and cache pruner. Defaults to `true` when `cache` is configured |
| `capacity` | quantity | Default per-node cache size. Use `1000Gi` as a baseline, or more for larger GPUs such as H200s, B200s, and B300s |
| `retention.maxAge` | duration | Evict cached models older than this duration. Defaults to `720h` |
| `overrides` | array | Per-node cache capacity overrides selected by node labels |

cache:
  enabled: true
  capacity: 1000Gi

`hfTokenSecretRef`

References the Kubernetes Secret key that stores your Hugging Face access token.

| Field | Type | Description |
| --- | --- | --- |
| `name` | string | Secret name in the same namespace as the `GPUPool` |
| `key` | string | Secret key containing the Hugging Face token |

hfTokenSecretRef:
  name: huggingface
  key: token

`workloads`

Toggle which workload types this pool accepts.

| Field | Type | Description |
| --- | --- | --- |
| `inference` | boolean | Allow inference workloads on this pool |

Multiple Pools

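You can create multiple GPU pools for different hardware or schedules. For example, this pool keeps a set of dedicated A100s always available and can run alongside b200-gpu-pool: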

# Pool for A100 GPUs — always available
apiVersion: gpu.getlilac.com/v1alpha1
kind: GPUPool
metadata:
  name: dedicated-a100s
  namespace: lilac-system
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100
  capacity:
    maxGPUs: 4
  schedule:
    mode: always
  preemption:
    gracePeriod: 30s
    priority: tenant
  workloads:
    inference: true

Checking Pool Status

List the pools in a namespace, along with their phase and GPU usage:

kubectl get gpupool -n lilac-system

Example output:

NAME              PHASE     GPUS   IDLE   WORKLOADS   AGE
b200-gpu-pool     Active    8      6      3           2d
dedicated-a100s   Active    4      4      2           1d

For detailed status:

kubectl describe gpupool b200-gpu-pool -n lilac-system
