Supported Models

Available Models

Lilac currently supports the following models.

Model	Model ID	Context Length	Quantization	Input Price	Cache Read Price	Output Price
Kimi K2.6	`moonshotai/kimi-k2.6`	262,144 tokens	INT4	$0.70 / M tokens	$0.20 / M tokens	$3.50 / M tokens
GLM 5.2	`zai-org/glm-5.2`	524,288 tokens	FP8/NVFP4	$0.90 / M tokens	$0.27 / M tokens	$3.00 / M tokens
Gemma 4	`google/gemma-4-31b-it`	262,100 tokens	FP8	$0.11 / M tokens	—	$0.35 / M tokens
MiniMax M3	`minimaxai/minimax-m3`	1,048,576 tokens	FP8	$0.28 / M tokens	$0.05 / M tokens	$1.10 / M tokens

Cache read is the rate for repeated input tokens served from cache. It’s billed at a lower rate than standard input tokens on supported models. Models that don’t support cached input tokens are marked with —.
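
As a rough illustration of how the three rates combine, the sketch below estimates the cost of a single request from the prices in the table above. It assumes that cached tokens are a subset of the input tokens and are billed at the cache read rate instead of the input rate, and that models without cache support bill all input at the input rate. The function name `estimate_cost_usd` is hypothetical and is not part of the Lilac API.

```python
# USD per million tokens, copied from the pricing table above.
# cache_read is None for models marked as not supporting cached input.
PRICES = {
    "moonshotai/kimi-k2.6": {"input": 0.70, "cache_read": 0.20, "output": 3.50},
    "zai-org/glm-5.2": {"input": 0.90, "cache_read": 0.27, "output": 3.00},
    "google/gemma-4-31b-it": {"input": 0.11, "cache_read": None, "output": 0.35},
    "minimaxai/minimax-m3": {"input": 0.28, "cache_read": 0.05, "output": 1.10},
}


def estimate_cost_usd(model_id, input_tokens, output_tokens, cached_tokens=0):
    """Estimate the cost of one request (hypothetical helper).

    Assumption: cached_tokens counts the part of input_tokens served from cache.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    price = PRICES[model_id]
    # Assumption: without cache support, every input token is billed at the input rate.
    cache_rate = price["cache_read"] if price["cache_read"] is not None else price["input"]
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * price["input"]
        + cached_tokens * cache_rate
        + output_tokens * price["output"]
    )
    return total / 1_000_000


cost = estimate_cost_usd("moonshotai/kimi-k2.6", 100_000, 2_000, cached_tokens=80_000)
print(f"${cost:.4f}")
```

With these numbers, 20,000 uncached input tokens, 80,000 cached tokens, and 2,000 output tokens on Kimi K2.6 come to roughly $0.037.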

More models are coming soon. Request a model by emailing contact@getlilac.com.

Kimi K2.6

Moonshot AI’s flagship multimodal reasoning model. 1T total parameters (32B activated) with a Mixture-of-Experts architecture.

Kimi K2.6 on Hugging Face

Model card, benchmarks, and deployment guides.

Capabilities

Capability	Status	Details
Text input	Supported	Chat, instructions, system prompts
Image input	Supported	Native multimodal — pass images via `image_url` in messages
Text output	Supported	Completions, structured JSON, tool calls
Reasoning (thinking)	On by default	Chain-of-thought returned in `reasoning` field. Kimi K2.6’s Moonshot chat template honors `chat_template_kwargs: {"thinking": false}` (the `enable_thinking` key is ignored here). For forward compatibility across models, see the Reasoning section.
Tool calling	Supported	Function definitions with automatic argument extraction
Structured output	Supported	`response_format` with `json_object` or `json_schema`
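
A minimal sketch of a structured output request follows. It assumes Lilac accepts the OpenAI-style `json_schema` payload (a `name` plus a `schema` object); the table above only names the two `response_format` types, so treat the exact shape as an assumption. The `city_info` schema and the prompt are made up for illustration.

```python
import json

# Assumption: OpenAI-style json_schema payload. The schema is a made-up example.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "city_info",
        "schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "population": {"type": "integer"},
            },
            "required": ["city", "population"],
            "additionalProperties": False,
        },
    },
}

request_kwargs = {
    "model": "moonshotai/kimi-k2.6",
    "messages": [
        {"role": "user", "content": "Return the capital of France and its population as JSON."}
    ],
    "response_format": response_format,
}

# With a configured client, the request would be sent as:
#   response = client.chat.completions.create(**request_kwargs)
#   data = json.loads(response.choices[0].message.content)
print(json.dumps(request_kwargs, indent=2))
```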

Recommended Parameters

From the Kimi K2.6 model card:

Mode	Temperature	Top P
Thinking (default)	`1.0`	`0.95`
Instant (thinking off)	`0.6`	`0.95`
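
The sketch below pairs each mode with its recommended sampling parameters. It takes the values from the table above and the `thinking` key from the Capabilities table; the helper name `kimi_request_kwargs` is hypothetical.

```python
def kimi_request_kwargs(thinking=True):
    """Build keyword arguments for a Kimi K2.6 request (hypothetical helper)."""
    kwargs = {
        "model": "moonshotai/kimi-k2.6",
        # Recommended values from the table above.
        "temperature": 1.0 if thinking else 0.6,
        "top_p": 0.95,
    }
    if not thinking:
        # Thinking is on by default, so only the "off" case needs an explicit flag.
        kwargs["extra_body"] = {"chat_template_kwargs": {"thinking": False}}
    return kwargs


instant = kimi_request_kwargs(thinking=False)
print(instant)
```

With a configured client, the result can be unpacked into the call, for example `client.chat.completions.create(messages=messages, **instant)`.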

Vision

Kimi K2.6 natively supports image inputs. Pass images as base64 data URIs or URLs in the content array. The Python example comes first, followed by the equivalent cURL request:

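# Assumes `client` is an OpenAI SDK client configured for Lilac,
# as shown under "Listing Models via API".
response = client.chat.completions.create(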
    model="moonshotai/kimi-k2.6",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg",
                        "detail": "auto"
                    }
                }
            ]
        }
    ],
)

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/kimi-k2.6",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg", "detail": "auto"}}
        ]
      }
    ]
  }'

You can also pass base64-encoded images:

import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"}
                }
            ]
        }
    ],
)

GLM 5.2

Z.ai’s GLM 5.2 is a frontier-scale MoE reasoning and coding model for long-horizon agentic work. Lilac serves GLM 5.2 with a 524k-token context window, tool calling, structured output, and configurable reasoning effort.

GLM 5.2 on Hugging Face

Model card, benchmarks, and deployment guides.

Capabilities

Capability	Status	Details
Text input	Supported	Chat, instructions, system prompts
Text output	Supported	Completions, structured JSON, tool calls
Image input	Not supported	GLM 5.2 is text-only
Reasoning (thinking)	On by default	Chain-of-thought returned in `reasoning` field. GLM 5.2 honors `chat_template_kwargs: {"enable_thinking": false}` to disable thinking, and supports a `reasoning_effort` control with two levels — `max` and `high`. See the Reasoning section for forward-compatible toggles.
Tool calling	Supported	Function definitions with automatic argument extraction — strong performance on agentic tasks
Structured output	Supported	`response_format` with `json_object` or `json_schema`

Reasoning effort

GLM 5.2 exposes two reasoning effort levels:

`high`: the default when reasoning is enabled. A good balance of quality, latency, and token usage for most coding and reasoning tasks.
`max`: the highest-quality reasoning, for long-horizon agentic and complex problem-solving tasks. Higher latency and token usage than `high`.

Notes:

Disable thinking entirely with `chat_template_kwargs.enable_thinking: false`. When thinking is disabled, `reasoning_effort` has no effect.
`reasoning_effort` can be sent either as a top-level field (OpenAI-style) or inside `chat_template_kwargs` (vLLM extra-body form). Both are accepted.
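
The rules above can be sketched as a small request-body builder. This is an illustration only: the helper name `glm_request_body` and the client-side validation are assumptions, and the server may handle invalid values differently.

```python
GLM_EFFORT_LEVELS = {"high", "max"}


def glm_request_body(messages, effort="high", enable_thinking=True, top_level=True):
    """Build a JSON body for a GLM 5.2 chat completion (hypothetical helper)."""
    body = {"model": "zai-org/glm-5.2", "messages": messages}
    if not enable_thinking:
        # reasoning_effort has no effect when thinking is disabled, so it is omitted.
        body["chat_template_kwargs"] = {"enable_thinking": False}
        return body
    if effort not in GLM_EFFORT_LEVELS:
        raise ValueError(f"unsupported reasoning effort: {effort!r}")
    if top_level:
        body["reasoning_effort"] = effort  # OpenAI-style top-level field
    else:
        body["chat_template_kwargs"] = {"reasoning_effort": effort}  # vLLM extra-body form
    return body


messages = [{"role": "user", "content": "Plan a migration for this service."}]
print(glm_request_body(messages, effort="max"))
print(glm_request_body(messages, enable_thinking=False))
```

The returned dictionary is the JSON body for a POST to `/v1/chat/completions`.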

Example requests

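The examples below are, in order: high reasoning via the top-level field (cURL), high reasoning via `chat_template_kwargs` (cURL), thinking disabled (cURL), and high reasoning via `chat_template_kwargs` using the Python client.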

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/glm-5.2",
    "messages": [
      {"role": "user", "content": "Plan a migration for this service."}
    ],
    "reasoning_effort": "high"
  }'

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/glm-5.2",
    "messages": [
      {"role": "user", "content": "Plan a migration for this service."}
    ],
    "chat_template_kwargs": {
      "reasoning_effort": "high"
    }
  }'

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/glm-5.2",
    "messages": [
      {"role": "user", "content": "Answer directly: what is 2+2?"}
    ],
    "chat_template_kwargs": {
      "enable_thinking": false
    }
  }'

response = client.chat.completions.create(
    model="zai-org/glm-5.2",
    messages=[
        {"role": "user", "content": "Plan a migration for this service."}
    ],
    extra_body={
        "chat_template_kwargs": {
            "reasoning_effort": "high"
        }
    },
)

Thinking controls

Preserved thinking is off by default. GLM 5.2’s effective default on Lilac is to clear previous assistant thinking blocks between turns. To preserve thinking across turns, set `chat_template_kwargs.clear_thinking: false`. This is equivalent to vLLM-native chat-template control. Lilac does not currently consume Z.ai’s top-level `thinking` object; in particular, top-level `thinking.clear_thinking` is ignored. For conceptual background, see Z.ai’s preserved thinking docs.

The request bodies below show, in order: preserving thinking, max reasoning with preserved thinking, and disabling thinking.

{
  "model": "zai-org/glm-5.2",
  "messages": [...],
  "chat_template_kwargs": {
    "clear_thinking": false
  }
}

{
  "model": "zai-org/glm-5.2",
  "messages": [...],
  "chat_template_kwargs": {
    "enable_thinking": true,
    "reasoning_effort": "max",
    "clear_thinking": false
  }
}

{
  "model": "zai-org/glm-5.2",
  "messages": [...],
  "chat_template_kwargs": {
    "enable_thinking": false
  }
}

Gemma 4

Google’s open-weight multimodal model. 31B parameters with native support for text, image, and video inputs. 262K context window with FP8 precision. Released under the Gemma license.

Gemma 4 on Hugging Face

Model card, benchmarks, and deployment guides.

Capabilities

Capability	Status	Details
Text input	Supported	Chat, instructions, system prompts
Image input	Supported	Native multimodal — pass images via `image_url` in messages
Video input	Supported	Pass video frames as a sequence of images
Text output	Supported	Completions, structured JSON
Reasoning (thinking)	Off by default	Chain-of-thought returned in `reasoning` field when enabled. Gemma 4’s chat template honors `chat_template_kwargs: {"enable_thinking": true}` (the `thinking` key is ignored here). Unlike Kimi K2.6 and GLM 5.2, thinking is off by default — you must opt in. See the Reasoning section for the forward-compatible form.
Tool calling	Supported	Function definitions with automatic argument extraction
Structured output	Supported	`response_format` with `json_object` or `json_schema`

Gemma 4 chain-of-thought may leak into content. vLLM’s Gemma 4 reasoning parser can fail to populate the `reasoning` field when special tokens are stripped before the parser runs; see vllm-project/vllm#38855. When reasoning is enabled, clients that require a clean split should post-process by treating text inside `<|channel|>thought ... <|channel|>` markers as reasoning.
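
One way to do that post-processing is sketched below. It takes the marker format literally from the paragraph above (`<|channel|>thought`, then the reasoning, then a closing `<|channel|>`); if the markers that reach your client differ, the pattern needs adjusting. The function name `split_gemma_reasoning` is hypothetical.

```python
import re

# Assumption: markers appear exactly as described above.
_THOUGHT_RE = re.compile(r"<\|channel\|>thought(.*?)<\|channel\|>", re.DOTALL)


def split_gemma_reasoning(content):
    """Return (reasoning, visible_text) for a Gemma 4 message body."""
    # Collect every marked span as reasoning.
    reasoning_parts = [part.strip() for part in _THOUGHT_RE.findall(content)]
    # Whatever remains after removing the marked spans is the visible answer.
    visible = _THOUGHT_RE.sub("", content).strip()
    return "\n".join(reasoning_parts), visible


reasoning, answer = split_gemma_reasoning(
    "<|channel|>thought\nAssume finitely many primes.<|channel|>There are infinitely many primes."
)
print(reasoning)
print(answer)
```

Content without markers passes through unchanged, with an empty reasoning string.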

Structured output caveat. On current vLLM builds, combining `--reasoning-parser gemma4` with `enable_thinking: false` can silently disable xgrammar-backed structured output; see vllm-project/vllm#39130. If you rely on `response_format: json_schema` with Gemma 4, leave thinking enabled or validate output client-side.
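
A minimal sketch of client-side validation using only the standard library follows. It checks that the output parses as a JSON object and that the required keys have the expected Python types; it is not a full JSON Schema validator, and the helper name `parse_and_check` is hypothetical.

```python
import json


def parse_and_check(raw, required):
    """Parse model output and verify required keys and types.

    `required` maps key names to expected Python types, e.g. {"city": str}.
    Raises ValueError on malformed JSON or a failed check.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in required.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"wrong type for key: {key}")
    return data


result = parse_and_check('{"city": "Paris", "population": 2100000}', {"city": str, "population": int})
print(result)
```

On a `ValueError`, a client might retry the request or fall back to a request with thinking enabled.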

Enabling reasoning

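Gemma 4 is the only model in the catalog where reasoning is off by default. To turn it on, use the forward-compatible form recommended in the Reasoning section, which sets both the `thinking` and `enable_thinking` keys. The Python example comes first, followed by the equivalent cURL request: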

response = client.chat.completions.create(
    model="google/gemma-4-31b-it",
    messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "enable_thinking": True,
        }
    },
)

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31b-it",
    "messages": [
      {"role": "user", "content": "Prove there are infinitely many primes."}
    ],
    "chat_template_kwargs": {
      "thinking": true,
      "enable_thinking": true
    }
  }'

Vision

Gemma 4 natively supports image inputs. Pass images as base64 data URIs or URLs in the content array. The Python example comes first, followed by the equivalent cURL request:

response = client.chat.completions.create(
    model="google/gemma-4-31b-it",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg",
                        "detail": "auto"
                    }
                }
            ]
        }
    ],
)

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31b-it",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg", "detail": "auto"}}
        ]
      }
    ]
  }'

Video

Gemma 4 can process video by accepting a sequence of frames as images. Extract frames from your video and pass them as multiple `image_url` entries. The Python example comes first, followed by a cURL request with placeholder frame data:

import base64
import cv2

# Extract frames from video
video = cv2.VideoCapture("video.mp4")
frames = []
while video.isOpened():
    ret, frame = video.read()
    if not ret:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    frames.append(base64.b64encode(buffer).decode())
video.release()

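# Sample up to 8 evenly spaced frames to fit the context window.
# max(1, ...) avoids a zero slice step when the video has fewer than 8 frames.
step = max(1, len(frames) // 8)
sampled = frames[::step][:8]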

response = client.chat.completions.create(
    model="google/gemma-4-31b-it",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what happens in this video."},
                *[
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{frame}"}
                    }
                    for frame in sampled
                ]
            ]
        }
    ],
)

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31b-it",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe what happens in this video."},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,FRAME_1_BASE64"}},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,FRAME_2_BASE64"}}
        ]
      }
    ]
  }'
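
The frame sampling step in the Python example above can also be written as a standalone function that picks evenly spaced frame indices. This is a sketch: the name `sample_indices` and the default of 8 frames are illustrative choices, not a Lilac requirement.

```python
def sample_indices(total_frames, max_frames=8):
    """Return up to max_frames evenly spaced indices in range(total_frames)."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    # Spread the picks across the whole clip, starting at frame 0.
    return [i * total_frames // max_frames for i in range(max_frames)]


print(sample_indices(100))
print(sample_indices(3))
```

Selecting indices first makes it possible to decode and encode only the chosen frames instead of holding every frame of a long video in memory.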

MiniMax M3

MiniMax M3 is a frontier MoE model for coding, agents, and long-context reasoning. Lilac serves MiniMax M3 with a 1M-token context window, tool calling, structured output, and per-request thinking modes.

MiniMax M3 on Hugging Face

Model card, benchmarks, and deployment guides.

Capabilities

Capability	Status	Details
Text input	Supported	Chat, instructions, system prompts
Text output	Supported	Completions, structured JSON, tool calls
Image input	Supported	Native multimodal — pass images via `image_url` in messages
Video input	Supported	Pass video frames as a sequence of images
Reasoning (thinking)	Supported	Per-request thinking modes via `chat_template_kwargs.thinking_mode`. See the Reasoning section for details.
Tool calling	Supported	Function definitions with automatic argument extraction
Structured output	Supported	`response_format` with `json_object` or `json_schema`

Limits

Limit	Value
Context length	1,048,576 tokens

Thinking modes

MiniMax M3 uses a single `thinking_mode` control instead of a boolean toggle:

Mode	Behavior
`adaptive`	Default when unset. The model decides whether to think based on the request.
`enabled`	Always think. Best for complex reasoning and multi-step agents.
`disabled`	No thinking. Best for latency-sensitive requests.

`thinking_mode` is passed inside `chat_template_kwargs`:

{
  "chat_template_kwargs": {
    "thinking_mode": "adaptive"
  }
}
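
A small helper can catch a mistyped mode before the request is sent. This is a sketch: the helper name `minimax_extra_body` is hypothetical, and the client-side check is an addition, not documented Lilac behavior.

```python
THINKING_MODES = ("adaptive", "enabled", "disabled")


def minimax_extra_body(thinking_mode=None):
    """Build the extra_body payload for a MiniMax M3 request (hypothetical helper)."""
    if thinking_mode is None:
        # Leaving the mode unset falls back to the documented default, adaptive.
        return {}
    if thinking_mode not in THINKING_MODES:
        raise ValueError(f"unknown thinking_mode: {thinking_mode!r}")
    return {"chat_template_kwargs": {"thinking_mode": thinking_mode}}


print(minimax_extra_body("disabled"))
```

With the OpenAI Python client, the result is passed as `extra_body=minimax_extra_body("disabled")`.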

Example requests

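The examples below are, in order: adaptive thinking (the default), forced thinking, and thinking disabled as cURL requests, followed by forced thinking using the Python client.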

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimaxai/minimax-m3",
    "messages": [
      {"role": "user", "content": "Analyze this large codebase summary."}
    ],
    "chat_template_kwargs": {
      "thinking_mode": "adaptive"
    }
  }'

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimaxai/minimax-m3",
    "messages": [
      {"role": "user", "content": "Solve this multi-step planning problem."}
    ],
    "chat_template_kwargs": {
      "thinking_mode": "enabled"
    }
  }'

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimaxai/minimax-m3",
    "messages": [
      {"role": "user", "content": "Give me a concise answer."}
    ],
    "chat_template_kwargs": {
      "thinking_mode": "disabled"
    }
  }'

response = client.chat.completions.create(
    model="minimaxai/minimax-m3",
    messages=[
        {"role": "user", "content": "Solve this multi-step planning problem."}
    ],
    extra_body={
        "chat_template_kwargs": {
            "thinking_mode": "enabled"
        }
    },
)

Vision

MiniMax M3 natively supports image inputs. Pass images as base64 data URIs or URLs in the content array. The Python example comes first, followed by the equivalent cURL request:

response = client.chat.completions.create(
    model="minimaxai/minimax-m3",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg",
                        "detail": "auto"
                    }
                }
            ]
        }
    ],
)

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimaxai/minimax-m3",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg", "detail": "auto"}}
        ]
      }
    ]
  }'

Video

MiniMax M3 can process video by accepting a sequence of frames as images. Extract frames from your video and pass them as multiple `image_url` entries. The Python example comes first, followed by a cURL request with placeholder frame data:

import base64
import cv2

# Extract frames from video
video = cv2.VideoCapture("video.mp4")
frames = []
while video.isOpened():
    ret, frame = video.read()
    if not ret:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    frames.append(base64.b64encode(buffer).decode())
video.release()

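# Sample up to 8 evenly spaced frames to fit the context window.
# max(1, ...) avoids a zero slice step when the video has fewer than 8 frames.
step = max(1, len(frames) // 8)
sampled = frames[::step][:8]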

response = client.chat.completions.create(
    model="minimaxai/minimax-m3",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what happens in this video."},
                *[
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{frame}"}
                    }
                    for frame in sampled
                ]
            ]
        }
    ],
)

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimaxai/minimax-m3",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe what happens in this video."},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,FRAME_1_BASE64"}},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,FRAME_2_BASE64"}}
        ]
      }
    ]
  }'

Listing Models via API

The examples below list the available models using the Python SDK, the JavaScript SDK, and cURL, in that order.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.getlilac.com/v1",
    api_key="your-lilac-api-key",
)

models = client.models.list()
for model in models:
    print(model.id)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.getlilac.com/v1",
  apiKey: "your-lilac-api-key",
});

const models = await client.models.list();
for await (const model of models) {
  console.log(model.id);
}

curl https://api.getlilac.com/v1/models \
  -H "Authorization: Bearer your-lilac-api-key"
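
The model list can be used to check that a model ID is available before sending requests. The sketch below picks the first preferred model that appears in the list; the helper name `pick_model` and the order of preference are illustrative assumptions.

```python
def pick_model(available_ids, preferred_ids):
    """Return the first ID in preferred_ids that is also in available_ids."""
    available = set(available_ids)
    for model_id in preferred_ids:
        if model_id in available:
            return model_id
    raise LookupError("none of the preferred models are available")


# With a configured client: available_ids = [m.id for m in client.models.list()]
available_ids = ["moonshotai/kimi-k2.6", "google/gemma-4-31b-it"]
print(pick_model(available_ids, ["zai-org/glm-5.2", "google/gemma-4-31b-it"]))
```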
