Available Models

Lilac currently supports the following models. We’re actively adding more — reach out if there’s a model you’d like to see.
| Model | Model ID | Context Length | Quantization | Input Price | Cache Read Price | Output Price |
| --- | --- | --- | --- | --- | --- | --- |
| Kimi K2.6 | moonshotai/kimi-k2.6 | 262,144 tokens | INT4 | $0.70 / M tokens | $0.20 / M tokens | $3.50 / M tokens |
| GLM 5.1 | zai-org/glm-5.1 | 202,800 tokens | FP8 | $0.90 / M tokens | $0.27 / M tokens | $3.00 / M tokens |
| Gemma 4 | google/gemma-4-31b-it | 262,100 tokens | BF16 | $0.11 / M tokens | Not supported | $0.35 / M tokens |
| MiniMax M2.7 | minimaxai/minimax-m2.7 | 204,800 tokens | FP8 | $0.30 / M tokens | $0.055 / M tokens | $1.20 / M tokens |

Cache read is the rate for repeated input tokens served from cache. On supported models it is lower than the standard input rate. Models that don’t support cached input tokens are marked "Not supported" in the Cache Read Price column.
More models are coming soon. Request a model by emailing contact@getlilac.com.
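
To make the pricing concrete, here is a minimal cost-estimation sketch. The helper name `estimate_cost` and the `PRICES` dictionary are hypothetical (they are not part of any Lilac API), and the billing rule is an assumption drawn from the note above: cached input tokens are charged at the cache read rate and all other input tokens at the input rate.

```python
# Prices in USD per million tokens, copied from the table above.
# A cache_read of None means the model does not support cached input tokens.
PRICES = {
    "moonshotai/kimi-k2.6": {"input": 0.70, "cache_read": 0.20, "output": 3.50},
    "zai-org/glm-5.1": {"input": 0.90, "cache_read": 0.27, "output": 3.00},
    "google/gemma-4-31b-it": {"input": 0.11, "cache_read": None, "output": 0.35},
    "minimaxai/minimax-m2.7": {"input": 0.30, "cache_read": 0.055, "output": 1.20},
}


def estimate_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Estimate the cost of one request in USD (hypothetical helper)."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    price = PRICES[model]
    # Assumption: without cache support, every input token is billed at the input rate.
    cache_rate = price["cache_read"] if price["cache_read"] is not None else price["input"]
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * price["input"]
        + cached_tokens * cache_rate
        + output_tokens * price["output"]
    )
    return total / 1_000_000


cost = estimate_cost(
    "moonshotai/kimi-k2.6",
    input_tokens=100_000,
    output_tokens=2_000,
    cached_tokens=80_000,
)
print(f"${cost:.4f}")
```

In this example, 80,000 of the 100,000 input tokens are served from cache, so the estimate comes to roughly $0.037 instead of the $0.077 the same request would cost with no cache hits.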

Kimi K2.6

Moonshot AI’s flagship multimodal reasoning model. 1T total parameters (32B activated) with a Mixture-of-Experts architecture.

Kimi K2.6 on Hugging Face

Model card, benchmarks, and deployment guides.

Capabilities

| Capability | Status | Details |
| --- | --- | --- |
| Text input | Supported | Chat, instructions, system prompts |
| Image input | Supported | Native multimodal; pass images via image_url in messages |
| Text output | Supported | Completions, structured JSON, tool calls |
| Reasoning (thinking) | On by default | Chain-of-thought returned in reasoning field. Kimi K2.6’s Moonshot chat template honors chat_template_kwargs: {"thinking": false} (the enable_thinking key is ignored here). For forward compatibility across models, see the Reasoning section. |
| Tool calling | Supported | Function definitions with automatic argument extraction |
| Structured output | Supported | response_format with json_object or json_schema |
Recommended sampling parameters, from the Kimi K2.6 model card:

| Mode | Temperature | Top P |
| --- | --- | --- |
| Thinking (default) | 1.0 | 0.95 |
| Instant (thinking off) | 0.6 | 0.95 |
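
The sketch below shows one way to combine the thinking toggle with the sampling values listed above. The helper `kimi_request_kwargs` is hypothetical; it only builds keyword arguments and sends nothing. It assumes the non-standard chat_template_kwargs field is forwarded through the OpenAI Python SDK's `extra_body` argument, as in the other examples on this page.

```python
# Sampling values per mode, copied from the table above.
KIMI_SAMPLING = {
    True: {"temperature": 1.0, "top_p": 0.95},   # thinking (default)
    False: {"temperature": 0.6, "top_p": 0.95},  # instant (thinking off)
}


def kimi_request_kwargs(messages, thinking=True):
    """Build keyword arguments for client.chat.completions.create (hypothetical helper)."""
    kwargs = {
        "model": "moonshotai/kimi-k2.6",
        "messages": messages,
        **KIMI_SAMPLING[thinking],
    }
    if not thinking:
        # Kimi K2.6's chat template reads the "thinking" key, not "enable_thinking".
        kwargs["extra_body"] = {"chat_template_kwargs": {"thinking": False}}
    return kwargs


kwargs = kimi_request_kwargs(
    [{"role": "user", "content": "Summarize RAID levels in two sentences."}],
    thinking=False,
)
print(kwargs["temperature"], kwargs["extra_body"])
# With a configured client: response = client.chat.completions.create(**kwargs)
```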

Vision

Kimi K2.6 natively supports image inputs. Pass images as base64 data URIs or URLs in the content array:
# `client` is the OpenAI client configured as shown under "Listing Models via API".
response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg",
                        "detail": "auto"
                    }
                }
            ]
        }
    ],
)
You can also pass base64-encoded images:
import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"}
                }
            ]
        }
    ],
)

GLM 5.1

Z.ai’s next-generation flagship model for agentic engineering. 754B total parameters in a Mixture-of-Experts architecture, with state-of-the-art coding capabilities — it holds up over long-horizon tasks, handles ambiguous problems well, and sustains hundreds of tool calls per run. 202.8K context window, 131.1K max output. MIT licensed.

GLM 5.1 on Hugging Face

Model card, benchmarks, and deployment guides.

Capabilities

| Capability | Status | Details |
| --- | --- | --- |
| Text input | Supported | Chat, instructions, system prompts |
| Text output | Supported | Completions, structured JSON, tool calls |
| Image input | Not supported | GLM 5.1 is text-only |
| Reasoning (thinking) | On by default | Chain-of-thought returned in reasoning field. GLM 5.1’s chat template honors chat_template_kwargs: {"enable_thinking": false}. The thinking key is ignored here; setting only that key causes the chain-of-thought to leak into content, terminated by a </think> marker. See the Reasoning section for details and the forward-compatible form. |
| Tool calling | Supported | Function definitions with automatic argument extraction; strong performance on agentic tasks |
| Structured output | Supported | response_format with json_object |
Recommended sampling parameters, from the Z.ai platform docs:

| Mode | Temperature | Top P |
| --- | --- | --- |
| Thinking (default) | 1.0 | 0.95 |
| Instant (thinking off) | 0.6 | 0.95 |
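
If a request was sent with only the thinking key and the chain-of-thought has leaked into content, a client can split it defensively. This is a minimal sketch, assuming the leaked reasoning comes first and ends at a </think> marker as described above; `split_leaked_reasoning` is a hypothetical helper, and sending enable_thinking: false remains the cleaner fix.

```python
def split_leaked_reasoning(content, marker="</think>"):
    """Split content into (reasoning, answer) at the last marker, if one is present."""
    if marker not in content:
        # Nothing leaked: the whole string is the answer.
        return None, content
    reasoning, _, answer = content.rpartition(marker)
    return reasoning.strip(), answer.strip()


reasoning, answer = split_leaked_reasoning(
    "Count syllables: 5, 7, 5.</think>Fans spin in the dark"
)
print(repr(reasoning))
print(repr(answer))
```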

Example request

response = client.chat.completions.create(
    model="zai-org/glm-5.1",
    messages=[
        {"role": "user", "content": "Write a haiku about idle GPUs."}
    ],
)
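
Because GLM 5.1 is positioned for tool-heavy agentic work, a short tool-calling sketch may help. It uses the OpenAI-style tools schema and tool message shape; the tool `get_gpu_utilization`, its canned return value, and the `run_tool_call` dispatcher are all hypothetical, and a SimpleNamespace stands in for an entry of response.choices[0].message.tool_calls so the block runs without a network call.

```python
import json
from types import SimpleNamespace


def get_gpu_utilization(cluster):
    """Hypothetical local tool; returns canned data for illustration."""
    return {"cluster": cluster, "utilization": 0.42}


# Tool definition in the OpenAI function-calling format, passed as tools=TOOLS.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_gpu_utilization",
            "description": "Return current GPU utilization for a cluster.",
            "parameters": {
                "type": "object",
                "properties": {"cluster": {"type": "string"}},
                "required": ["cluster"],
            },
        },
    }
]

TOOL_REGISTRY = {"get_gpu_utilization": get_gpu_utilization}


def run_tool_call(tool_call):
    """Execute one tool call and return the tool message to append to messages."""
    func = TOOL_REGISTRY[tool_call.function.name]
    # The model supplies arguments as a JSON-encoded string.
    args = json.loads(tool_call.function.arguments)
    result = func(**args)
    return {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)}


# Stand-in for one entry of response.choices[0].message.tool_calls.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_gpu_utilization", arguments='{"cluster": "us-east"}'),
)
print(run_tool_call(fake_call))
```

In a real loop, the returned tool message would be appended to messages after the assistant message that requested the call, and the request repeated until the model stops asking for tools.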

Gemma 4

Google’s open-weight multimodal model. 31B parameters with native support for text, image, and video inputs. 262K context window with BF16 precision. Released under the Gemma license.

Gemma 4 on Hugging Face

Model card, benchmarks, and deployment guides.

Capabilities

| Capability | Status | Details |
| --- | --- | --- |
| Text input | Supported | Chat, instructions, system prompts |
| Image input | Supported | Native multimodal; pass images via image_url in messages |
| Video input | Supported | Pass video frames as a sequence of images |
| Text output | Supported | Completions, structured JSON |
| Reasoning (thinking) | Off by default | Chain-of-thought returned in reasoning field when enabled. Gemma 4’s chat template honors chat_template_kwargs: {"enable_thinking": true} (the thinking key is ignored here). Unlike Kimi K2.6 and GLM 5.1, thinking is off by default, so you must opt in. See the Reasoning section for the forward-compatible form. |
| Tool calling | Supported | Function definitions with automatic argument extraction |
| Structured output | Supported | response_format with json_object or json_schema |
Gemma 4 chain-of-thought may leak into content. vLLM’s Gemma 4 reasoning parser can fail to populate the reasoning field when special tokens are stripped before the parser runs — see vllm-project/vllm#38855. When reasoning is enabled, clients that require a clean split should post-process by treating text inside <|channel|>thought ... <|channel|> markers as reasoning.
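
The post-processing suggested above can be sketched with a regular expression. The marker shape is taken literally from the note (<|channel|>thought ... <|channel|>) and may differ in your deployment; `split_gemma_reasoning` is a hypothetical helper, not part of vLLM or the Lilac API.

```python
import re

# Marker shape taken literally from the note above; adjust if your output differs.
_THOUGHT_RE = re.compile(r"<\|channel\|>thought(.*?)<\|channel\|>", re.DOTALL)


def split_gemma_reasoning(content):
    """Return (reasoning, answer), moving marked thought spans out of content."""
    thoughts = [span.strip() for span in _THOUGHT_RE.findall(content)]
    answer = _THOUGHT_RE.sub("", content).strip()
    reasoning = "\n".join(thoughts) if thoughts else None
    return reasoning, answer


raw = "<|channel|>thought\nUse Euclid's argument.<|channel|>There are infinitely many primes."
print(split_gemma_reasoning(raw))
```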
Structured output caveat. On current vLLM builds, combining --reasoning-parser gemma4 with enable_thinking: false can silently disable xgrammar-backed structured output — see vllm-project/vllm#39130. If you rely on response_format: json_schema with Gemma 4, leave thinking enabled or validate output client-side.
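
For the client-side validation mentioned above, a minimal standard-library sketch follows. `parse_json_object` is a hypothetical helper that checks only that the content parses as a JSON object and that required keys have the expected Python types; it is not a full JSON Schema validator.

```python
import json


def parse_json_object(content, required):
    """Parse content as a JSON object and check required keys and types.

    required maps each key to the Python type its value must have.
    """
    try:
        data = json.loads(content)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response is not a JSON object")
    for key, expected_type in required.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"wrong type for {key}: {type(data[key]).__name__}")
    return data


parsed = parse_json_object('{"title": "Primes", "steps": ["assume finite"]}', {"title": str, "steps": list})
print(parsed["title"])
```

A caller would typically pass response.choices[0].message.content as content and retry or fall back when ValueError is raised.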

Enabling reasoning

Gemma 4 is the only model in the catalog where reasoning is off by default. To turn it on, use the forward-compatible form recommended in the Reasoning section:
response = client.chat.completions.create(
    model="google/gemma-4-31b-it",
    messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "enable_thinking": True,
        }
    },
)
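
Once reasoning is enabled, the chain-of-thought arrives in the reasoning field of the message. The sketch below reads it defensively; it assumes the OpenAI Python SDK exposes this non-standard field as an attribute on the message object, and uses a SimpleNamespace as a stand-in so the block runs offline. `reasoning_and_answer` is a hypothetical helper.

```python
from types import SimpleNamespace


def reasoning_and_answer(message):
    """Return (reasoning, content) from a chat completion message."""
    # reasoning is not part of the standard OpenAI message schema, so it may be
    # absent (or None) when thinking is disabled or the parser did not populate it.
    reasoning = getattr(message, "reasoning", None)
    return reasoning, message.content


# Stand-in for response.choices[0].message.
fake_message = SimpleNamespace(
    reasoning="Assume finitely many primes and multiply them.",
    content="There are infinitely many primes.",
)
print(reasoning_and_answer(fake_message))
```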

Vision

Gemma 4 natively supports image inputs. Pass images as base64 data URIs or URLs in the content array:
response = client.chat.completions.create(
    model="google/gemma-4-31b-it",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg",
                        "detail": "auto"
                    }
                }
            ]
        }
    ],
)

Video

Gemma 4 can process video by accepting a sequence of frames as images. Extract frames from your video and pass them as multiple image_url entries:
import base64
import cv2

# Extract frames from video
video = cv2.VideoCapture("video.mp4")
frames = []
while video.isOpened():
    ret, frame = video.read()
    if not ret:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    frames.append(base64.b64encode(buffer).decode())
video.release()

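# Sample up to 8 evenly strided frames to fit the context window.
# max(1, ...) avoids a zero slice step when the clip has fewer than 8 frames.
step = max(1, len(frames) // 8)
sampled = frames[::step][:8]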

response = client.chat.completions.create(
    model="google/gemma-4-31b-it",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what happens in this video."},
                *[
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{frame}"}
                    }
                    for frame in sampled
                ]
            ]
        }
    ],
)
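
Strided slicing can leave the tail of a clip unsampled when the frame count is not a multiple of the stride. As an alternative, the sketch below picks indices at the midpoints of equal-width segments; `evenly_spaced_indices` is a hypothetical helper and the default of 8 frames simply mirrors the example above.

```python
def evenly_spaced_indices(n_frames, k=8):
    """Pick up to k indices spread evenly across range(n_frames)."""
    if n_frames <= 0 or k <= 0:
        return []
    k = min(k, n_frames)
    # Take the midpoint of each of k equal-width segments of the clip.
    return [(2 * i + 1) * n_frames // (2 * k) for i in range(k)]


print(evenly_spaced_indices(100))
print(evenly_spaced_indices(3))
```

With the frames list from the example above, this would be used as sampled = [frames[i] for i in evenly_spaced_indices(len(frames))].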

MiniMax M2.7

MiniMax’s text-only language model. 204.8K context window with reasoning, tool calling, and structured output support over an OpenAI-compatible API.

MiniMax M2.7 on Hugging Face

Model card, benchmarks, and deployment guides.

Capabilities

| Capability | Status | Details |
| --- | --- | --- |
| Text input | Supported | Chat, instructions, system prompts |
| Text output | Supported | Completions, structured JSON, tool calls |
| Image input | Not supported | MiniMax M2.7 is text-only |
| Reasoning (thinking) | Supported | Chain-of-thought returned in the reasoning field. See the Reasoning section for forward-compatible toggles. |
| Tool calling | Supported | Function definitions with automatic argument extraction |
| Structured output | Supported | response_format with json_object or json_schema |

Limits

| Limit | Value |
| --- | --- |
| Context length | 204,800 tokens |
| Max completion tokens | 204,800 tokens |

Supported parameters

MiniMax M2.7 accepts the following sampling and request parameters via the OpenAI-compatible API: temperature, top_p, top_k, max_tokens, stop, frequency_penalty, presence_penalty, repetition_penalty, seed, min_p, logit_bias, logprobs, top_logprobs, response_format, structured_outputs, tools, tool_choice.
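
Some of the parameters listed above are not named arguments of the OpenAI Python SDK's chat.completions.create, so they are usually sent through extra_body. The sketch below routes them accordingly; the helper `build_minimax_kwargs` is hypothetical, and the set of parameters treated as non-standard is an assumption to check against your SDK version.

```python
# Parameters from the list above assumed NOT to be named arguments in the
# OpenAI Python SDK; these are forwarded in the request body via extra_body.
EXTRA_BODY_PARAMS = {"top_k", "repetition_penalty", "min_p", "structured_outputs"}


def build_minimax_kwargs(messages, **params):
    """Split parameters into SDK keyword arguments and extra_body fields."""
    kwargs = {"model": "minimaxai/minimax-m2.7", "messages": messages}
    extra_body = {}
    for name, value in params.items():
        if name in EXTRA_BODY_PARAMS:
            extra_body[name] = value
        else:
            kwargs[name] = value
    if extra_body:
        kwargs["extra_body"] = extra_body
    return kwargs


kwargs = build_minimax_kwargs(
    [{"role": "user", "content": "Write a haiku about idle GPUs."}],
    temperature=0.7,
    max_tokens=256,
    top_k=40,
    min_p=0.05,
)
print(kwargs)
# With a configured client: response = client.chat.completions.create(**kwargs)
```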

Example request

response = client.chat.completions.create(
    model="minimaxai/minimax-m2.7",
    messages=[
        {"role": "user", "content": "Write a haiku about idle GPUs."}
    ],
)

Listing Models via API

from openai import OpenAI

client = OpenAI(
    base_url="https://api.getlilac.com/v1",
    api_key="your-lilac-api-key",
)

models = client.models.list()
for model in models:
    print(model.id)