The chat completions endpoint is fully compatible with the OpenAI chat completions API. Lilac serves models via a customized fork of vLLM tuned for idle-GPU scheduling and shared warm endpoints, so you get access to both standard OpenAI parameters and vLLM-specific extras.
Endpoint
POST https://api.getlilac.com/v1/chat/completions
Basic Example
from openai import OpenAI

client = OpenAI(
    base_url="https://api.getlilac.com/v1",
    api_key="your-lilac-api-key",
)

response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is GPU inference?"},
    ],
)
print(response.choices[0].message.content)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.getlilac.com/v1",
  apiKey: "your-lilac-api-key",
});

const response = await client.chat.completions.create({
  model: "moonshotai/kimi-k2.6",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is GPU inference?" },
  ],
});
console.log(response.choices[0].message.content);

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/kimi-k2.6",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is GPU inference?"}
    ]
  }'
Request Parameters
Required
| Parameter | Type | Description |
|---|---|---|
| model | string | Model ID (e.g., moonshotai/kimi-k2.6). See Models. |
| messages | array | Conversation history. Each message has a role (system, user, assistant, tool) and content. |
Sampling
| Parameter | Type | Default | Description |
|---|---|---|---|
| temperature | float | 1.0 | Sampling temperature (0–2). Lower values are more deterministic. |
| top_p | float | 1.0 | Nucleus sampling: samples from the smallest set of most likely tokens whose cumulative probability reaches top_p. |
| top_k | integer | -1 | Limits sampling to the top K tokens. -1 disables. |
| min_p | float | 0.0 | Minimum probability, relative to the most likely token, for a token to be considered. |
| seed | integer | null | Seed for deterministic sampling (best effort). |
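To make the interaction between these filters concrete, here is a minimal, illustrative sketch in pure Python of how top_k, top_p, and min_p each narrow a token distribution. It is not Lilac's or vLLM's actual implementation: the `filter_candidates` helper and the toy probabilities are hypothetical, and real samplers work on logits over the full vocabulary and may apply the filters in a different order.

```python
def filter_candidates(probs, top_k=-1, top_p=1.0, min_p=0.0):
    """Return the tokens that survive top_k, top_p and min_p, renormalized.

    probs maps token -> probability. Hypothetical helper, for illustration only.
    """
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    # top_k: keep only the K most likely tokens (-1 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # min_p: drop tokens below min_p times the probability of the top token.
    threshold = min_p * kept[0][1]
    kept = [(token, p) for token, p in kept if p >= threshold]

    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


toy = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(sorted(filter_candidates(toy, top_k=2)))    # ['a', 'the']
print(sorted(filter_candidates(toy, top_p=0.9)))  # ['a', 'an', 'the']
print(sorted(filter_candidates(toy, min_p=0.2)))  # ['a', 'an', 'the']
```

In this toy distribution, min_p=0.2 removes "zebra" because 0.05 is below 0.2 × 0.5 = 0.1, while the other three tokens clear the threshold.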
Output
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_tokens | integer | model-dependent | Maximum tokens to generate. |
| max_completion_tokens | integer | null | Upper bound including reasoning tokens. Preferred for reasoning models. |
| n | integer | 1 | Number of completions to generate. |
| stop | string or array | null | Up to 4 sequences where generation stops. |
| stream | boolean | false | Stream partial token deltas via SSE. |
| stream_options | object | null | Options like {"include_usage": true} to get token counts in the stream. |
Penalties
| Parameter | Type | Default | Description |
|---|---|---|---|
| frequency_penalty | float | 0.0 | Penalizes tokens in proportion to how often they have appeared in the output so far (-2.0 to 2.0). |
| presence_penalty | float | 0.0 | Penalizes tokens that have appeared at all (-2.0 to 2.0). |
| repetition_penalty | float | 1.0 | Multiplicative penalty on repeated tokens. 1.0 = no penalty; values above 1.0 discourage repetition. |
| logit_bias | object | null | Map of token ID → bias value (-100 to 100). |
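The three penalties differ in how they adjust a token's logit. The sketch below follows the commonly documented formulas: the OpenAI-style additive frequency and presence penalties, and the multiplicative repetition penalty used by vLLM-style samplers. It is an assumption-laden illustration, not Lilac's implementation: `apply_penalties` is a hypothetical helper, and for simplicity it only counts tokens in the generated output, whereas a real sampler may also consider prompt tokens for the repetition penalty.

```python
from collections import Counter


def apply_penalties(logits, output_tokens, frequency_penalty=0.0,
                    presence_penalty=0.0, repetition_penalty=1.0):
    """Return a penalized copy of logits (token -> logit). Illustrative only."""
    counts = Counter(output_tokens)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token not in adjusted:
            continue
        value = adjusted[token]
        # repetition_penalty is multiplicative: shrink positive logits and push
        # negative logits further down, so values above 1.0 discourage repeats.
        value = value / repetition_penalty if value > 0 else value * repetition_penalty
        # frequency_penalty grows with the number of times the token appeared.
        value -= frequency_penalty * count
        # presence_penalty is a flat charge for having appeared at all.
        value -= presence_penalty
        adjusted[token] = value
    return adjusted


logits = {"gpu": 2.0, "cpu": 1.0, "tpu": -1.0}
seen = ["gpu", "gpu", "tpu"]
print(apply_penalties(logits, seen, frequency_penalty=0.5))   # gpu drops by 1.0, tpu by 0.5
print(apply_penalties(logits, seen, presence_penalty=0.5))    # gpu and tpu each drop by 0.5
print(apply_penalties(logits, seen, repetition_penalty=2.0))  # gpu halves, tpu doubles in magnitude
```

Tokens that have not appeared yet ("cpu" here) are left untouched by all three penalties.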
Log Probabilities
| Parameter | Type | Default | Description |
|---|---|---|---|
| logprobs | boolean | false | Return log probabilities of output tokens. |
| top_logprobs | integer | null | Number of most likely tokens to return at each position (0–20). Requires logprobs: true. |
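Log probabilities are natural logarithms, so exponentiating them recovers ordinary probabilities. The sketch below assumes the response follows the OpenAI-style layout, where each choice carries logprobs.content as a list of per-token entries; the `token_probabilities` helper and the sample payload are made up for illustration.

```python
import math


def token_probabilities(choice):
    """Convert each output token's log probability into a plain probability.

    choice is one entry of response["choices"], as a dict.
    """
    entries = (choice.get("logprobs") or {}).get("content") or []
    return [(entry["token"], math.exp(entry["logprob"])) for entry in entries]


# Made-up sample mirroring the assumed OpenAI-style logprobs layout.
sample_choice = {
    "logprobs": {
        "content": [
            {"token": "Hello", "logprob": -0.1, "top_logprobs": []},
            {"token": "!", "logprob": -2.3, "top_logprobs": []},
        ]
    }
}
for token, prob in token_probabilities(sample_choice):
    print(f"{token!r}: {prob:.2f}")
```

The helper returns an empty list when logprobs were not requested, so it can be called unconditionally.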
Tool Calling
| Parameter | Type | Default | Description |
|---|---|---|---|
| tools | array | null | List of tool definitions with type: "function" and a function schema. |
| tool_choice | string or object | "auto" | "none", "auto", "required", or {"type": "function", "function": {"name": "..."}}. |
Structured Output
| Parameter | Type | Default | Description |
|---|---|---|---|
| response_format | object | null | {"type": "text"}, {"type": "json_object"}, or {"type": "json_schema", "json_schema": {...}}. |
Reasoning
| Parameter | Type | Default | Description |
|---|---|---|---|
| chat_template_kwargs | object | null | Toggles model behavior exposed by the chat template. Used to enable or disable chain-of-thought reasoning; see below. |
Some models (like Kimi K2.6 and GLM 5.1) include chain-of-thought reasoning by default. When reasoning is active, the model’s chain-of-thought is returned in a separate reasoning field on the response message. Reasoning tokens are included in completion_tokens and count toward your usage.
Thinking toggle keys
The key that toggles reasoning is defined by each model’s chat template, not by the API, so it differs per model family:
| Key | Models that honor it |
|---|---|
| thinking (bool) | Moonshot chat templates, e.g. Kimi K2.6 |
| enable_thinking (bool) | Z.ai GLM / Google Gemma / SGLang-style templates, e.g. GLM 5.x, Gemma 4 |
Defaults also differ per model: Kimi K2.6 and GLM 5.1 have reasoning on by default, while Gemma 4 has reasoning off by default. See the Models page for per-model defaults.
Unknown keys inside chat_template_kwargs are silently ignored by chat templates, so the safe, forward-compatible approach is to send both keys. This works across all current Lilac models and any future model whose template uses either convention:
{ "chat_template_kwargs": { "thinking": false, "enable_thinking": false } }
# Disable reasoning (works for both Moonshot- and GLM-style templates)
response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_body={
        "chat_template_kwargs": {
            "thinking": False,
            "enable_thinking": False,
        }
    },
)

const response = await client.chat.completions.create({
  model: "moonshotai/kimi-k2.6",
  messages: [
    { role: "user", content: "What is 2+2?" }
  ],
  chat_template_kwargs: {
    thinking: false,
    enable_thinking: false,
  },
});

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/kimi-k2.6",
    "messages": [
      {"role": "user", "content": "What is 2+2?"}
    ],
    "chat_template_kwargs": {
      "thinking": false,
      "enable_thinking": false
    }
  }'
If you’re targeting a single model on purpose (e.g. for minimal payloads) and want to use only the key its template actually honors, see the per-model notes on the Models page.
GLM 5.x chain-of-thought leakage. Even with the correct toggle key, GLM 5.x models on the current vLLM build may still leak chain-of-thought into the content field, terminated by a bare </think> marker — see vllm-project/vllm#31319. Clients that require hard-suppressed output should post-process the response: when reasoning is disabled, discard everything in content up to and including the first </think> marker.
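The post-processing step described above can be sketched as a small helper. The function name `strip_leaked_reasoning` is hypothetical, and trimming the whitespace that follows the marker is an assumption of this sketch, not something the API requires.

```python
def strip_leaked_reasoning(content, marker="</think>"):
    """Discard everything up to and including the first marker, if present."""
    if content is None:
        return content
    # partition splits on the FIRST occurrence; `found` is empty when the
    # marker is absent, in which case the content is returned unchanged.
    _, found, tail = content.partition(marker)
    # Assumption: leading whitespace after the marker is not wanted.
    return tail.lstrip() if found else content


print(strip_leaked_reasoning("2+2 is basic arithmetic.</think>\n\n4"))  # 4
print(strip_leaked_reasoning("4"))                                      # 4
```

Apply it only to requests where reasoning was disabled; with reasoning enabled, a literal </think> in content could be legitimate output.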
Disabling reasoning can significantly reduce token costs for straightforward queries where chain-of-thought isn’t needed. Reasoning tokens always count toward completion_tokens and your billed usage.
Streaming
Enable streaming to receive tokens as they’re generated:
stream = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {"role": "user", "content": "Write a haiku about GPUs."}
    ],
    stream=True,
)
for chunk in stream:
    # Guard against chunks that carry no choices (e.g. a usage-only chunk).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

const stream = await client.chat.completions.create({
  model: "moonshotai/kimi-k2.6",
  messages: [
    { role: "user", content: "Write a haiku about GPUs." }
  ],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
Vision
Pass images as URLs or base64 data URIs in the content array:
response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg",
                        "detail": "auto"
                    }
                }
            ]
        }
    ],
)
print(response.choices[0].message.content)

const response = await client.chat.completions.create({
  model: "moonshotai/kimi-k2.6",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe this image." },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/image.jpg",
            detail: "auto",
          },
        },
      ],
    },
  ],
});
console.log(response.choices[0].message.content);

curl https://api.getlilac.com/v1/chat/completions \
  -H "Authorization: Bearer your-lilac-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/kimi-k2.6",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg", "detail": "auto"}}
        ]
      }
    ]
  }'
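The examples above use a hosted URL. For the base64 option, a data URI can be built from raw image bytes with the standard library. This is a minimal sketch: `image_to_data_uri` and `image_part` are hypothetical helpers, and inferring the MIME type from the file name is an assumption made for brevity.

```python
import base64
import mimetypes


def image_to_data_uri(data, filename):
    """Encode raw image bytes as a base64 data URI."""
    # Assumption: the file extension reliably indicates the image type.
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(data, filename, detail="auto"):
    """Build an image_url content part that embeds the image inline."""
    return {
        "type": "image_url",
        "image_url": {"url": image_to_data_uri(data, filename), "detail": detail},
    }


# The first four bytes of a PNG file, used here only as sample input.
print(image_to_data_uri(b"\x89PNG", "example.png"))  # data:image/png;base64,iVBORw==
```

In practice the bytes would come from reading a file in binary mode, and the returned part would replace the URL-based image_url entry in the content array.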
Tool Calling
Pass function definitions in tools; the model returns any calls it wants to make in message.tool_calls:
response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {"role": "user", "content": "What's the weather in SF?"}
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ],
)
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name)       # "get_weather"
print(tool_call.function.arguments)  # '{"location": "San Francisco"}'
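After receiving a tool call, the usual OpenAI-style follow-up is to run the function locally and send its result back as a message with role "tool". The sketch below shows that step on plain dicts. The `get_weather` stub, the `TOOLS` registry, and `run_tool_call` are hypothetical names, and the flow assumes Lilac accepts tool messages in the standard OpenAI shape.

```python
import json


def get_weather(location):
    # Stand-in implementation; a real tool would query a weather service.
    return {"location": location, "forecast": "sunny"}


# Registry mapping tool names (as declared in `tools`) to local callables.
TOOLS = {"get_weather": get_weather}


def run_tool_call(tool_call):
    """Execute one tool call (as a dict) and return the matching tool message."""
    name = tool_call["function"]["name"]
    # `arguments` arrives as a JSON-encoded string, not as a dict.
    args = json.loads(tool_call["function"]["arguments"])
    result = TOOLS[name](**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],  # links the result to the request
        "content": json.dumps(result),
    }


sample_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "San Francisco"}'},
}
print(run_tool_call(sample_call))
```

To continue the conversation, append the assistant message that contained the tool call, then this tool message, to messages and call the endpoint again.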
Structured Output
Force the model to return valid JSON matching a schema:
response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {"role": "user", "content": "List 3 programming languages and their year of creation."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "languages",
            "schema": {
                "type": "object",
                "properties": {
                    "languages": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "year": {"type": "integer"}
                            },
                            "required": ["name", "year"]
                        }
                    }
                },
                "required": ["languages"]
            }
        }
    },
)
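The structured result comes back as a JSON string in message.content, so it still needs to be parsed before use. The sketch below parses it and sanity-checks the shape against the schema above; `parse_languages` is a hypothetical helper and the sample string is made up rather than real model output.

```python
import json


def parse_languages(content):
    """Parse the JSON string from message.content and check its shape."""
    data = json.loads(content)
    languages = data["languages"]
    for item in languages:
        # Defensive check mirroring the schema's required fields and types.
        if not isinstance(item["name"], str) or not isinstance(item["year"], int):
            raise ValueError(f"unexpected item shape: {item!r}")
    return languages


# Made-up content string in the shape the schema asks for.
sample_content = '{"languages": [{"name": "Python", "year": 1991}]}'
for language in parse_languages(sample_content):
    print(language["name"], language["year"])  # Python 1991
```

With a real response, pass response.choices[0].message.content to the helper.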
Response
A non-streaming request returns a chat completion object:
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1717000000,
  "model": "moonshotai/kimi-k2.6",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "GPU inference is the process of...",
        "reasoning": "The user is asking about...",
        "tool_calls": []
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 42,
    "total_tokens": 67
  }
}
The reasoning field is present when the model uses chain-of-thought reasoning. Reasoning tokens are not reported separately in usage; they are included in completion_tokens.
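A small helper can pull the commonly needed fields out of such a response. This is a sketch over plain dicts: `summarize` is a hypothetical name, and treating finish_reason "length" as a sign of truncation follows the OpenAI convention, which Lilac is assumed to share.

```python
def summarize(response):
    """Extract the answer, optional reasoning, and token usage from a response dict."""
    choice = response["choices"][0]
    message = choice["message"]
    return {
        "content": message.get("content"),
        # `reasoning` is absent when the model did not use chain-of-thought.
        "reasoning": message.get("reasoning"),
        # Assumption: "length" means generation stopped at the token limit.
        "truncated": choice["finish_reason"] == "length",
        "completion_tokens": response["usage"]["completion_tokens"],
    }


sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "GPU inference is the process of..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 25, "completion_tokens": 42, "total_tokens": 67},
}
print(summarize(sample_response))
```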