The chat completions endpoint is fully compatible with the OpenAI chat completions API. Lilac serves models via a customized fork of vLLM tuned for idle-GPU scheduling and shared warm endpoints, so you get access to both standard OpenAI parameters and vLLM-specific extras.

Endpoint

POST https://api.getlilac.com/v1/chat/completions

Basic Example

from openai import OpenAI

client = OpenAI(
    base_url="https://api.getlilac.com/v1",
    api_key="your-lilac-api-key",
)

response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is GPU inference?"},
    ],
)

print(response.choices[0].message.content)

Request Parameters

Required

| Parameter | Type | Description |
| --- | --- | --- |
| model | string | Model ID (e.g., moonshotai/kimi-k2.6). See Models. |
| messages | array | Conversation history. Each message has a role (system, user, assistant, tool) and content. |

Sampling

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| temperature | float | 1.0 | Sampling temperature (0–2). Lower values are more deterministic. |
| top_p | float | 1.0 | Nucleus sampling: samples from the smallest set of most likely tokens whose cumulative probability reaches top_p. |
| top_k | integer | -1 | Limits sampling to the top K tokens. -1 disables. |
| min_p | float | 0.0 | Minimum probability, relative to the most likely token, for a token to be considered. 0.0 disables. |
| seed | integer | null | Seed for deterministic sampling (best effort). |
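
To make the effect of top_k, top_p, and min_p concrete, here is a minimal pure-Python sketch that applies the three filters to a toy token distribution. It is an illustration only, not Lilac's or vLLM's implementation: the function name filter_candidates, the toy probabilities, and the order in which the filters run are all assumptions made for this example, and a real server works on logits and renormalizes before sampling.

```python
def filter_candidates(probs, top_k=-1, top_p=1.0, min_p=0.0):
    """Return the tokens that survive top_k, min_p, and top_p filtering.

    probs maps token -> probability. Illustrative only.
    """
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # top_k: keep only the K most likely tokens (-1 disables).
    if top_k > 0:
        ranked = ranked[:top_k]

    # min_p: drop tokens below min_p times the top token's probability.
    if min_p > 0.0 and ranked:
        floor = min_p * ranked[0][1]
        ranked = [(tok, p) for tok, p in ranked if p >= floor]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append(tok)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept


toy = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(filter_candidates(toy, top_p=0.75))  # ['the', 'a']
print(filter_candidates(toy, min_p=0.2))   # ['the', 'a', 'an']
print(filter_candidates(toy, top_k=1))     # ['the']
```

With top_p=0.75 the first two tokens already cover 0.8 of the probability mass, so the tail is cut. With min_p=0.2 the threshold is 0.2 × 0.5 = 0.1, which removes only "zebra".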

Output

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_tokens | integer | model-dependent | Maximum number of tokens to generate. |
| max_completion_tokens | integer | null | Upper bound on generated tokens, including reasoning tokens. Preferred for reasoning models. |
| n | integer | 1 | Number of completions to generate. |
| stop | string or array | null | Up to 4 sequences at which generation stops. |
| stream | boolean | false | Stream partial token deltas via SSE. |
| stream_options | object | null | Options like {"include_usage": true} to get token counts in the stream. |

Penalties

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| frequency_penalty | float | 0.0 | Penalizes tokens by frequency in output so far (-2.0 to 2.0). |
| presence_penalty | float | 0.0 | Penalizes tokens that have appeared at all (-2.0 to 2.0). |
| repetition_penalty | float | 1.0 | Multiplicative penalty on repeated tokens. 1.0 = no penalty. |
| logit_bias | object | null | Map of token ID → bias value (-100 to 100). |
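
The OpenAI Python SDK accepts only the standard OpenAI parameters as keyword arguments to create(), so vLLM-specific parameters such as top_k, min_p, and repetition_penalty need to travel in extra_body, the same mechanism this page uses for chat_template_kwargs. The sketch below shows one way to separate the two groups. The helper name split_params and the VLLM_EXTRAS set are hypothetical, assembled from the tables on this page rather than from an official list.

```python
# Parameters from the tables on this page that are not part of the standard
# OpenAI schema. This set is an assumption, not an official list.
VLLM_EXTRAS = {"top_k", "min_p", "repetition_penalty", "chat_template_kwargs"}


def split_params(**params):
    """Split params into SDK keyword arguments plus an extra_body dict."""
    kwargs = {k: v for k, v in params.items() if k not in VLLM_EXTRAS}
    extra_body = {k: v for k, v in params.items() if k in VLLM_EXTRAS}
    if extra_body:
        kwargs["extra_body"] = extra_body
    return kwargs


request_kwargs = split_params(
    model="moonshotai/kimi-k2.6",
    messages=[{"role": "user", "content": "Name three GPU vendors."}],
    temperature=0.7,
    top_k=40,
    repetition_penalty=1.05,
)
print(request_kwargs["extra_body"])
# With a configured client: client.chat.completions.create(**request_kwargs)
```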

Log Probabilities

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| logprobs | boolean | false | Return log probabilities of output tokens. |
| top_logprobs | integer | null | Number of most likely tokens to return at each position (0–20). Requires logprobs: true. |
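
Log probabilities are natural logarithms, so exponentiating one recovers the token's probability. The sketch below converts a list of entries into (token, probability) pairs. It assumes the entries follow the OpenAI logprobs shape (a token and a logprob per position) and uses made-up sample values. The helper name to_probabilities is hypothetical.

```python
import math


def to_probabilities(logprob_entries):
    """Convert [{'token': ..., 'logprob': ...}, ...] to (token, probability) pairs."""
    return [(e["token"], math.exp(e["logprob"])) for e in logprob_entries]


# Made-up sample in the shape of choices[0].logprobs.content, as plain dicts.
sample = [
    {"token": "Hello", "logprob": 0.0},
    {"token": "!", "logprob": math.log(0.25)},
]
for token, prob in to_probabilities(sample):
    print(f"{token!r}: {prob:.2f}")
```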

Tool Calling

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tools | array | null | List of tool definitions with type: "function" and a function schema. |
| tool_choice | string or object | "auto" | "none", "auto", "required", or {"type": "function", "function": {"name": "..."}}. |

Structured Output

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| response_format | object | null | {"type": "text"}, {"type": "json_object"}, or {"type": "json_schema", "json_schema": {...}}. |

Reasoning

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| chat_template_kwargs | object | null | Toggles model behavior exposed by the chat template. Used to enable or disable chain-of-thought reasoning; see below. |

Some models (like Kimi K2.6 and GLM 5.1) include chain-of-thought reasoning by default. When reasoning is active, the model’s chain-of-thought is returned in a separate reasoning field on the response message. Reasoning tokens are included in completion_tokens and count toward your usage.

Thinking toggle keys

The key that toggles reasoning is defined by each model’s chat template, not by the API, so it differs per model family:

| Key | Models that honor it |
| --- | --- |
| thinking (bool) | Moonshot chat templates, e.g. Kimi K2.6 |
| enable_thinking (bool) | Z.ai GLM / Google Gemma / SGLang-style templates, e.g. GLM 5.x, Gemma 4 |

Defaults also differ per model: Kimi K2.6 and GLM 5.1 have reasoning on by default, while Gemma 4 has reasoning off by default. See the Models page for per-model defaults.

Unknown keys inside chat_template_kwargs are silently ignored by chat templates, so the safe, forward-compatible approach is to send both keys. This works across all current Lilac models and any future model whose template uses either convention:
{ "chat_template_kwargs": { "thinking": false, "enable_thinking": false } }
# Disable reasoning (works for both Moonshot- and GLM-style templates)
response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_body={
        "chat_template_kwargs": {
            "thinking": False,
            "enable_thinking": False,
        }
    },
)
If you’re targeting a single model on purpose (e.g. for minimal payloads) and want to use only the key its template actually honors, see the per-model notes on the Models page.
GLM 5.x chain-of-thought leakage. Even with the correct toggle key, GLM 5.x models on the current vLLM build may still leak chain-of-thought into the content field, terminated by a bare </think> marker — see vllm-project/vllm#31319. Clients that require hard-suppressed output should post-process the response: when reasoning is disabled, discard everything in content up to and including the first </think> marker.
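
The post-processing step described above can be sketched as a small helper. This is a minimal example, not an official Lilac utility: the function name strip_leaked_reasoning is hypothetical, and trimming leading whitespace after the marker is an extra choice made here for tidier output.

```python
def strip_leaked_reasoning(content, marker="</think>"):
    """Drop everything up to and including the first marker, if present."""
    if content is None:
        return content
    head, sep, tail = content.partition(marker)
    # partition returns sep == "" when the marker is absent: keep content as-is.
    return tail.lstrip() if sep else content


print(strip_leaked_reasoning("Simple sum.</think>\n4"))  # 4
print(strip_leaked_reasoning("4"))                       # 4
```

Apply it only to responses where reasoning was disabled. With reasoning enabled, a literal </think> in content could be legitimate output.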
Disabling reasoning can significantly reduce token costs for straightforward queries where chain-of-thought isn’t needed. Reasoning tokens always count toward completion_tokens and your billed usage.

Streaming

Enable streaming to receive tokens as they’re generated:
stream = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {"role": "user", "content": "Write a haiku about GPUs."}
    ],
    stream=True,
)

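for chunk in stream:
    # Some chunks (e.g. a usage-only chunk) carry no choices, so check first.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)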
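
When stream_options is set to {"include_usage": true}, the OpenAI streaming schema delivers token counts in a final chunk whose choices list is empty. The sketch below assumes Lilac follows that convention and shows a consumer that tolerates such a chunk. The helper collect_stream and the _fake_chunk stand-in are hypothetical names used so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate streamed text and return (text, usage)."""
    parts, usage = [], None
    for chunk in chunks:
        # The usage-only chunk has no choices, so guard before indexing.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
    return "".join(parts), usage


def _fake_chunk(text=None, usage=None):
    """Stand-in for an SDK chunk object (illustrative only)."""
    choices = []
    if text is not None:
        choices = [SimpleNamespace(delta=SimpleNamespace(content=text))]
    return SimpleNamespace(choices=choices, usage=usage)


fake_stream = [
    _fake_chunk("Silicon "),
    _fake_chunk("hums."),
    _fake_chunk(usage=SimpleNamespace(completion_tokens=3)),
]
text, usage = collect_stream(fake_stream)
print(text, usage.completion_tokens)
```

With a real client, pass stream=True and stream_options={"include_usage": True} to create() and hand the returned stream to collect_stream.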

Vision

Pass images as URLs or base64 data URIs in the content array:
response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg",
                        "detail": "auto"
                    }
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)
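
For local images, the base64 data URI form mentioned above can be built with the standard library. A minimal sketch follows. The helper name image_part_from_bytes is hypothetical, and the four placeholder bytes stand in for a real file's contents.

```python
import base64


def image_part_from_bytes(data, mime_type="image/png", detail="auto"):
    """Build an image_url content part from raw image bytes."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:{mime_type};base64,{encoded}",
            "detail": detail,
        },
    }


# Placeholder bytes; in practice read them with open(path, "rb").read().
part = image_part_from_bytes(b"\x89PNG", mime_type="image/png")
print(part["image_url"]["url"])
```

The returned dict can be placed in the content array in place of the URL-based image_url entry shown above.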

Tool Calling

response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {"role": "user", "content": "What's the weather in SF?"}
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ],
)

tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name)       # "get_weather"
print(tool_call.function.arguments)  # '{"location": "San Francisco"}'
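
The arguments field arrives as a JSON string, so the client has to parse it, run the matching function, and send the result back as a tool message. The sketch below covers that step under a few assumptions: get_weather here is a hypothetical local function returning canned data, and TOOL_REGISTRY and run_tool_call are illustrative names, not part of any SDK.

```python
import json


def get_weather(location):
    """Hypothetical local implementation; returns canned data."""
    return {"location": location, "forecast": "sunny"}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(call_id, name, arguments_json):
    """Execute one tool call and return the tool message to send back."""
    args = json.loads(arguments_json)  # arguments arrive as a JSON string
    result = TOOL_REGISTRY[name](**args)
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}


tool_message = run_tool_call("call_123", "get_weather", '{"location": "San Francisco"}')
print(tool_message)
```

In the OpenAI tool-calling flow, you then append the assistant message that contained the tool call, followed by this tool message, to messages and call create() again so the model can write its final answer.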

Structured Output

Force the model to return valid JSON matching a schema:
response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {"role": "user", "content": "List 3 programming languages and their year of creation."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "languages",
            "schema": {
                "type": "object",
                "properties": {
                    "languages": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "year": {"type": "integer"}
                            },
                            "required": ["name", "year"]
                        }
                    }
                },
                "required": ["languages"]
            }
        }
    },
)
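
The structured result is returned as a JSON string in message.content, so it still needs to be parsed on the client. A minimal sketch, using a made-up string in place of a real response; the helper name parse_languages is hypothetical, and its type checks are a light sanity check rather than full schema validation.

```python
import json


def parse_languages(content):
    """Parse the JSON string and check it against the expected shape."""
    data = json.loads(content)
    languages = data["languages"]
    for entry in languages:
        if not isinstance(entry["name"], str) or not isinstance(entry["year"], int):
            raise ValueError(f"unexpected entry: {entry!r}")
    return languages


# Made-up content standing in for response.choices[0].message.content.
sample_content = (
    '{"languages": [{"name": "Python", "year": 1991}, {"name": "Go", "year": 2009}]}'
)
for lang in parse_languages(sample_content):
    print(lang["name"], lang["year"])
```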

Response Format

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1717000000,
  "model": "moonshotai/kimi-k2.6",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "GPU inference is the process of...",
        "reasoning": "The user is asking about...",
        "tool_calls": []
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 42,
    "total_tokens": 67
  }
}
The reasoning field is present when the model uses chain-of-thought reasoning. Reasoning tokens are not reported separately in usage; they are included in completion_tokens.
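
Because reasoning is not part of the standard OpenAI schema, it is safest to read it defensively. The sketch below works on the response as a plain dict, for example the parsed JSON body. Calling response.model_dump() on an SDK object should produce a similar dict, on the assumption that the SDK preserves the extra field. The helper name summarize_response is hypothetical.

```python
def summarize_response(response):
    """Extract the answer, optional reasoning, and token counts from a response dict."""
    message = response["choices"][0]["message"]
    usage = response["usage"]
    return {
        "content": message.get("content"),
        # reasoning is absent when the model did not reason.
        "reasoning": message.get("reasoning"),
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }


example = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "GPU inference is the process of...",
                "reasoning": "The user is asking about...",
            },
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 25, "completion_tokens": 42, "total_tokens": 67},
}
summary = summarize_response(example)
print(summary["completion_tokens"], summary["reasoning"] is not None)
```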