The chat completions endpoint is fully compatible with the OpenAI chat completions API. Lilac serves models via a customized fork of vLLM tuned for idle-GPU scheduling and shared warm endpoints, so you get access to both standard OpenAI parameters and vLLM-specific extras.
Endpoint
POST https://api.getlilac.com/v1/chat/completions
Basic Example
from openai import OpenAI
client = OpenAI(
base_url="https://api.getlilac.com/v1",
api_key="your-lilac-api-key",
)
response = client.chat.completions.create(
model="moonshotai/kimi-k2.5",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is GPU inference?"},
],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://api.getlilac.com/v1",
apiKey: "your-lilac-api-key",
});
const response = await client.chat.completions.create({
model: "moonshotai/kimi-k2.5",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "What is GPU inference?" },
],
});
console.log(response.choices[0].message.content);
curl https://api.getlilac.com/v1/chat/completions \
-H "Authorization: Bearer your-lilac-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "moonshotai/kimi-k2.5",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is GPU inference?"}
]
}'
Request Parameters
Required
| Parameter | Type | Description |
|---|---|---|
| model | string | Model ID (e.g., moonshotai/kimi-k2.5). See Models. |
| messages | array | Conversation history. Each message has a role (system, user, assistant, or tool) and content. |
Sampling
| Parameter | Type | Default | Description |
|---|---|---|---|
| temperature | float | 1.0 | Sampling temperature (0–2). Lower values are more deterministic. |
| top_p | float | 1.0 | Nucleus sampling: restricts choices to the smallest set of tokens whose cumulative probability reaches top_p. |
| top_k | integer | -1 | Limits sampling to the top K tokens. -1 disables. |
| min_p | float | 0.0 | Minimum relative probability threshold for token consideration. |
| seed | integer | null | Seed for deterministic sampling (best effort). |
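These parameters combine freely in a single request. A sketch of a request payload using them together; note that top_k and min_p are vLLM extensions, so with the OpenAI Python SDK they go through extra_body rather than as named arguments (the values here are illustrative):

```python
# Standard OpenAI sampling parameters, passed as named arguments.
sampling = {
    "temperature": 0.7,  # lower than the 1.0 default, so less random
    "top_p": 0.9,        # nucleus sampling over the top 90% of probability mass
    "seed": 42,          # best-effort determinism across identical requests
}

# vLLM extensions, passed via extra_body in the OpenAI Python SDK.
vllm_extras = {
    "top_k": 50,    # only the 50 most likely tokens are candidates
    "min_p": 0.05,  # drop tokens below 5% of the top token's probability
}

# response = client.chat.completions.create(
#     model="moonshotai/kimi-k2.5",
#     messages=[{"role": "user", "content": "Hello"}],
#     **sampling,
#     extra_body=vllm_extras,
# )
```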
Output
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_tokens | integer | model-dependent | Maximum tokens to generate. |
| max_completion_tokens | integer | null | Upper bound including reasoning tokens. Preferred for reasoning models. |
| n | integer | 1 | Number of completions to generate. |
| stop | string or array | null | Up to 4 sequences where generation stops. |
| stream | boolean | false | Stream partial token deltas via SSE. |
| stream_options | object | null | Options like {"include_usage": true} to get token counts in the stream. |
Penalties
| Parameter | Type | Default | Description |
|---|---|---|---|
| frequency_penalty | float | 0.0 | Penalizes tokens by frequency in the output so far (-2.0 to 2.0). |
| presence_penalty | float | 0.0 | Penalizes tokens that have appeared at all (-2.0 to 2.0). |
| repetition_penalty | float | 1.0 | Multiplicative penalty on repeated tokens. 1.0 = no penalty. |
| logit_bias | object | null | Map of token ID → bias value (-100 to 100). |
Log Probabilities
| Parameter | Type | Default | Description |
|---|---|---|---|
| logprobs | boolean | false | Return log probabilities of output tokens. |
| top_logprobs | integer | null | Number of most likely tokens to return at each position (0–20). Requires logprobs: true. |
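With logprobs enabled, each choice carries a logprobs.content list with one entry per generated token. A sketch of reading it; the payload below is shaped like the standard chat logprobs format, with made-up values:

```python
import math

# Illustrative slice of choices[0].logprobs.content from a response with
# logprobs: true and top_logprobs: 2. The numbers are made up.
logprobs_content = [
    {"token": "GPU", "logprob": -0.12,
     "top_logprobs": [{"token": "GPU", "logprob": -0.12},
                      {"token": "Graphics", "logprob": -2.31}]},
    {"token": " inference", "logprob": -0.45, "top_logprobs": []},
]

for entry in logprobs_content:
    # Convert the log probability back to a plain probability for readability.
    prob = math.exp(entry["logprob"])
    print(f"{entry['token']!r}: p={prob:.3f}")
```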
Tool Calling
| Parameter | Type | Default | Description |
|---|---|---|---|
| tools | array | null | List of tool definitions with type: "function" and a function schema. |
| tool_choice | string or object | "auto" | "none", "auto", "required", or {"type": "function", "function": {"name": "..."}}. |
Structured Output
| Parameter | Type | Default | Description |
|---|---|---|---|
| response_format | object | null | {"type": "text"}, {"type": "json_object"}, or {"type": "json_schema", "json_schema": {...}}. |
Reasoning
| Parameter | Type | Default | Description |
|---|---|---|---|
| chat_template_kwargs | object | null | Pass {"thinking": false} to disable chain-of-thought reasoning, or {"thinking": true} to enable it (the default). |
Some models (like Kimi K2.5 and GLM 5.1) include chain-of-thought reasoning by default. The model’s reasoning is returned in a separate reasoning field on the response message. Reasoning tokens are included in completion_tokens and count toward your usage.
Disable reasoning to get faster, cheaper responses for simple tasks:
# Disable reasoning
response = client.chat.completions.create(
model="moonshotai/kimi-k2.5",
messages=[{"role": "user", "content": "What is 2+2?"}],
extra_body={"chat_template_kwargs": {"thinking": False}},
)
const response = await client.chat.completions.create({
model: "moonshotai/kimi-k2.5",
messages: [
{ role: "user", content: "What is 2+2?" }
],
// chat_template_kwargs is a vLLM extension, not in the SDK's types; it passes through to the request body at runtime
chat_template_kwargs: { thinking: false },
});
curl https://api.getlilac.com/v1/chat/completions \
-H "Authorization: Bearer your-lilac-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "moonshotai/kimi-k2.5",
"messages": [
{"role": "user", "content": "What is 2+2?"}
],
"chat_template_kwargs": {"thinking": false}
}'
Disabling reasoning can significantly reduce token costs for straightforward queries where chain-of-thought isn’t needed.
Streaming
Enable streaming to receive tokens as they’re generated:
stream = client.chat.completions.create(
model="moonshotai/kimi-k2.5",
messages=[
{"role": "user", "content": "Write a haiku about GPUs."}
],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
const stream = await client.chat.completions.create({
model: "moonshotai/kimi-k2.5",
messages: [
{ role: "user", content: "Write a haiku about GPUs." }
],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
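When stream_options is set to {"include_usage": true}, the final chunk arrives with an empty choices list and a populated usage object. A sketch of handling that, using illustrative chunk payloads in place of a live stream (real chunks are SDK objects with the same fields):

```python
# Illustrative chunks, shaped like streamed deltas when the request set
# stream=True and stream_options={"include_usage": True}.
chunks = [
    {"choices": [{"delta": {"content": "Silicon "}}], "usage": None},
    {"choices": [{"delta": {"content": "hums."}}], "usage": None},
    # Final chunk: empty choices, usage populated.
    {"choices": [], "usage": {"prompt_tokens": 12, "completion_tokens": 7,
                              "total_tokens": 19}},
]

text, usage = "", None
for chunk in chunks:
    # Guard on choices: the usage-bearing final chunk has none.
    if chunk["choices"]:
        text += chunk["choices"][0]["delta"].get("content") or ""
    if chunk["usage"] is not None:
        usage = chunk["usage"]

print(text)
print(usage["total_tokens"])
```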
Vision
Pass images as URLs or base64 data URIs in the content array:
response = client.chat.completions.create(
model="moonshotai/kimi-k2.5",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image."},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.jpg",
"detail": "auto"
}
}
]
}
],
)
print(response.choices[0].message.content)
const response = await client.chat.completions.create({
model: "moonshotai/kimi-k2.5",
messages: [
{
role: "user",
content: [
{ type: "text", text: "Describe this image." },
{
type: "image_url",
image_url: {
url: "https://example.com/image.jpg",
detail: "auto",
},
},
],
},
],
});
console.log(response.choices[0].message.content);
curl https://api.getlilac.com/v1/chat/completions \
-H "Authorization: Bearer your-lilac-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "moonshotai/kimi-k2.5",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image."},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg", "detail": "auto"}}
]
}
]
}'
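For local files, an image can be sent inline as a base64 data URI instead of a public URL. A minimal sketch of building one; the bytes below are a stand-in for data read from a real image file:

```python
import base64

# Stand-in for bytes from a real file, e.g. open("photo.jpg", "rb").read().
image_bytes = b"\xff\xd8\xff\xe0fake-jpeg-bytes"

# Prefix with the MIME type, then append the base64-encoded payload.
data_uri = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")

# The data URI slots into the same image_url field as a regular URL:
image_part = {"type": "image_url", "image_url": {"url": data_uri, "detail": "auto"}}
print(data_uri[:30])
```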
Tool Calling
Define tools with JSON schemas; when the model decides to invoke one, it returns tool_calls instead of plain content:
response = client.chat.completions.create(
model="moonshotai/kimi-k2.5",
messages=[
{"role": "user", "content": "What's the weather in SF?"}
],
tools=[
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
}
}
],
)
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name) # "get_weather"
print(tool_call.function.arguments) # '{"location": "San Francisco"}'
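After executing the tool locally, the result goes back to the model as a tool message referencing the call's id, followed by a second completions request. A sketch of that round trip, using an illustrative tool call dict and a stand-in weather lookup:

```python
import json

# Illustrative tool call, shaped like response.choices[0].message.tool_calls[0].
tool_call = {
    "id": "call_abc123",
    "function": {"name": "get_weather",
                 "arguments": '{"location": "San Francisco"}'},
}

# arguments is a JSON *string*, not a dict, so parse it before dispatching.
args = json.loads(tool_call["function"]["arguments"])
result = {"location": args["location"], "temp_f": 61}  # stand-in for a real lookup

# Append this to messages (after the assistant message that made the call),
# then call client.chat.completions.create again with the full history.
followup = {
    "role": "tool",
    "tool_call_id": tool_call["id"],
    "content": json.dumps(result),
}
print(followup["content"])
```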
Structured Output
Force the model to return valid JSON matching a schema:
response = client.chat.completions.create(
model="moonshotai/kimi-k2.5",
messages=[
{"role": "user", "content": "List 3 programming languages and their year of creation."}
],
response_format={
"type": "json_schema",
"json_schema": {
"name": "languages",
"schema": {
"type": "object",
"properties": {
"languages": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"year": {"type": "integer"}
},
"required": ["name", "year"]
}
}
},
"required": ["languages"]
}
}
},
)
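With json_schema enforced, the reply content is guaranteed to be JSON matching the schema, so it can be parsed directly. A sketch of consuming it; the content string below is an illustrative reply, standing in for response.choices[0].message.content:

```python
import json

# Illustrative reply matching the "languages" schema above.
content = ('{"languages": [{"name": "Python", "year": 1991}, '
           '{"name": "C", "year": 1972}, {"name": "Go", "year": 2009}]}')

# Schema enforcement means json.loads will not raise on well-formed replies.
data = json.loads(content)
for lang in data["languages"]:
    print(f"{lang['name']}: {lang['year']}")
```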
Response
A successful request returns a standard chat completion object:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1717000000,
"model": "moonshotai/kimi-k2.5",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "GPU inference is the process of...",
"reasoning": "The user is asking about...",
"tool_calls": []
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 42,
"total_tokens": 67
}
}
The reasoning field is present when the model uses chain-of-thought reasoning. It is not counted separately in the response — reasoning tokens are included in completion_tokens.
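Since reasoning tokens are billed inside completion_tokens, the usage arithmetic can be checked directly. A sketch against the sample response above, treated as a plain dict (reasoning is a non-standard field, so .get() keeps the code safe for models that omit it):

```python
# Sample response from above, as a plain dict.
resp = {
    "choices": [{"message": {"content": "GPU inference is the process of...",
                             "reasoning": "The user is asking about..."}}],
    "usage": {"prompt_tokens": 25, "completion_tokens": 42, "total_tokens": 67},
}

message = resp["choices"][0]["message"]
# .get() returns None when the model produced no chain-of-thought.
reasoning = message.get("reasoning")

# Reasoning tokens are folded into completion_tokens, so the totals add up.
usage = resp["usage"]
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
print(reasoning)
```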