Chat Completions API
Use this OpenAI-compatible Chat Completions endpoint to generate responses from your deployed LLM with a simple messages-based interface.
Send a prompt, tune generation settings like temperature and max tokens, and receive a structured response you can plug into your app.
Create a chat completion
Generate a response from an open-source LLM deployment served via vLLM, using an OpenAI-compatible Chat Completions interface.
Compatibility notes (vLLM):
- Works with OpenAI SDKs by setting base_url/baseURL.
- Some OpenAI fields may be ignored or partially supported depending on your vLLM version/model/template (e.g., certain multimodal fields). Treat unsupported fields as no-ops.
Required attributes
- Name
model
- Type
- string
- Description
The model (or deployment ID/alias) to use for this request.
- Name
messages
- Type
- array
- Description
The conversation so far. Each item is an object with a role and content. Common roles: system, developer, user, assistant, tool. Content may be a string, or (for multimodal-capable setups) an array of content parts such as { "type": "text", "text": "..." }.
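As a sketch, a messages array mixing plain-string content with content parts might be built like this (the field names follow the OpenAI Chat Completions shape; whether content parts are honored depends on your model and chat template):

```python
# Build an OpenAI-style messages array: a plain-string system message plus
# a user message whose content is an array of typed parts.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize this image in one sentence."},
            # "image_url" parts are only honored by multimodal-capable
            # models/templates; other servers may ignore or reject them.
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    },
]

request_body = {"model": "your-model-or-deployment-id", "messages": messages}
```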
Optional attributes
- Name
temperature
- Type
- number
- Description
Sampling temperature. Higher values increase randomness; lower values make output more deterministic.
- Name
top_p
- Type
- number
- Description
Nucleus sampling. Alternative to temperature; tune one or the other.
- Name
top_k
- Type
- integer
- Description
(vLLM passthrough) Limits sampling to the top-k tokens. If your vLLM server supports it, this will be applied.
- Name
min_p
- Type
- number
- Description
(vLLM passthrough) Minimum probability threshold for sampling (if supported by your vLLM server).
- Name
n
- Type
- integer
- Description
Number of completions to generate for the same input. Most servers return n choices.
- Name
max_tokens
- Type
- integer
- Description
Maximum number of tokens to generate (legacy OpenAI param; still commonly used by clients).
- Name
max_completion_tokens
- Type
- integer
- Description
Maximum number of completion tokens to generate (newer OpenAI-compatible clients may send this).
- Name
min_tokens
- Type
- integer
- Description
(vLLM passthrough) Minimum number of tokens to generate before stopping (if supported).
- Name
stop
- Type
- string | array
- Description
Up to 4 stop sequences. Generation stops when any stop sequence is encountered.
- Name
stream
- Type
- boolean
- Description
If true, the response is streamed via Server-Sent Events (SSE).
- Name
stream_options
- Type
- object
- Description
Streaming configuration. If supported, you can request things like usage in the stream (e.g. { "include_usage": true }).
- Name
presence_penalty
- Type
- number
- Description
Penalizes tokens based on whether they appear in the text so far. Typical range: -2.0 to 2.0.
- Name
frequency_penalty
- Type
- number
- Description
Penalizes tokens based on their frequency in the text so far. Typical range: -2.0 to 2.0.
- Name
repetition_penalty
- Type
- number
- Description
(vLLM passthrough) Penalizes repeated tokens; values greater than 1.0 discourage repetition (common in open-source serving stacks).
- Name
logprobs
- Type
- boolean
- Description
If true, returns log probability information for output tokens (if supported by the server/model).
- Name
top_logprobs
- Type
- integer
- Description
Number of most-likely tokens to return at each position (commonly 0–20). Only meaningful when logprobs is enabled.
- Name
logit_bias
- Type
- object
- Description
Modify token likelihoods. Map of token IDs to bias values (e.g. { "1234": -5, "5678": 3 }).
- Name
seed
- Type
- integer
- Description
Best-effort deterministic sampling seed (determinism is not guaranteed across backend changes or parallelism).
- Name
response_format
- Type
- object
- Description
Constrain the output format (if supported). Common shapes include JSON mode / JSON schema depending on your client/server.
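For instance, a JSON-mode request might look like the sketch below (whether the json_object or json_schema variants are honored depends on your server, model, and client version):

```python
# Request JSON-mode output via response_format. Some servers also accept
# {"type": "json_schema", "json_schema": {...}}; check what yours supports.
request_body = {
    "model": "your-model-or-deployment-id",
    "messages": [{"role": "user", "content": "List three colors as a JSON array."}],
    "response_format": {"type": "json_object"},
}
```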
- Name
tools
- Type
- array
- Description
Tool definitions the model may call (e.g., function tools). When provided, the model may respond with tool_calls.
- Name
tool_choice
- Type
- string | object
- Description
Controls tool calling. Use "auto" to allow tool calls, "none" to disable, or an object to force a specific tool.
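To illustrate, a hypothetical weather tool could be declared and then forced via tool_choice like this (the get_weather tool and its parameters are made up for the example; tool-calling support varies by model and chat template):

```python
# Declare a function tool in the OpenAI "tools" shape.
# get_weather is a hypothetical tool used only for illustration.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# "auto" lets the model decide, "none" disables tool calls,
# and an object like this forces one specific tool.
tool_choice = {"type": "function", "function": {"name": "get_weather"}}

request_body = {
    "model": "your-model-or-deployment-id",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": tool_choice,
}
```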
- Name
parallel_tool_calls
- Type
- boolean
- Description
Whether the model may emit multiple tool calls in a single turn (support varies by server/version).
- Name
functions
- Type
- array
- Description
Deprecated. Older function-calling interface. Prefer tools.
- Name
function_call
- Type
- string | object
- Description
Deprecated. Older function-calling interface. Prefer tool_choice.
- Name
user
- Type
- string
- Description
A unique identifier representing your end-user. Support may vary; servers can ignore it.
- Name
metadata
- Type
- object
- Description
Developer-defined metadata to attach to the request (key/value pairs).
- Name
extra_body
- Type
- object
- Description
(Pass-through) Any additional vLLM/server-specific parameters you want to forward without changing the OpenAI-compatible payload. If present, the platform merges this object into the request body sent to the underlying vLLM server.
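As a minimal sketch of that merge, assuming a shallow top-level merge in which extra_body keys are lifted into the request body (check your platform's documented behavior for collision handling):

```python
def merge_extra_body(payload: dict) -> dict:
    """Lift extra_body keys into the top-level body sent to the vLLM server.

    This sketch assumes a shallow merge where extra_body values win on
    key collisions; the platform's actual semantics may differ.
    """
    body = dict(payload)  # leave the caller's payload untouched
    extra = body.pop("extra_body", None) or {}
    body.update(extra)
    return body

payload = {
    "model": "your-model-or-deployment-id",
    "messages": [{"role": "user", "content": "Hi"}],
    "extra_body": {"top_k": 50, "repetition_penalty": 1.05},
}
merged = merge_extra_body(payload)
# merged now carries top_k and repetition_penalty at the top level,
# with no extra_body key remaining.
```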
Request
curl "$BASE_URL/v1/chat/completions" \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "your-model-or-deployment-id",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Give me a one-sentence summary of vLLM." }
],
"temperature": 0.7,
"top_p": 0.9,
"max_tokens": 128,
"stream": false,
"extra_body": {
"top_k": 50,
"repetition_penalty": 1.05
}
}'
Response
{
"id": "chatcmpl_01ABCDEF234567890",
"object": "chat.completion",
"created": 1739251200,
"model": "your-model-or-deployment-id",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "vLLM is a high-throughput LLM serving engine that optimizes inference with efficient batching and memory management."
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 24,
"completion_tokens": 23,
"total_tokens": 47
}
}
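When stream is true, the response body arrives as SSE "data:" lines terminated by data: [DONE], with incremental text under each chunk's choices[].delta. A minimal parsing sketch, run here against a canned transcript rather than a live connection:

```python
import json

def collect_stream_text(sse_lines):
    """Accumulate assistant text from Chat Completions SSE 'data:' lines."""
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            text.append(delta["content"])
    return "".join(text)

# Canned transcript standing in for a live SSE response body.
transcript = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":", world."}}]}',
    "data: [DONE]",
]
print(collect_stream_text(transcript))  # prints "Hello, world."
```

A real client would iterate the HTTP response line by line instead of a list, but the delta-accumulation logic is the same.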