Chat Completions API
Use this OpenAI-compatible Chat Completions endpoint to generate responses from your deployed LLM with a simple messages-based interface.
Send a prompt, tune generation settings like temperature and max tokens, and receive a structured response you can plug into your app.
Create a chat completion
Generate a response from an open-source LLM deployment served via vLLM, using an OpenAI-compatible Chat Completions interface.
Compatibility notes (vLLM):
- Works with OpenAI SDKs by setting base_url/baseURL.
- Some OpenAI fields may be ignored or partially supported depending on your vLLM version/model/template (e.g., certain multimodal fields). Treat unsupported fields as no-ops.
Required attributes
- Name
model
- Type
- string
- Description
The model (or deployment ID/alias) to use for this request.
- Name
messages
- Type
- array
- Description
The conversation so far. Each item is an object with a role and content. Common roles: system, developer, user, assistant, tool. Content may be a string, or (for multimodal-capable setups) an array of content parts such as { "type": "text", "text": "..." }.
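As a sketch, a messages array mixing plain-string content with content parts might be built like this (the field names follow the OpenAI Chat Completions shape; whether content parts are honored depends on your model and chat template):

```python
# Build an OpenAI-style messages array: a plain-string system message plus
# a user message whose content is an array of typed parts.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize this image in one sentence."},
            # "image_url" parts are only honored by multimodal-capable
            # models/templates; other servers may ignore or reject them.
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    },
]

request_body = {"model": "your-model-or-deployment-id", "messages": messages}
```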
Optional attributes
- Name
temperature
- Type
- number
- Description
Sampling temperature. Higher values increase randomness; lower values make output more deterministic.
- Name
top_p
- Type
- number
- Description
Nucleus sampling. Alternative to temperature; tune one or the other.
- Name
top_k
- Type
- integer
- Description
(vLLM passthrough) Limits sampling to the top-k tokens. If your vLLM server supports it, this will be applied.
- Name
min_p
- Type
- number
- Description
(vLLM passthrough) Minimum probability threshold for sampling (if supported by your vLLM server).
- Name
n
- Type
- integer
- Description
Number of completions to generate for the same input. Most servers return n choices.
- Name
max_tokens
- Type
- integer
- Description
Maximum number of tokens to generate (legacy OpenAI param; still commonly used by clients).
- Name
max_completion_tokens
- Type
- integer
- Description
Maximum number of completion tokens to generate (newer OpenAI-compatible clients may send this).
- Name
min_tokens
- Type
- integer
- Description
(vLLM passthrough) Minimum number of tokens to generate before stopping (if supported).
- Name
stop
- Type
- string | array
- Description
Up to 4 stop sequences. Generation stops when any stop sequence is encountered.
- Name
stream
- Type
- boolean
- Description
If true, the response is streamed via Server-Sent Events (SSE).
- Name
stream_options
- Type
- object
- Description
Streaming configuration. If supported, you can request things like usage in the stream (e.g. { "include_usage": true }).
- Name
presence_penalty
- Type
- number
- Description
Penalizes tokens based on whether they appear in the text so far. Typical range: -2.0 to 2.0.
- Name
frequency_penalty
- Type
- number
- Description
Penalizes tokens based on their frequency in the text so far. Typical range: -2.0 to 2.0.
- Name
repetition_penalty
- Type
- number
- Description
(vLLM passthrough) Penalizes repeated tokens; values greater than 1.0 discourage repetition (common in open-source serving stacks).
- Name
logprobs
- Type
- boolean
- Description
If true, returns log probability information for output tokens (if supported by the server/model).
- Name
top_logprobs
- Type
- integer
- Description
Number of most-likely tokens to return at each position (commonly 0–20). Only meaningful when logprobs is enabled.
- Name
logit_bias
- Type
- object
- Description
Modify token likelihoods. Map of token IDs to bias values (e.g. { "1234": -5, "5678": 3 }).
- Name
seed
- Type
- integer
- Description
Best-effort deterministic sampling seed (determinism is not guaranteed across backend changes or parallelism).
- Name
response_format
- Type
- object
- Description
Constrain the output format (if supported). Common shapes include JSON mode / JSON schema depending on your client/server.
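For instance, a JSON-mode request might look like the sketch below (whether the json_object or json_schema variants are honored depends on your server, model, and client version):

```python
# Request JSON-mode output via response_format. Some servers also accept
# {"type": "json_schema", "json_schema": {...}}; check what yours supports.
request_body = {
    "model": "your-model-or-deployment-id",
    "messages": [{"role": "user", "content": "List three colors as a JSON array."}],
    "response_format": {"type": "json_object"},
}
```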
- Name
tools
- Type
- array
- Description
Tool definitions the model may call (e.g., function tools). When provided, the model may respond with tool_calls.
- Name
tool_choice
- Type
- string | object
- Description
Controls tool calling. Use "auto" to allow tool calls, "none" to disable, or an object to force a specific tool.
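To illustrate, a hypothetical weather tool could be declared and then forced via tool_choice like this (the get_weather tool and its parameters are made up for the example; tool-calling support varies by model and chat template):

```python
# Declare a function tool in the OpenAI "tools" shape.
# get_weather is a hypothetical tool used only for illustration.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# "auto" lets the model decide, "none" disables tool calls,
# and an object like this forces one specific tool.
tool_choice = {"type": "function", "function": {"name": "get_weather"}}

request_body = {
    "model": "your-model-or-deployment-id",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": tool_choice,
}
```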
- Name
parallel_tool_calls
- Type
- boolean
- Description
Whether the model may emit multiple tool calls in a single turn (support varies by server/version).
- Name
functions
- Type
- array
- Description
Deprecated. Older function-calling interface. Prefer tools.
- Name
function_call
- Type
- string | object
- Description
Deprecated. Older function-calling interface. Prefer tool_choice.
- Name
user
- Type
- string
- Description
A unique identifier representing your end-user. Support may vary; servers can ignore it.
- Name
metadata
- Type
- object
- Description
Developer-defined metadata to attach to the request (key/value pairs).
- Name
extra_body
- Type
- object
- Description
(Pass-through) Any additional vLLM/server-specific parameters you want to forward without changing the OpenAI-compatible payload. If present, the platform merges this object into the request body sent to the underlying vLLM server.
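As a minimal sketch of that merge, assuming a shallow top-level merge in which extra_body keys are lifted into the request body (check your platform's documented behavior for collision handling):

```python
def merge_extra_body(payload: dict) -> dict:
    """Lift extra_body keys into the top-level body sent to the vLLM server.

    This sketch assumes a shallow merge where extra_body values win on
    key collisions; the platform's actual semantics may differ.
    """
    body = dict(payload)  # leave the caller's payload untouched
    extra = body.pop("extra_body", None) or {}
    body.update(extra)
    return body

payload = {
    "model": "your-model-or-deployment-id",
    "messages": [{"role": "user", "content": "Hi"}],
    "extra_body": {"top_k": 50, "repetition_penalty": 1.05},
}
merged = merge_extra_body(payload)
# merged now carries top_k and repetition_penalty at the top level,
# with no extra_body key remaining.
```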
Request
curl "$BASE_URL/v1/chat/completions" \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "your-model-or-deployment-id",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Give me a one-sentence summary of vLLM." }
],
"temperature": 0.7,
"top_p": 0.9,
"max_tokens": 128,
"stream": false,
"extra_body": {
"top_k": 50,
"repetition_penalty": 1.05
}
}'
Response
{
"id": "chatcmpl_01ABCDEF234567890",
"object": "chat.completion",
"created": 1739251200,
"model": "your-model-or-deployment-id",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "vLLM is a high-throughput LLM serving engine that optimizes inference with efficient batching and memory management."
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 24,
"completion_tokens": 23,
"total_tokens": 47
}
}
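When stream is true, the response body arrives as SSE "data:" lines terminated by data: [DONE], with incremental text under each chunk's choices[].delta. A minimal parsing sketch, run here against a canned transcript rather than a live connection:

```python
import json

def collect_stream_text(sse_lines):
    """Accumulate assistant text from Chat Completions SSE 'data:' lines."""
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            text.append(delta["content"])
    return "".join(text)

# Canned transcript standing in for a live SSE response body.
transcript = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":", world."}}]}',
    "data: [DONE]",
]
print(collect_stream_text(transcript))  # prints "Hello, world."
```

A real client would iterate the HTTP response line by line instead of a list, but the delta-accumulation logic is the same.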