Chat Completions API

Use this OpenAI-compatible Chat Completions endpoint to generate responses from your deployed LLM with a simple messages-based interface.

Send a prompt, tune generation settings like temperature and max tokens, and receive a structured response you can plug into your app.


POST /v1/chat/completions

Create a chat completion

Generate a response from an open-source LLM deployment served via vLLM, using an OpenAI-compatible Chat Completions interface.

Compatibility notes (vLLM):

  • Works with the official OpenAI SDKs by pointing base_url (Python) / baseURL (JavaScript) at your deployment, as in the sketch below.
  • Some OpenAI fields may be ignored or partially supported depending on your vLLM version/model/template (e.g., certain multimodal fields). Treat unsupported fields as no-ops.
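
A minimal sketch using the official openai Python SDK; the base URL, API key, and model name are placeholders for your own deployment values.

from openai import OpenAI

# Point the SDK at your deployment instead of api.openai.com.
client = OpenAI(
    base_url="https://your-host.example.com/v1",  # placeholder
    api_key="YOUR_API_KEY",                       # placeholder
)

completion = client.chat.completions.create(
    model="your-model-or-deployment-id",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me a one-sentence summary of vLLM."},
    ],
)
print(completion.choices[0].message.content)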

Required attributes

  • Name
    model
    Type
    string
    Description

    The model (or deployment ID/alias) to use for this request.

  • Name
    messages
    Type
    array
    Description

    The conversation so far. Each item is an object with a role and content. Common roles: system, developer, user, assistant, tool. Content may be a string or, for multimodal-capable setups, an array of content parts such as { "type": "text", "text": "..." }; both shapes are sketched below.
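
A sketch of both content shapes; the image URL is a placeholder, and the content-parts form only applies to multimodal-capable deployments:

# Plain string content: the common case.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize vLLM in one sentence."},
]

# Content-parts array: multimodal-capable deployments only.
multimodal_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    },
]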

Optional attributes

  • Name
    temperature
    Type
    number
    Description

    Sampling temperature. Higher values increase randomness; lower values are more deterministic.

  • Name
    top_p
    Type
    number
    Description

    Nucleus sampling: only tokens within the top_p cumulative probability mass are considered. An alternative to temperature; tune one or the other, not both.

  • Name
    top_k
    Type
    integer
    Description

    (vLLM passthrough) Limits sampling to the top-k tokens. If your vLLM server supports it, this will be applied.

  • Name
    min_p
    Type
    number
    Description

    (vLLM passthrough) Minimum probability threshold for sampling (if supported by your vLLM server).

  • Name
    n
    Type
    integer
    Description

    Number of completions to generate for the same input. Most servers return n choices.

  • Name
    max_tokens
    Type
    integer
    Description

    Maximum number of tokens to generate (legacy OpenAI param; still commonly used by clients).

  • Name
    max_completion_tokens
    Type
    integer
    Description

    Maximum number of completion tokens to generate (newer OpenAI-compatible clients may send this).

  • Name
    min_tokens
    Type
    integer
    Description

    (vLLM passthrough) Minimum number of tokens to generate before stopping (if supported).

  • Name
    stop
    Type
    string | array
    Description

    Up to 4 stop sequences. Generation stops when any stop sequence is encountered.

  • Name
    stream
    Type
    boolean
    Description

    If true, the response is streamed via Server-Sent Events (SSE); see the streaming sketch after this list.

  • Name
    stream_options
    Type
    object
    Description

    Streaming configuration. If supported, you can request things like usage in the stream (e.g. { "include_usage": true }).

  • Name
    presence_penalty
    Type
    number
    Description

    Penalizes tokens based on whether they appear in the text so far. Typical range: -2.0 to 2.0.

  • Name
    frequency_penalty
    Type
    number
    Description

    Penalizes tokens based on their frequency in the text so far. Typical range: -2.0 to 2.0.

  • Name
    repetition_penalty
    Type
    number
    Description

    (vLLM passthrough) Penalizes repeated tokens (common in open-source serving stacks). Values above 1.0 discourage repetition; 1.0 disables the penalty.

  • Name
    logprobs
    Type
    boolean
    Description

    If true, returns log probability information for output tokens (if supported by the server/model).

  • Name
    top_logprobs
    Type
    integer
    Description

    Number of most-likely tokens to return at each position (commonly 0–20). Only meaningful when logprobs is enabled.

  • Name
    logit_bias
    Type
    object
    Description

    Modify token likelihoods. Map of token IDs to bias values (e.g. { "1234": -5, "5678": 3 }).

  • Name
    seed
    Type
    integer
    Description

    Best-effort deterministic sampling seed (determinism is not guaranteed across backend changes or parallelism).

  • Name
    response_format
    Type
    object
    Description

    Constrain the output format (if supported). Common shapes include JSON mode / JSON schema, depending on your client/server; a JSON-mode sketch follows this list.

  • Name
    tools
    Type
    array
    Description

    Tool definitions the model may call (e.g., function tools). When provided, the model may respond with tool_calls; see the tool-calling sketch after this list.

  • Name
    tool_choice
    Type
    string | object
    Description

    Controls tool calling. Use "auto" to allow tool calls, "none" to disable, or an object to force a specific tool.

  • Name
    parallel_tool_calls
    Type
    boolean
    Description

    Whether the model may emit multiple tool calls in a single turn (support varies by server/version).

  • Name
    functions
    Type
    array
    Description

    Deprecated. Older function-calling interface. Prefer tools.

  • Name
    function_call
    Type
    string | object
    Description

    Deprecated. Older function-calling interface. Prefer tool_choice.

  • Name
    user
    Type
    string
    Description

    A unique identifier representing your end-user. Support may vary; servers can ignore it.

  • Name
    metadata
    Type
    object
    Description

    Developer-defined metadata to attach to the request (key/value pairs).

  • Name
    extra_body
    Type
    object
    Description

    (Pass-through) Any additional vLLM/server-specific parameters you want to forward without changing the OpenAI-compatible payload. If present, the platform merges this object into the request body sent to the underlying vLLM server.
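
Streaming sketch, reusing the client from the setup sketch above; include_usage support depends on your vLLM version, and unsupported options should be treated as no-ops:

# Stream tokens as they are generated. With include_usage, the final
# chunk carries usage and an empty choices list, so guard for that.
stream = client.chat.completions.create(
    model="your-model-or-deployment-id",
    messages=[{"role": "user", "content": "Explain continuous batching briefly."}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)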
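
JSON-mode sketch, reusing the same client; whether json_object (or json_schema) is honored depends on your vLLM version and model:

# Ask the server to constrain output to valid JSON (if supported).
response = client.chat.completions.create(
    model="your-model-or-deployment-id",
    messages=[{"role": "user", "content": "Return a JSON object with keys 'name' and 'year' describing vLLM."}],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)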
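
Tool-calling sketch, reusing the same client; get_weather is a hypothetical tool for illustration, and whether the model emits tool_calls (or several in one turn) depends on your model and chat template:

# Define one function tool and let the model decide whether to call it.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical example tool
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="your-model-or-deployment-id",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)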

Request

POST /v1/chat/completions
curl "$BASE_URL/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-or-deployment-id",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Give me a one-sentence summary of vLLM." }
    ],
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 128,
    "stream": false,
    "extra_body": {
      "top_k": 50,
      "repetition_penalty": 1.05
    }
  }'
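
The equivalent call via the openai Python SDK, reusing the client from the setup sketch above; the SDK's extra_body keyword merges the vLLM passthrough fields into the JSON payload, matching the curl body:

completion = client.chat.completions.create(
    model="your-model-or-deployment-id",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me a one-sentence summary of vLLM."},
    ],
    temperature=0.7,
    top_p=0.9,
    max_tokens=128,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},  # vLLM passthrough
)
print(completion.choices[0].message.content)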

Response

{
  "id": "chatcmpl_01ABCDEF234567890",
  "object": "chat.completion",
  "created": 1739251200,
  "model": "your-model-or-deployment-id",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "vLLM is a high-throughput LLM serving engine that optimizes inference with efficient batching and memory management."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 23,
    "total_tokens": 47
  }
}
