POST /v1/chat/completions

Example request:
curl --request POST \
  --url https://wisdom-gate.juheapi.com/v1/chat/completions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "gpt-4",
  "messages": [
    {
      "role": "user",
      "content": "Hello! How can you help me?"
    }
  ]
}
'

Example response:
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I can help you with a wide variety of tasks. I can answer questions, provide explanations, help with coding, writing, analysis, and much more. What would you like to know or work on?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 20,
    "total_tokens": 30
  }
}

Overview

chat/completions is the most common API endpoint for LLMs. It takes a list of messages that make up a conversation and returns the model's response. This endpoint follows the OpenAI Chat Completions API format, making it easy to integrate with existing OpenAI-compatible code.
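
Because the endpoint follows the OpenAI format, you can point an existing OpenAI SDK client at the Wisdom Gate base URL. A minimal sketch using the official openai Python package (the base URL matches the curl example above; replace YOUR_API_KEY with your own key):
from openai import OpenAI

# Reuse OpenAI-compatible tooling by overriding the base URL
client = OpenAI(
    base_url="https://wisdom-gate.juheapi.com/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello! How can you help me?"}],
)
print(response.choices[0].message.content)

The FAQ snippets below assume a client configured this way.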

Important Notes

Model Differences
Different model providers may support different request parameters and return different response fields. We strongly recommend consulting the model catalog for complete parameter lists and usage instructions for each model.

Response Pass-through Principle
Wisdom Gate typically does not modify model responses beyond format conversion, ensuring you receive response content consistent with the original API provider.

Streaming Support
Wisdom Gate supports Server-Sent Events (SSE) for streaming responses. Set "stream": true in your request to enable real-time streaming, which is useful for chat applications.

Auto-Generated Documentation
The request parameters and response format are automatically generated from the OpenAPI specification. All parameters, their types, descriptions, defaults, and examples are pulled directly from openapi.json. Scroll down to see the interactive API reference.

FAQ

How to handle rate limits?

When you encounter 429 Too Many Requests, we recommend retrying with exponential backoff:
import time
import random

from openai import RateLimitError  # raised by the openai SDK on HTTP 429

# Assumes `client` is an OpenAI client configured as in the Overview example
def chat_with_retry(messages, max_retries=3):
    for i in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=messages
            )
            return response
        except RateLimitError:
            if i < max_retries - 1:
                # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
                wait_time = (2 ** i) + random.random()
                time.sleep(wait_time)
            else:
                raise

How to maintain conversation context?

Include the complete conversation history in the messages array:
conversation_history = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a programming language..."},
    {"role": "user", "content": "What are its advantages?"}
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=conversation_history
)

What does finish_reason mean?

Value            Meaning
stop             Natural completion
length           Reached the max_tokens limit
content_filter   Triggered the content filter
function_call    The model called a function
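
For example, you can branch on finish_reason to detect truncated or filtered output (a minimal sketch, assuming the client configured in the Overview example):
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet"}],
    max_tokens=200,
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # Output was cut off by max_tokens: raise the limit or ask the model to continue
    print("Truncated:", choice.message.content)
elif choice.finish_reason == "content_filter":
    print("Response was filtered")
else:
    print(choice.message.content)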

How to control costs?

  1. Use max_tokens to limit output length
  2. Choose appropriate models (e.g., GPT-3.5 Turbo is more economical than GPT-4)
  3. Streamline prompts, avoid redundant context
  4. Monitor token consumption in the usage field of responses (see the sketch below)
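
Combining points 1 and 4, the sketch below caps output with max_tokens and logs the usage field of each response (the model ID follows OpenAI naming; check the model catalog for the IDs Wisdom Gate exposes):
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Give a one-line summary of HTTP"}],
    max_tokens=60,  # hard cap on output length
)

u = response.usage
print(f"prompt={u.prompt_tokens} completion={u.completion_tokens} total={u.total_tokens}")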

How to use streaming?

Enable streaming by setting stream: true:
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Authorizations

Authorization
string
header
required

Bearer token authentication. Include your API key in the Authorization header as 'Bearer YOUR_API_KEY'

Body

application/json
model
string
required

ID of the model to use for generating responses. See the model catalog for available models and which models work with the Chat API.

Example:

"gpt-4"

messages
object[]
required

A list of messages comprising the conversation so far. Each message should include a role (system, user, or assistant) and content (the message text).

Minimum array length: 1
temperature
number
default:1

Controls the randomness of responses, range 0-2. Lower values (e.g., 0.2) make the output more deterministic and focused, while higher values (e.g., 1.8) make it more random and creative. It's not recommended to adjust both temperature and top_p simultaneously.

Required range: 0 <= x <= 2
top_p
number
default:1

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. It's not recommended to adjust both temperature and top_p simultaneously.

Required range: 0 <= x <= 1
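
For example, to get more deterministic output you might lower temperature and leave top_p at its default (a sketch, assuming the client from the Overview example; values are illustrative):
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "List three synonyms for 'fast'"}],
    temperature=0.2,  # lower = more deterministic; top_p left at its default of 1
)
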
n
integer
default:1

How many chat completion choices to generate for each input message. Range: 1-128.

Required range: 1 <= x <= 128
stream
boolean
default:false

Whether to enable streaming response. When set to true, the response will be returned in chunks as Server-Sent Events (SSE). Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message. This is useful for real-time chat applications.
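
With stream set to true, an OpenAI-compatible server sends data-only SSE lines, each carrying a chat.completion.chunk object, and terminates the stream with data: [DONE]. A minimal sketch that reads the raw event stream with the requests library (chunk fields follow the OpenAI format; exact fields may vary by model):
import json
import requests

resp = requests.post(
    "https://wisdom-gate.juheapi.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":      # end-of-stream sentinel
        break
    chunk = json.loads(payload)   # a chat.completion.chunk object
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="")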

stop

Up to 4 sequences where the API will stop generating further tokens.
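
For instance, you might stop generation at a custom delimiter (the sequence here is illustrative, assuming the client from the Overview example):
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "List five fruits, then write END"}],
    stop=["END"],  # generation halts before this sequence is emitted
)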

max_tokens
integer

Limits the maximum number of tokens to generate. The total length of input tokens and generated tokens is limited by the model's context length.

Required range: x >= 1
presence_penalty
number
default:0

Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

Required range: -2 <= x <= 2
frequency_penalty
number
default:0

Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.

Required range: -2 <= x <= 2
logit_bias
object

Modify the likelihood of specified tokens appearing in the completion. Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling.
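
A sketch of the shape of logit_bias (the token IDs below are placeholders, since real IDs depend on the model's tokenizer; the client is the one from the Overview example):
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Pick a color"}],
    logit_bias={
        "1234": -100,  # placeholder token ID: effectively ban this token
        "5678": 5,     # placeholder token ID: nudge this token upward
    },
)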

Response

Successful chat completion response

Response object for chat completion. Contains the generated response, metadata, and token usage information.

id
string
required

A unique identifier for the chat completion

object
enum<string>
required

The object type, which is always 'chat.completion' for non-streaming responses, or 'chat.completion.chunk' for streaming responses

Available options:
chat.completion,
chat.completion.chunk
created
integer
required

The Unix timestamp (in seconds) of when the chat completion was created

model
string
required

The model used for the chat completion

choices
object[]
required

A list of chat completion choices. Can be more than one if n is greater than 1. Each choice contains the generated message and finish reason.

usage
object

Token usage statistics for the request
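
Putting the response fields together, a client typically iterates over choices (there can be several when n is greater than 1) and records usage (a minimal sketch, assuming the client from the Overview example):
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello! How can you help me?"}],
    n=2,
)

print(response.id, response.model, response.created)
for choice in response.choices:
    print(choice.index, choice.finish_reason)
    print(choice.message.content)
if response.usage:
    print("total tokens:", response.usage.total_tokens)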