Create chat completions using the OpenAI-compatible API format, with support for multiple AI models
`chat/completions` is the most common API endpoint for LLMs: it takes a list of conversation messages as input and returns the model's response. This endpoint follows the OpenAI Chat Completions API format, making it easy to integrate with existing OpenAI-compatible code.
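For example, here is a minimal request sketch using the official `openai` Python client, assuming the service exposes an OpenAI-compatible base URL (the URL, key, and model name below are placeholders):

```python
from openai import OpenAI

# Placeholder base URL and key; substitute your provider's actual values.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)

print(response.choices[0].message.content)
```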
If you receive a 429 Too Many Requests response, we recommend retrying with exponential backoff:
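A minimal retry sketch along those lines, reusing the hypothetical `client` from the example above and the `openai` library's `RateLimitError`:

```python
import random
import time

from openai import RateLimitError

def create_with_backoff(messages, model="gpt-4", max_retries=5):
    """Retry on HTTP 429, doubling the wait each attempt and adding jitter."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff: roughly 1s, 2s, 4s, 8s, ... plus jitter.
            time.sleep(2 ** attempt + random.random())
```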
The conversation history is passed in the messages array:
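For example, a multi-turn conversation (the contents are purely illustrative) might be passed as:

```python
messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "What does the temperature parameter do?"},
    {"role": "assistant", "content": "It controls how random the output is."},
    {"role": "user", "content": "What value should I use for code generation?"},
]
```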
Each choice in the response includes a finish_reason indicating why the model stopped generating:

| Value | Meaning |
|---|---|
| stop | Natural completion |
| length | Reached the max_tokens limit |
| content_filter | Triggered the content filter |
| function_call | Model called a function |
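For example, to detect output that was truncated by `max_tokens` (a sketch reusing the `response` object from the first example):

```python
choice = response.choices[0]
if choice.finish_reason == "length":
    # The reply hit the max_tokens limit and may be cut off mid-sentence;
    # consider raising max_tokens or asking the model to continue.
    print("Warning: response was truncated.")
```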
A few practical tips:
- Use `max_tokens` to limit output length.
- Check the `usage` field of responses to track token consumption.
- For real-time applications, enable streaming with `stream: true`:
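A streaming sketch, again assuming the hypothetical `client` above; each chunk carries an incremental delta rather than a full message:

```python
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a haiku about the sea."}],
    stream=True,
)

for chunk in stream:
    # Some chunks (e.g., the final one) may carry no content delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```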
Bearer token authentication. Include your API key in the Authorization header as 'Bearer YOUR_API_KEY'
model: ID of the model to use for generating responses. See the model catalog for available models and which models work with the Chat API. Example: "gpt-4".
messages: A list of messages comprising the conversation so far. Each message should include a role (system, user, or assistant) and content (the message text).
temperature: Controls the randomness of responses. Range: 0 to 2, default 1. Lower values (e.g., 0.2) make the output more deterministic and focused, while higher values (e.g., 1.8) make it more random and creative. It's not recommended to adjust both temperature and top_p simultaneously.
top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. Range: 0 to 1. So 0.1 means only the tokens comprising the top 10% probability mass are considered. It's not recommended to adjust both temperature and top_p simultaneously.
n: How many chat completion choices to generate for each input message. Range: 1 to 128.
stream: Whether to enable streaming responses. When set to true, the response is returned in chunks as Server-Sent Events (SSE). Tokens are sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message. This is useful for real-time chat applications.
stop: Up to 4 sequences where the API will stop generating further tokens.
max_tokens: Limits the maximum number of tokens to generate (must be at least 1). The total length of input tokens and generated tokens is limited by the model's context length.
presence_penalty: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
frequency_penalty: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
logit_bias: Modify the likelihood of specified tokens appearing in the completion. Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling.
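A request sketch combining several of these parameters (the values are illustrative, not recommendations):

```python
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet."}],
    temperature=0.3,       # more deterministic output
    max_tokens=256,        # cap the length of the generated reply
    n=1,                   # a single completion choice
    stop=["\n\n"],         # stop at the first blank line
    presence_penalty=0.5,  # nudge the model toward new topics
)
```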
Successful chat completion response
Response object for chat completion. Contains the generated response, metadata, and token usage information.
id: A unique identifier for the chat completion.
object: The object type, which is always chat.completion for non-streaming responses, or chat.completion.chunk for streaming responses.
created: The Unix timestamp (in seconds) of when the chat completion was created.
model: The model used for the chat completion.
choices: A list of chat completion choices. Can be more than one if n is greater than 1. Each choice contains the generated message and finish reason.
usage: Token usage statistics for the request.
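Putting these fields together, a non-streaming response has roughly this shape (the values are illustrative):

```python
example_response = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "created": 1700000000,
    "model": "gpt-4",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I help?"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 9, "total_tokens": 21},
}
```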