POST /v1/chat/completions

Example request:
curl --request POST \
  --url https://wisdom-gate.juheapi.com/v1/chat/completions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "gpt-4",
  "messages": [
    {
      "role": "user",
      "content": "Hello! How can you help me?"
    }
  ]
}
'

Example response:
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I can help you with a wide variety of tasks. I can answer questions, provide explanations, help with coding, writing, analysis, and much more. What would you like to know or work on?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 20,
    "total_tokens": 30
  }
}

Overview

chat/completions is the most common API endpoint for LLMs. It takes a list of messages that make up a conversation and returns the model's response. This endpoint follows the OpenAI Chat Completions API format, making it easy to integrate with existing OpenAI-compatible code.
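
Because the endpoint follows the OpenAI format, you can point an existing OpenAI SDK client at the Wisdom Gate base URL. A minimal sketch using the official openai Python package (the base URL matches the curl example above; replace YOUR_API_KEY with your own key):
from openai import OpenAI

# Reuse OpenAI-compatible tooling by overriding the base URL
client = OpenAI(
    base_url="https://wisdom-gate.juheapi.com/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello! How can you help me?"}],
)
print(response.choices[0].message.content)

The FAQ snippets below assume a client configured this way.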

Important Notes

Model Differences
Different model providers may support different request parameters and return different response fields. We strongly recommend consulting the model catalog for complete parameter lists and usage instructions for each model.

Response Pass-through Principle
Wisdom Gate typically does not modify model responses beyond format conversion, ensuring you receive response content consistent with the original API provider.

Streaming Support
Wisdom Gate supports Server-Sent Events (SSE) for streaming responses. Set "stream": true in your request to enable real-time streaming, which is useful for chat applications.

Auto-Generated Documentation
The request parameters and response format are automatically generated from the OpenAPI specification. All parameters, their types, descriptions, defaults, and examples are pulled directly from openapi.json. Scroll down to see the interactive API reference.

FAQ

How to handle rate limits?

When you encounter 429 Too Many Requests, we recommend retrying with exponential backoff:
import time
import random

from openai import RateLimitError  # raised by the openai SDK on HTTP 429

# Assumes `client` is an OpenAI client configured as in the Overview example
def chat_with_retry(messages, max_retries=3):
    for i in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=messages
            )
            return response
        except RateLimitError:
            if i < max_retries - 1:
                # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
                wait_time = (2 ** i) + random.random()
                time.sleep(wait_time)
            else:
                raise

How to maintain conversation context?

Include the complete conversation history in the messages array:
conversation_history = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a programming language..."},
    {"role": "user", "content": "What are its advantages?"}
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=conversation_history
)

What does finish_reason mean?

Value            Meaning
stop             Natural completion
length           Reached the max_tokens limit
content_filter   Triggered the content filter
function_call    The model called a function
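
For example, you can branch on finish_reason to detect truncated or filtered output (a minimal sketch, assuming the client configured in the Overview example):
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet"}],
    max_tokens=200,
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # Output was cut off by max_tokens: raise the limit or ask the model to continue
    print("Truncated:", choice.message.content)
elif choice.finish_reason == "content_filter":
    print("Response was filtered")
else:
    print(choice.message.content)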

How to control costs?

  1. Use max_tokens to limit output length
  2. Choose appropriate models (e.g., GPT-3.5 Turbo is more economical than GPT-4)
  3. Streamline prompts, avoid redundant context
  4. Monitor token consumption in the usage field of responses (see the sketch below)
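
Combining points 1 and 4, the sketch below caps output with max_tokens and logs the usage field of each response (the model ID follows OpenAI naming; check the model catalog for the IDs Wisdom Gate exposes):
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Give a one-line summary of HTTP"}],
    max_tokens=60,  # hard cap on output length
)

u = response.usage
print(f"prompt={u.prompt_tokens} completion={u.completion_tokens} total={u.total_tokens}")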

How to use streaming?

Enable streaming by setting stream: true:
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Authorizations

Authorization
string
header
required

Bearer token authentication. Include your API key in the Authorization header as 'Bearer YOUR_API_KEY'

Body

application/json
model
string
required

ID of the model to use for generating responses. See the model catalog for available models and which models work with the Chat API.

Example:

"gpt-4"

messages
object[]
required

A list of messages comprising the conversation so far. Each message should include a role (system, user, or assistant) and content (the message text).

Minimum array length: 1
temperature
number
default:1

Controls the randomness of responses, range 0-2. Lower values (e.g., 0.2) make the output more deterministic and focused, while higher values (e.g., 1.8) make it more random and creative. It's not recommended to adjust both temperature and top_p simultaneously.

Required range: 0 <= x <= 2
top_p
number
default:1

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. It's not recommended to adjust both temperature and top_p simultaneously.

Required range: 0 <= x <= 1
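
For example, to get more deterministic output you might lower temperature and leave top_p at its default (a sketch, assuming the client from the Overview example; values are illustrative):
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "List three synonyms for 'fast'"}],
    temperature=0.2,  # lower = more deterministic; top_p left at its default of 1
)
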
n
integer
default:1

How many chat completion choices to generate for each input message. Range: 1-128.

Required range: 1 <= x <= 128
stream
boolean
default:false

Whether to enable streaming response. When set to true, the response will be returned in chunks as Server-Sent Events (SSE). Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message. This is useful for real-time chat applications.
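
With stream set to true, an OpenAI-compatible server sends data-only SSE lines, each carrying a chat.completion.chunk object, and terminates the stream with data: [DONE]. A minimal sketch that reads the raw event stream with the requests library (chunk fields follow the OpenAI format; exact fields may vary by model):
import json
import requests

resp = requests.post(
    "https://wisdom-gate.juheapi.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":      # end-of-stream sentinel
        break
    chunk = json.loads(payload)   # a chat.completion.chunk object
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="")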

stop

Up to 4 sequences where the API will stop generating further tokens.
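
For instance, you might stop generation at a custom delimiter (the sequence here is illustrative, assuming the client from the Overview example):
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "List five fruits, then write END"}],
    stop=["END"],  # generation halts before this sequence is emitted
)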

max_tokens
integer

Limits the maximum number of tokens to generate. The total length of input tokens and generated tokens is limited by the model's context length.

Required range: x >= 1
presence_penalty
number
default:0

Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

Required range: -2 <= x <= 2
frequency_penalty
number
default:0

Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.

Required range: -2 <= x <= 2
logit_bias
object

Modify the likelihood of specified tokens appearing in the completion. Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling.
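
A sketch of the shape of logit_bias (the token IDs below are placeholders, since real IDs depend on the model's tokenizer; the client is the one from the Overview example):
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Pick a color"}],
    logit_bias={
        "1234": -100,  # placeholder token ID: effectively ban this token
        "5678": 5,     # placeholder token ID: nudge this token upward
    },
)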

Response

Successful chat completion response

Response object for chat completion. Contains the generated response, metadata, and token usage information.

id
string
required

A unique identifier for the chat completion

object
enum<string>
required

The object type, which is always 'chat.completion' for non-streaming responses, or 'chat.completion.chunk' for streaming responses

Available options:
chat.completion,
chat.completion.chunk
created
integer
required

The Unix timestamp (in seconds) of when the chat completion was created

model
string
required

The model used for the chat completion

choices
object[]
required

A list of chat completion choices. Can be more than one if n is greater than 1. Each choice contains the generated message and finish reason.

usage
object

Token usage statistics for the request
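
Putting the response fields together, a client typically iterates over choices (there can be several when n is greater than 1) and records usage (a minimal sketch, assuming the client from the Overview example):
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello! How can you help me?"}],
    n=2,
)

print(response.id, response.model, response.created)
for choice in response.choices:
    print(choice.index, choice.finish_reason)
    print(choice.message.content)
if response.usage:
    print("total tokens:", response.usage.total_tokens)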