REF: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_stream_completions.ipynb
How to stream completions
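By default, when you request a completion from OpenAI, the entire completion is generated before being sent back in a single response.
If you're generating long completions, waiting for the response can take many seconds.
To get responses sooner, you can "stream" the completion as it's being generated. This lets you start printing or processing the beginning of the completion before the full completion is finished.
To stream completions, set stream=True when calling the chat completions or completions endpoints. This returns an object that streams back the response as data-only server-sent events. Extract chunks from the delta field rather than the message field.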
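To make "data-only server-sent events" concrete, here is a minimal sketch of how such a stream could be decoded by hand. It assumes each event arrives as a single line of the form `data: <JSON>` and that the stream ends with a `data: [DONE]` sentinel. The helper name `iter_sse_json` and the sample lines are illustrative only; the openai library does this decoding for you when you iterate over the response.

```python
import json

def iter_sse_json(lines):
    """Yield the decoded JSON payload of each data-only SSE line.

    Hypothetical helper: assumes every event is a single `data: ...` line
    and that `data: [DONE]` marks the end of the stream.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything else
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel, nothing more to decode
        yield json.loads(payload)

# Illustrative input, shaped like the chunks printed later in this notebook
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}, "finish_reason": null, "index": 0}]}',
    '',
    'data: {"choices": [{"delta": {"content": "2"}, "finish_reason": null, "index": 0}]}',
    '',
    'data: [DONE]',
]

events = list(iter_sse_json(sample_lines))
for event in events:
    print(event["choices"][0]["delta"])
```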
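Downsides
Note that using stream=True in a production application makes it more difficult to moderate the content of the completions, because partial completions may be harder to evaluate. This has implications for approved usage.
Another small drawback of streaming responses is that the response no longer includes the usage field telling you how many tokens were consumed. After receiving and combining all of the responses, you can calculate this yourself using tiktoken.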
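As a rough sketch of that do-it-yourself count, the function below joins the streamed pieces and counts tokens with an encoder. The name `count_streamed_tokens` and the `encoding` parameter are illustrative. When no encoder is passed, the function assumes the third-party tiktoken package is installed and looks up the encoding with `tiktoken.encoding_for_model`. The count covers only the completion text and may not match the usage field of a non-streamed response exactly.

```python
def count_streamed_tokens(pieces, model="gpt-3.5-turbo", encoding=None):
    """Count tokens in the text formed by joining streamed content pieces.

    `encoding` can be any object with an `encode(text)` method that returns
    a list of tokens. If omitted, tiktoken is imported (assumption: the
    package is installed) and the encoding for `model` is looked up.
    """
    if encoding is None:
        import tiktoken  # third-party; imported lazily so it stays optional
        encoding = tiktoken.encoding_for_model(model)
    full_text = "".join(pieces)
    return len(encoding.encode(full_text))
```

For example, `count_streamed_tokens(collected_pieces)` would return the token count for a hypothetical list `collected_pieces` holding the content strings gathered from the stream.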
Example code
Below, this notebook shows:
- What a typical chat completion response looks like
- What a streaming chat completion response looks like
- How much time is saved by streaming a chat completion
- How to stream non-chat completions (used by older models like text-davinci-003)
# imports
import openai # for OpenAI API calls
import time # for measuring time duration of API calls
1. What a typical chat completion response looks like
With a typical ChatCompletions API call, the response is first computed and then returned all at once.
# Example of an OpenAI ChatCompletion request
# https://platform.openai.com/docs/guides/chat
# record the time before the request is sent
start_time = time.time()
# send a ChatCompletion request to count to 100
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
    temperature=0,
)
# calculate the time it took to receive the response
response_time = time.time() - start_time
# print the time delay and text received
print(f"Full response received {response_time:.2f} seconds after request")
print(f"Full response received:\n{response}")
Full response received 3.03 seconds after request
Full response received:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "\n\n1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100.",
"role": "assistant"
}
}
],
"created": 1677825456,
"id": "chatcmpl-6ptKqrhgRoVchm58Bby0UvJzq2ZuQ",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion",
"usage": {
"completion_tokens": 301,
"prompt_tokens": 36,
"total_tokens": 337
}
}
The reply can be extracted with response['choices'][0]['message'].
The content of the reply can be extracted with response['choices'][0]['message']['content'].
reply = response['choices'][0]['message']
print(f"Extracted reply: \n{reply}")
reply_content = response['choices'][0]['message']['content']
print(f"Extracted content: \n{reply_content}")
Extracted reply:
{
"content": "\n\n1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100.",
"role": "assistant"
}
Extracted content:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100.
2. How to stream a chat completion
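With a streaming API call, the response is sent back incrementally in chunks via an event stream. In Python, you can iterate over these events with a for loop.
Let's see what it looks like: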
# Example of an OpenAI ChatCompletion request with stream=True
# https://platform.openai.com/docs/guides/chat
# a ChatCompletion request
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': "What's 1+1? Answer in one word."}
    ],
    temperature=0,
    stream=True  # this time, we set stream=True
)
# iterate over the stream and print each chunk as it arrives
for chunk in response:
    print(chunk)
{
"choices": [
{
"delta": {
"role": "assistant"
},
"finish_reason": null,
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
{
"choices": [
{
"delta": {
"content": "\n\n"
},
"finish_reason": null,
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
{
"choices": [
{
"delta": {
"content": "2"
},
"finish_reason": null,
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
{
"choices": [
{
"delta": {},
"finish_reason": "stop",
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
As you can see above, streaming responses have a delta field rather than a message field. delta can hold things like:
- a role token (e.g., {"role": "assistant"})
- a content token (e.g., {"content": "\n\n"})
- nothing (e.g., {}), when the stream is over
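Putting these three cases together, a full message can be rebuilt by merging the deltas in order. The following is a minimal sketch, not the library's own method: `merge_deltas` is a hypothetical helper, and the sample chunks are trimmed copies of the output above, written as plain dicts.

```python
def merge_deltas(chunks):
    """Rebuild a {"role": ..., "content": ...} message from streamed chunks.

    Hypothetical helper; it reads only choices[0] of each chunk.
    """
    message = {"role": None, "content": ""}
    finish_reason = None
    for chunk in chunks:
        choice = chunk["choices"][0]
        delta = choice["delta"]
        if "role" in delta:  # role token, sent in the first chunk above
            message["role"] = delta["role"]
        if "content" in delta:  # content token, appended in arrival order
            message["content"] += delta["content"]
        if choice["finish_reason"] is not None:  # empty delta: stream is over
            finish_reason = choice["finish_reason"]
    return message, finish_reason

# Trimmed copies of the four chunks printed above
sample_chunks = [
    {"choices": [{"delta": {"role": "assistant"}, "finish_reason": None, "index": 0}]},
    {"choices": [{"delta": {"content": "\n\n"}, "finish_reason": None, "index": 0}]},
    {"choices": [{"delta": {"content": "2"}, "finish_reason": None, "index": 0}]},
    {"choices": [{"delta": {}, "finish_reason": "stop", "index": 0}]},
]

message, finish_reason = merge_deltas(sample_chunks)
print(message)
print(finish_reason)
```

Keeping the role and content in one dict mirrors the shape of the message field in a non-streamed response, so downstream code can handle both cases the same way.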