- How to use streaming output;
- Common issues when using streaming output;
- How to handle streaming output without using the Python SDK;
How to Use Streaming Output
Streaming, in a nutshell, means that whenever the Kimi large language model generates a certain number of Tokens (usually 1 Token), it immediately sends them to the client, instead of waiting for all Tokens to be generated before sending anything. When you chat with the Kimi AI Assistant, the assistant’s response appears character by character; this is one manifestation of streaming output. Streaming lets users see the first Token output by the Kimi large language model immediately, reducing wait time. You can enable streaming output by setting stream=True in the request and consuming the streamed response:
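A minimal sketch of a streamed request, assuming the OpenAI-compatible Python SDK; the model name, base URL, and the MOONSHOT_API_KEY environment variable are illustrative assumptions:

```python
import os

from openai import OpenAI

# Assumed endpoint and credentials; adjust to your environment.
client = OpenAI(
    api_key=os.environ["MOONSHOT_API_KEY"],
    base_url="https://api.moonshot.cn/v1",
)

stream = client.chat.completions.create(
    model="moonshot-v1-8k",  # illustrative model name
    messages=[{"role": "user", "content": "Hello, Kimi!"}],
    stream=True,  # enable streaming output
)

# Each chunk carries a small delta of the response; print it as it arrives.
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
```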
Common Issues When Using Streaming Output
Now that you have successfully run the above code and understood the basic principles of streaming output, let’s discuss some details and common issues of streaming output so that you can better implement your business logic.
Interface Details
When streaming output mode is enabled (stream=True), the Kimi large language model no longer returns a response in JSON format (Content-Type: application/json), but uses Content-Type: text/event-stream, the format used by Server-Sent Events (SSE). This response format allows the server to continuously send data to the client. In the context of using the Kimi large language model, it can be understood as the server continuously sending Tokens to the client.
When you look at the HTTP response body of an SSE response, each data chunk starts with the prefix data: , is followed by a valid JSON object, and ends with two newline characters \n\n. Finally, when all data chunks have been transmitted, data: [DONE] indicates that the transmission is complete, at which point the network connection can be closed.
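For illustration, a streamed response body might look like this (the chunk fields follow the Chat Completions format; the values shown here are made up):

```
data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}

data: {"choices":[{"index":0,"delta":{"content":"Hello"}}]}

data: {"choices":[{"index":0,"delta":{"content":"!"},"finish_reason":"stop"}],"usage":{"prompt_tokens":10,"completion_tokens":2,"total_tokens":12}}

data: [DONE]
```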
Token Calculation
When using the streaming output mode, there is more than one way to calculate Tokens. The most straightforward and accurate method is to wait until all data chunks have been transmitted and then check the prompt_tokens, completion_tokens, and total_tokens in the usage field of the last data chunk.
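As an offline illustration of reading usage from the final data chunk, using simplified dictionaries as stand-ins for the real streamed chunk objects:

```python
# Simplified stand-ins for streamed data chunks: per the interface
# described above, only the last chunk carries a "usage" field.
chunks = [
    {"choices": [{"delta": {"role": "assistant", "content": "Hi"}}]},
    {"choices": [{"delta": {"content": " there"}}]},
    {
        "choices": [{"delta": {}}],
        "usage": {"prompt_tokens": 10, "completion_tokens": 2, "total_tokens": 12},
    },
]

usage = None
for chunk in chunks:
    # Most chunks have no usage; remember the last one that does.
    if chunk.get("usage") is not None:
        usage = chunk["usage"]

print(usage["total_tokens"])  # → 12
```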
How to Terminate Output
If you want to stop the streaming output, you can simply close the HTTP connection or discard any subsequent data chunks (when using the Python SDK, breaking out of the loop that iterates over the chunks has the same effect).
How to Handle Streaming Output Without Using an SDK
If you prefer not to use the Python SDK to handle streaming output and instead want to interface directly with the HTTP API (for example, because you are using a language without an SDK, or you have business logic the SDK cannot accommodate), we provide some examples to help you understand how to properly handle the SSE response body over HTTP (we still use Python code as an example here, with detailed explanations provided in comments).
- Initiate an HTTP request and set the stream parameter in the request body to true;
- Receive the response from the server; if the Content-Type in the response headers is text/event-stream, the response content is a streaming output;
- Read the response content line by line and parse the data chunks (each data chunk is presented in JSON format); use the data: prefix and the newline character \n to determine where each data chunk starts and ends;
- Determine whether the transmission is complete by checking whether the current data chunk content is [DONE];
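The steps above can be sketched as follows. To keep the example runnable offline, the HTTP request is replaced by a canned SSE response body; with a real request you would read the lines from the streamed HTTP response instead:

```python
import json

# Canned SSE response body standing in for a real text/event-stream
# response: each data chunk is a `data: ` line followed by a blank line.
sse_body = (
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}\n\n'
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"}}]}\n\n'
    'data: {"choices":[{"index":0,"delta":{"content":"!"}}]}\n\n'
    "data: [DONE]\n\n"
)

content = ""
done = False
for line in sse_body.splitlines():
    # Only lines with the `data: ` prefix carry a data chunk;
    # the blank separator lines are skipped.
    if not line.startswith("data: "):
        continue
    payload = line[len("data: "):]
    if payload == "[DONE]":
        done = True  # transmission complete; the connection can be closed
        break
    chunk = json.loads(payload)  # each chunk is a JSON object
    delta = chunk["choices"][0]["delta"]
    content += delta.get("content", "")

print(content)  # → Hello!
```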
Note: always use the data: [DONE] chunk to determine whether the data has been fully transmitted, rather than finish_reason or other signals. If you have not received the data: [DONE] message chunk, then even if you have already obtained finish_reason=stop, you should not consider the data chunk transmission complete. In other words, until you receive the data: [DONE] chunk, the message should be considered incomplete.
During the streaming output process, only the content field is streamed, meaning each data chunk contains a portion of the content tokens. For fields that do not need to be streamed, such as role and usage, we usually present them all at once in the first or last data chunk, rather than including the role and usage fields in every data chunk (specifically, the role field will only appear in the first data chunk and will not be included in subsequent data chunks; the usage field will only appear in the last data chunk and will not be included in the preceding data chunks).
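As an offline illustration of this (again using simplified chunk shapes), the non-streamed fields can be picked up from the first and last chunks while content accumulates across all of them:

```python
# Simplified chunks: role appears only in the first chunk,
# usage only in the last, while content is spread across all of them.
chunks = [
    {"choices": [{"delta": {"role": "assistant", "content": "Str"}}]},
    {"choices": [{"delta": {"content": "eam"}}]},
    {"choices": [{"delta": {}}], "usage": {"total_tokens": 8}},
]

role = None
content = ""
usage = None
for chunk in chunks:
    delta = chunk["choices"][0]["delta"]
    role = role or delta.get("role")          # taken from the first chunk
    content += delta.get("content", "")       # accumulated across chunks
    usage = chunk.get("usage") or usage       # taken from the last chunk

print(role, content, usage["total_tokens"])  # → assistant Stream 8
```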
Handling n>1
Sometimes, we want to get multiple results to choose from. To do this, set the n parameter in the request to a value greater than 1. Streaming output also supports n>1; in such cases, we need some extra code to check the index value of each data chunk, to figure out which response the chunk belongs to. Let’s illustrate this with example code:
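An offline sketch of grouping chunks by index (the chunks below are made-up stand-ins for a streamed n=2 response, where choices from the two responses arrive interleaved):

```python
# Canned chunks for a request with n=2: each choice carries an "index"
# field identifying which of the n responses it belongs to.
chunks = [
    {"choices": [{"index": 0, "delta": {"content": "First "}}]},
    {"choices": [{"index": 1, "delta": {"content": "Second "}}]},
    {"choices": [{"index": 0, "delta": {"content": "answer"}}]},
    {"choices": [{"index": 1, "delta": {"content": "answer"}}]},
]

messages = {}  # index -> accumulated content for that response
for chunk in chunks:
    for choice in chunk["choices"]:
        idx = choice["index"]
        messages[idx] = messages.get(idx, "") + choice["delta"].get("content", "")

print(messages)  # → {0: 'First answer', 1: 'Second answer'}
```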
When n>1, the key to handling streaming output is to first determine which response message the current data chunk belongs to, based on its index value, and then proceed with further logical processing.