The Vision models (moonshot-v1-8k-vision-preview / moonshot-v1-32k-vision-preview / moonshot-v1-128k-vision-preview / kimi-k2.5, and so on) can understand visual content, including text in images, colors, and the shapes of objects. The latest kimi-k2.5 model can also understand video content.
Using base64 to Upload Images Directly
Here is how you can ask Kimi questions about an image. Note that the message.content field changes from str to List[Dict] (i.e., a JSON array). Do not serialize the JSON array and place it into message.content as a str: this prevents Kimi from correctly identifying the image type and may trigger the "Your request exceeded model token limit" error.
✅ Correct Format:
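Below is a minimal sketch of the correct message format, built as plain Python data. The model name and the data-URL convention follow the OpenAI-compatible style these docs use; the helper name `build_image_message` is illustrative, not part of the SDK.

```python
import base64

def build_image_message(image_bytes: bytes, question: str) -> dict:
    """Build a vision message whose content is a List[Dict], not a str."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        # content is a JSON array of parts -- do NOT json.dumps() it
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
            {"type": "text", "text": question},
        ],
    }

# The message can then be passed to the OpenAI-compatible SDK, e.g.:
# client.chat.completions.create(
#     model="moonshot-v1-8k-vision-preview",
#     messages=[build_image_message(png_bytes, "Describe this image.")],
# )
```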
Using Uploaded Images or Videos
In the previous example, our image_url was a base64-encoded image. Since video files are often larger, we provide an additional method: first upload images or videos to Moonshot, then reference them by file ID. For information on uploading images or videos, please refer to Image Understanding Upload.
video_url.url takes the form ms://<file-id>, where ms is short for "moonshot storage", Moonshot's internal protocol for referencing files.
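A hedged sketch of building such a message: the file ID would come from the upload API described above, and the placeholder value here is illustrative only.

```python
def build_video_message(file_id: str, question: str) -> dict:
    """Reference an already-uploaded video by file ID via the ms:// protocol."""
    return {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                # ms://<file-id> -- Moonshot's internal "moonshot storage" scheme
                "video_url": {"url": f"ms://{file_id}"},
            },
            {"type": "text", "text": question},
        ],
    }

# Usage (the file ID below is a placeholder, not a real ID):
message = build_video_message("cr1234567890abcdef", "Summarize this video.")
```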
Supported Formats
Images support the following formats:
- png
- jpeg
- webp
- gif

Videos support the following formats:
- mp4
- mpeg
- mov
- avi
- x-flv
- mpg
- webm
- wmv
- 3gpp
Token Calculation and Costs
Images and videos use dynamic token calculation. You can obtain the token consumption of a request containing images or videos through the estimate tokens API before starting the understanding process. Generally speaking, the higher the image resolution, the more tokens it consumes. Videos are composed of several key frames: the more key frames and the higher their resolution, the more tokens are consumed. The Vision models follow the same pricing model as the moonshot-v1 series, with costs based on the total tokens used for model inference. For more details on token pricing, please refer to:
Model Inference Pricing
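As a sketch of the estimation step, the snippet below builds the payload for the estimate-tokens call. The endpoint path and response shape shown in the comments are assumptions based on the Moonshot API reference; verify them against the current docs before relying on them.

```python
# Base URL as used throughout the Moonshot docs.
API_BASE = "https://api.moonshot.cn/v1"

def estimate_tokens_payload(model: str, messages: list) -> dict:
    """Build the request body for the estimate tokens API (same shape as chat)."""
    return {"model": model, "messages": messages}

payload = estimate_tokens_payload(
    "moonshot-v1-8k-vision-preview",
    [{"role": "user", "content": [{"type": "text", "text": "Describe the image."}]}],
)

# To send it (requires an API key; endpoint path is an assumption here):
# import requests
# resp = requests.post(
#     f"{API_BASE}/tokenizers/estimate-token-count",
#     headers={"Authorization": f"Bearer {API_KEY}"},
#     json=payload,
# )
```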
Best Practices
Resolution
We recommend that image resolution not exceed 4K (4096×2160) and video resolution not exceed 2K (2048×1080). Higher resolutions only increase input-processing time without improving the model's understanding.
File Upload vs base64
Due to our overall request body size limitations, very large videos should be processed using the file upload method for visual understanding. We also recommend the file upload method for images or videos that need to be referenced multiple times. Regarding file upload limitations, please refer to the File Upload documentation.
Features and Limitations
The Vision model supports the following features:
- Multi-turn conversations
- Streaming output
- Tool invocation
- JSON Mode
- Partial Mode
It has the following limitations:
- URL-formatted images: not supported; currently only base64-encoded image content and images/videos uploaded via file ID are supported
- Image quantity: the Vision model has no limit on the number of images, but the request body size must not exceed 100 MB
Parameters Differences in Request Body
The parameters are the same as those listed in Chat. However, some parameters behave differently with k2.5 models. We recommend using the default values rather than configuring these parameters manually. The differences are listed below.

| Field | Required | Description | Type | Values |
|---|---|---|---|---|
| max_tokens | optional | The maximum number of tokens to generate for the chat completion | int | Defaults to 32768 (32k) |
| thinking | optional | New! Controls whether thinking is enabled for this request | object | Defaults to {"type": "enabled"}. The value can only be {"type": "enabled"} or {"type": "disabled"} |
| temperature | optional | The sampling temperature to use | float | k2.5 uses a fixed value of 1.0 (0.6 in non-thinking mode); any other value results in an error |
| top_p | optional | A sampling method | float | k2.5 uses a fixed value of 0.95; any other value results in an error |
| n | optional | The number of results to generate for each input message | int | k2.5 uses a fixed value of 1; any other value results in an error |
| presence_penalty | optional | Penalizes new tokens based on whether they appear in the text so far | float | k2.5 uses a fixed value of 0.0; any other value results in an error |
| frequency_penalty | optional | Penalizes new tokens based on their existing frequency in the text | float | k2.5 uses a fixed value of 0.0; any other value results in an error |
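The practical upshot of the table: for k2.5, set only `thinking` (and `max_tokens` if you need a different cap) and omit the fixed parameters entirely, since sending any non-default value for them is an error. The helper below is an illustrative sketch of such a request body; `build_k25_request` is not an SDK function.

```python
def build_k25_request(messages: list, thinking: bool = True) -> dict:
    """Build a kimi-k2.5 request body that leaves fixed parameters at defaults."""
    return {
        "model": "kimi-k2.5",
        "messages": messages,
        # max_tokens defaults to 32768; add it only if you need a different cap.
        "thinking": {"type": "enabled" if thinking else "disabled"},
        # temperature, top_p, n, presence_penalty, frequency_penalty are fixed
        # for k2.5 -- sending other values raises an error, so we omit them.
    }

body = build_k25_request(
    [{"role": "user", "content": "Hello"}],
    thinking=False,
)
```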