- For any unlisted or closed-source benchmark: set temperature = 1.0, stream = true, top_p = 0.95
- Reasoning benchmarks: max_tokens = 128k, and run at least 500–1000 samples to get low variance (e.g. AIME 2025: 32 runs → 30 × 32 = 960 questions)
- Coding benchmarks: max_tokens = 256k
- Agentic task benchmarks:
  - For multi-hop search: max_tokens = 256k + context management
  - Others: max_tokens ≥ 16k–64k
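The per-category defaults above can be collected in a small helper. This is an illustrative sketch only; the function name and category keys are not part of any official SDK:

```python
# Hypothetical helper mapping the benchmark categories above to the
# recommended sampling settings (names are illustrative, not an official API).
def recommended_settings(category: str) -> dict:
    base = {"temperature": 1.0, "stream": True, "top_p": 0.95}
    max_tokens = {
        "reasoning": 128_000,         # plus >= 500-1000 samples for low variance
        "coding": 256_000,
        "multi_hop_search": 256_000,  # plus a context-management mechanism
        "other_agentic": 64_000,      # recommended range is 16k-64k
    }
    return {**base, "max_tokens": max_tokens.get(category, 64_000)}
```

For example, `recommended_settings("reasoning")` yields the 128k-token reasoning configuration with the shared temperature/stream/top_p defaults.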
K2.5 Models Benchmark Recommended Settings
| Benchmark Category | Benchmark | Temperature | Recommended max tokens | Recommended runs | Top-p | Others (e.g. test log) |
|---|---|---|---|---|---|---|
| Multi-modal | MMMU-Pro | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | CharXiv (RQ) | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | MathVision | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | MathVista | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | OCRBench | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | ZeroBench | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | WorldVQA | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | InfoVQA (val) | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | SimpleVQA | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | ZeroBench w/ tools | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"}; recommended max steps = 30 |
| Code | SWE Series | 1.0 | per step tokens = 16k; total max tokens = 256k | 5 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | LCB + OJBench | 1.0 | max tokens = 128k | 1 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | TerminalBench | 1.0 | max tokens = 128k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| Reasoning | AIME2025 no tools | 1.0 | total max tokens = 96k | 32 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | AIME2025 w/ tools | 1.0 | per turn tokens = 96k; total max tokens = 96k | 32 | top_p = 0.95 | thinking = {"type": "enabled"}; recommended max steps = 120 |
| | HLE no tools | 1.0 | max tokens = 96k | 1 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | HLE w/ tools | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | top_p = 0.95 | thinking = {"type": "enabled"}; recommended max steps = 120 |
| | HLE heavy | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | top_p = 0.95 | thinking = {"type": "enabled"}; recommended max steps = 200; parallel n = 8 |
| | HMMT2025 no tools | 1.0 | max tokens = 96k | 32 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | HMMT2025 w/ tools | 1.0 | per step tokens = 96k; total max tokens = 96k | 32 | top_p = 0.95 | thinking = {"type": "enabled"}; recommended max steps = 120 |
| | IMO-AnswerBench | 1.0 | max tokens = 96k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | GPQA-Diamond | 1.0 | max tokens = 96k | 8 | top_p = 0.95 | thinking = {"type": "enabled"} |
| Agentic Search Task | BrowseComp / BrowseComp-ZH / Seal-0 / Frames | 1.0 | per step tokens = 24k; total max tokens = 256k | 4 | top_p = 0.95 | thinking = {"type": "enabled"}; recommended max steps = 250. Use a context-management mechanism to prevent overly long context and ensure enough tool calls. Include today's date in the system prompt and let the model search when it is uncertain. |
| Agentic Task | Tau | 1.0 | >= 16k | 4 | top_p = 0.95 | thinking = {"type": "enabled"}; recommended max steps = 100 |
When thinking is set to `{"type": "enabled"}`, please note the following constraints to ensure model performance:

- `tool_choice` can only be set to "auto" or "none" (default is "auto"), to avoid conflicts between reasoning content and a forced tool choice; any other value will result in an error.
- During multi-step tool calling, you must keep the `reasoning_content` from the assistant message of the current turn's tool call in the context, otherwise an error will be thrown.
- The official builtin `$web_search` tool is temporarily incompatible with Kimi K2.5 thinking mode; to use `$web_search`, disable thinking mode first.
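The second constraint can be enforced with a small guard when building the message history. This is a sketch, assuming dict-shaped messages in the OpenAI-compatible chat format; the guard function itself is illustrative, not an official API:

```python
# Sketch of the reasoning_content rule above: when appending an assistant
# turn that makes tool calls, its reasoning_content must be preserved in
# the context, or the next request will fail.
def append_assistant_turn(messages: list, assistant_msg: dict) -> list:
    if assistant_msg.get("tool_calls") and "reasoning_content" not in assistant_msg:
        raise ValueError("keep reasoning_content for tool-calling assistant turns")
    messages.append(assistant_msg)
    return messages
```

Tool results are then appended as usual after the guarded assistant message.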
K2-Thinking Series Models Benchmark Recommended Settings
| Category | Benchmark | Temperature | Max tokens | Suggested runs | Notes |
|---|---|---|---|---|---|
| Code | SWE | 0.7 (recommended); 1.0 (ok) | per step tokens = 16k; total max tokens = 256k | 5 | |
| | LCB + OJBench | 1.0 | max tokens = 128k | 1 | |
| | TerminalBench | 1.0 | max tokens = 128k | 3 | |
| Reasoning | AIME2025 no tools | 1.0 | total max tokens = 96k | 32 | |
| | AIME2025 w/ tools | 1.0 | per step tokens = 48k; total max tokens = 128k | 16 | max steps = 120 |
| | HLE no tools | 1.0 | max tokens = 96k | 1 | |
| | HLE w/ tools | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | max steps = 120 |
| | HLE heavy | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | max steps = 200; parallel n = 8 |
| | HMMT2025 no tools | 1.0 | max tokens = 96k | 32 | |
| | HMMT2025 w/ tools | 1.0 | per step tokens = 96k; total max tokens = 96k | 32 | max steps = 120 |
| | IMO-AnswerBench | 1.0 | max tokens = 96k | 3 | |
| | GPQA-Diamond | 1.0 | max tokens = 96k | 8 | |
| Agentic Search Task | BrowseComp / BrowseComp-ZH / Seal-0 / Frames | 1.0 | per step tokens = 24k; total max tokens = 256k | 4 | max steps = 250. Enable context management to prevent context overflow and ensure enough tool calls. Include today's date in the system prompt, and tell the model to search when unsure. |
| Agentic Task | Tau | 0.0 | >= 16k | 4 | max steps = 100 |
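Both tables recommend a context-management mechanism for agentic search but do not prescribe one. A minimal sketch, assuming plain dict-shaped chat messages, is to replace the oldest tool results with a stub once the transcript exceeds a character budget, keeping the most recent results intact:

```python
# One possible context-management mechanism (an assumption, not a vendor
# scheme): stub out the oldest tool results once the transcript is too long,
# always keeping the last `keep_last` tool results verbatim.
def trim_tool_results(messages, max_chars=200_000, keep_last=5, stub="[truncated]"):
    tool_idx = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    total = sum(len(m.get("content", "")) for m in messages)
    for i in (tool_idx[:-keep_last] if keep_last else tool_idx):
        if total <= max_chars:
            break
        total -= len(messages[i].get("content", "")) - len(stub)
        messages[i]["content"] = stub
    return messages
```

A token-based budget (counting with the model's tokenizer) would be more faithful to the per-step token limits above; characters are used here only to keep the sketch self-contained.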
API Recommendations & Notes
- Use the official API: some third-party endpoints show noticeable accuracy drift.
- Use the recommended models for testing:
  - For the K2 series: use `kimi-k2-thinking-turbo` for faster inference.
  - For K2.5: use `kimi-k2.5` for testing.
- You must set `stream = true`: non-streaming mode can lead to random mid-connection interruptions that are hard to control.
- Current API default settings:
  - Kimi K2 Thinking:
    - default temperature = 1.0
    - default max_tokens = 64000
  - Kimi K2.5:
    - default max_tokens = 32768
    - default thinking = `{"type": "enabled"}`
    - default temperature = 1.0
    - default top_p = 0.95
    - default n = 1
    - default presence_penalty = 0.0
    - default frequency_penalty = 0.0
- Timeouts:
  - With `stream = false`, the `api.moonshot.ai` timeout is 2 hours, but some ISPs may terminate the connection earlier.
  - So again, we recommend setting `stream = true`.
- Concurrency: keep concurrency low to avoid rate limiting.
- Retry logic is not optional:
  - Handle overloaded-server errors.
  - Handle unexpected finish reasons caused by random server issues.
  - Handle errors caused by complicated network issues.
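The retry requirements above can be sketched as a wrapper with exponential backoff and jitter. `call` stands in for your actual API request; the exception types and `finish_reason` values below are assumptions for illustration, not the exact errors the service raises:

```python
import random
import time

# Minimal retry sketch for the failure modes listed above: overloaded server,
# unexpected finish reason, and flaky network. Exception types are stand-ins.
def call_with_retries(call, max_attempts=5, base_delay=1.0):
    last_err = None
    for attempt in range(max_attempts):
        try:
            result = call()
        except (ConnectionError, TimeoutError) as exc:
            last_err = exc  # network / overloaded-server failure: retry
        else:
            if result.get("finish_reason") in ("stop", "tool_calls"):
                return result
            last_err = RuntimeError(
                "unexpected finish_reason: %r" % result.get("finish_reason"))
        if attempt < max_attempts - 1:
            # exponential backoff with jitter keeps concurrency pressure low
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise last_err
```

Combined with a low concurrency cap, this covers the three failure classes listed above without hammering the endpoint.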
FAQ
Q1. Is the temperature setting consistent across models?

A. No. Different model families use different recommended temperatures:
- K2.5 model: temperature = 1.0
- K2-Thinking series: temperature = 1.0
- other K2 series: temperature = 0.6
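The FAQ answer can be captured as a small lookup helper; the substring matching below is an assumption based on the model names used earlier in this document, not an official rule:

```python
# Hedged sketch of the FAQ answer: pick a recommended temperature from the
# model name. Matching by substring is illustrative only.
def recommended_temperature(model: str) -> float:
    if "k2.5" in model or "k2-thinking" in model:
        return 1.0
    return 0.6  # other K2 series
```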