Text Generation Model

Moonshot’s text generation model (referred to as moonshot-v1) is trained to understand natural language and written text, and generates text output based on the input provided. The input to the model is also known as a “prompt.” We generally recommend that you provide clear instructions and some examples to enable the model to complete the intended task; designing a prompt is essentially learning how to “train” the model. The moonshot-v1 model can be used for a variety of tasks, including content or code generation, summarization, conversation, and creative writing.

Language Model Inference Service

The language model inference service is an API service based on the pre-trained models developed and trained by us (Moonshot AI). In terms of design, we primarily offer a Chat Completions interface externally, which can be used to generate text. However, it does not support access to external resources such as the internet or databases, nor does it support the execution of any code.

Token

Text generation models process text in units called Tokens. A Token represents a common sequence of characters. For example, a long English word like “antidisestablishmentarianism” might be broken down into a combination of several Tokens, while a short and common word like “word” might be represented by a single Token. Generally speaking, for typical English text, 1 Token is roughly equivalent to 3-4 English characters. It is important to note that for our text model, the total length of Input and Output cannot exceed the model’s maximum context length.
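The rough character-to-Token ratio above can be turned into a quick length estimate before sending a request. This is only a heuristic sketch (the 3.5 characters-per-Token figure is an assumed midpoint, not an exact tokenizer), useful for sanity-checking that Input plus Output will fit in the context window:

```python
def estimate_tokens(text: str, chars_per_token: float = 3.5) -> int:
    """Rough Token estimate for English text, assuming ~3-4 characters
    per Token. For exact counts you would need the model's tokenizer."""
    return max(1, round(len(text) / chars_per_token))


# A short common word is likely a single Token...
short = estimate_tokens("word")
# ...while a long word is likely split into several Tokens.
long_word = estimate_tokens("antidisestablishmentarianism")
```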

Rate Limits

How do these rate limits work? Rate limits are measured in four ways: concurrency, RPM (requests per minute), TPM (Tokens per minute), and TPD (Tokens per day). A request can be rejected under any of these categories, depending on which limit is hit first. For example, if your RPM limit is 20 and your TPM limit is 200k, sending 20 ChatCompletions requests of only 100 Tokens each would hit the RPM limit, even though those 20 requests total just 2,000 Tokens, far below the TPM limit.

For convenience, the gateway calculates rate limits based on the max_tokens parameter in the request: if your request includes max_tokens, we use that value; if it does not, we use the default max_tokens value. After you make a request, we determine whether you have reached the rate limit based on the number of Tokens in your request plus the max_tokens value, regardless of how many Tokens are actually generated. Billing, by contrast, is calculated based on the number of Tokens in your request plus the number of Tokens actually generated.

Other Important Notes:

  • Rate limits are enforced at the user level, not the key level.
  • Currently, we share rate limits across all models.
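The distinction between rate-limit accounting and billing described above can be sketched as two small functions. The DEFAULT_MAX_TOKENS value here is a hypothetical placeholder; the service's actual default may differ:

```python
from typing import Optional

DEFAULT_MAX_TOKENS = 1024  # hypothetical default; the real default may differ


def tokens_for_rate_limit(prompt_tokens: int, max_tokens: Optional[int]) -> int:
    """Tokens counted against rate limits: prompt Tokens plus max_tokens
    (the value from the request, or the default if none was given),
    regardless of how many Tokens are actually generated."""
    if max_tokens is None:
        max_tokens = DEFAULT_MAX_TOKENS
    return prompt_tokens + max_tokens


def tokens_billed(prompt_tokens: int, generated_tokens: int) -> int:
    """Tokens billed: prompt Tokens plus the Tokens actually generated."""
    return prompt_tokens + generated_tokens
```

Note that a request can count heavily against your rate limit (via a large max_tokens) while still being billed only for what was actually generated.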

Model List

For all available models and their capabilities, see the Model List page.

Usage Guide

Getting an API Key

You need an API key to use our service. You can create an API key in our Console.

Sending Requests

You can use our Chat Completions API to send requests. You need to provide an API key and a model name, and you can either use the default max_tokens value or set the max_tokens parameter yourself. See the Chat API documentation for details on how to call the interface.
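As a minimal sketch, a request can be assembled as a JSON payload with the model name and messages, authenticated with the API key as a bearer token. The endpoint URL, model name, and field names below follow the common OpenAI-compatible convention and are assumptions; check the Chat API documentation for the exact values:

```python
import json


def build_chat_request(api_key: str, model: str, messages: list,
                       max_tokens: int = None):
    """Build the URL, headers, and JSON body for a Chat Completions call.
    URL and payload shape are assumed OpenAI-compatible conventions."""
    url = "https://api.moonshot.cn/v1/chat/completions"  # assumed endpoint
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {"model": model, "messages": messages}
    if max_tokens is not None:
        body["max_tokens"] = max_tokens  # otherwise the service default applies
    return url, headers, json.dumps(body)


# Example payload (model name is illustrative):
url, headers, body = build_chat_request(
    api_key="YOUR_API_KEY",
    model="moonshot-v1-8k",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=512,
)
```

The request could then be sent with any HTTP client, e.g. `requests.post(url, headers=headers, data=body)`.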

Handling Responses

Generally, we set a 2-hour timeout; if a single request exceeds this time, we return a 504 error. If your request exceeds the rate limit, we return a 429 error. If your request is successful, we return a response in JSON format.

If you need to process tasks quickly, you can use the non-streaming mode of our Chat Completions API, which returns all the generated text in a single response. If you need more control, you can use the streaming mode, which returns an SSE (Server-Sent Events) stream from which you can read the generated text as it is produced. Streaming can provide a better user experience, and it also lets you interrupt the request at any time without wasting resources.
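The status codes above suggest a simple client-side policy: treat 429 (rate limited) and 504 (timeout) as retryable, and everything else non-2xx as a hard failure. The following is an illustrative sketch, not prescribed behavior; the retry count and classification are assumptions:

```python
RETRYABLE_STATUSES = {429, 504}  # rate limited; request exceeded the timeout


def next_action(status: int, attempt: int, max_attempts: int = 3) -> str:
    """Classify a response status into 'ok', 'retry', or 'fail'.

    - 200: success, parse the JSON (or SSE stream) response.
    - 429/504: retry with backoff, up to max_attempts tries.
    - anything else: treat as a hard failure.
    """
    if status == 200:
        return "ok"
    if status in RETRYABLE_STATUSES and attempt < max_attempts:
        return "retry"
    return "fail"
```

In a real client the "retry" branch would typically sleep with exponential backoff before re-sending the request.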