Cortex LLM REST API¶
Snowflake Cortex LLM Functions provide natural language processing features powered by a variety of large language models (LLMs) using SQL or Python. For more information, see Large Language Model (LLM) Functions (Snowflake Cortex).
The Snowflake Cortex LLM REST API gives you access to the COMPLETE function from any programming language that can make HTTP POST requests, allowing you to bring state-of-the-art AI functionality to your applications. Using this API does not require a warehouse.
The Cortex LLM REST API streams generated tokens back to the user as server-sent events.
Cost considerations¶
Snowflake Cortex LLM REST API requests incur compute costs based on the number of tokens processed. Refer to the Snowflake Service Consumption Table for each function’s cost in credits per million tokens. A token is the smallest unit of text processed by Snowflake Cortex LLM functions, approximately equal to four characters of text. The equivalence of raw input or output text to tokens can vary by model.
The COMPLETE function generates new text given an input prompt. Both input and output tokens incur compute cost. If you use COMPLETE to provide a conversational or chat user experience, all previous prompts and responses are processed to generate each new response, with corresponding costs.
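To see why chat-style usage costs more with each turn, consider the following rough sketch. It assumes the approximate ratio of four characters per token mentioned above (actual tokenization varies by model). The helper names `estimate_tokens` and `conversation_tokens` are illustrative and are not part of any Snowflake API.

```python
import math

CHARS_PER_TOKEN = 4  # rough approximation from the text above; varies by model


def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count of a piece of text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def conversation_tokens(turns: list[tuple[str, str]]) -> int:
    """Estimate total tokens processed across a chat of (prompt, response) turns.

    Each request resends all earlier prompts and responses as input,
    so the input of turn N includes everything from turns 1..N-1.
    """
    total = 0
    history = 0  # estimated tokens of all earlier prompts and responses
    for prompt, response in turns:
        prompt_tokens = estimate_tokens(prompt)
        response_tokens = estimate_tokens(response)
        total += history + prompt_tokens + response_tokens
        history += prompt_tokens + response_tokens
    return total
```

Because the history is reprocessed on every request, the estimated total grows faster than the length of the conversation itself.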
Usage quotas¶
To ensure a high standard of performance for all Snowflake customers, Snowflake Cortex LLM REST API requests are subject to usage quotas beyond which requests may be throttled. Snowflake may adjust these quotas from time to time. The quotas in the table below are applied per account and are independently applied for each model.
Function (Model) | Tokens Processed per Minute (TPM) | Requests per Minute (RPM) |
---|---|---|
COMPLETE (llama2-70b-chat) | 200,000 | 100 |
COMPLETE (llama3-8b) | 400,000 | 200 |
COMPLETE (llama3-70b) | 200,000 | 100 |
COMPLETE (llama3.1-8b) | 400,000 | 200 |
COMPLETE (llama3.1-70b) | 200,000 | 100 |
COMPLETE (llama3.1-405b) | 100,000 | 50 |
COMPLETE (reka-core) | 200,000 | 100 |
COMPLETE (reka-flash) | 200,000 | 100 |
COMPLETE (mistral-large) | 200,000 | 100 |
COMPLETE (mixtral-8x7b) | 200,000 | 100 |
COMPLETE (mistral-7b) | 400,000 | 200 |
COMPLETE (jamba-instruct) | 100,000 | 50 |
COMPLETE (gemma-7b) | 400,000 | 200 |
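If you want to stay under these quotas rather than react to throttling, you can track usage on the client side. The following is a minimal sketch that assumes a sliding 60-second window; how Snowflake actually measures the quotas is not described here, so treat this as an approximation. The `QuotaTracker` class is hypothetical and not part of any Snowflake library.

```python
import time
from collections import deque


class QuotaTracker:
    """Track requests and tokens over a sliding 60-second window."""

    def __init__(self, tpm_limit: int, rpm_limit: int, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.rpm_limit = rpm_limit
        self._clock = clock
        self._events = deque()  # one (timestamp, tokens) entry per accepted request

    def _prune(self, now: float) -> None:
        # Drop requests that have left the 60-second window.
        while self._events and now - self._events[0][0] >= 60:
            self._events.popleft()

    def try_acquire(self, tokens: int) -> bool:
        """Record the request and return True if it fits both limits."""
        now = self._clock()
        self._prune(now)
        used_tokens = sum(t for _, t in self._events)
        if len(self._events) + 1 > self.rpm_limit:
            return False
        if used_tokens + tokens > self.tpm_limit:
            return False
        self._events.append((now, tokens))
        return True


# Limits for llama3.1-405b from the table above: 100,000 TPM and 50 RPM.
tracker = QuotaTracker(tpm_limit=100_000, rpm_limit=50)
```

Because quotas apply per account and per model, a real application would likely keep one tracker per model and share it across all clients that use the same account.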
COMPLETE endpoint¶
The /api/v2/cortex/inference:complete endpoint executes the SQL COMPLETE function. It takes the form:
POST https://<account_identifier>.snowflakecomputing.com/api/v2/cortex/inference:complete
where account_identifier is the account identifier you use to access Snowsight.
Note
Currently, only the COMPLETE function is supported. Additional functions may be supported in a future version of the Cortex LLM REST API.
Setting up authentication¶
Authenticating to the Cortex LLM REST API uses key-pair authentication. This requires creating an RSA key pair and assigning its public key to a user, which must be done using the SECURITYADMIN role (or another role that has had SECURITYADMIN granted, such as ACCOUNTADMIN). For step-by-step instructions, see Configuring key-pair authentication.
Tip
Consider creating a dedicated user for Cortex LLM REST API requests.
To make API requests, use your private key to generate a JSON Web token (JWT) and pass it in the headers of the request.
Submitting requests¶
You make a request to the Cortex LLM REST API by POSTing to the API’s REST endpoint. The Authorization header must contain a JSON Web token generated from your key pair, which you can create using snowsql with the following command. The generated JWT expires after one hour.
snowsql -a <account_identifier> -u <user> --private-key-path <path>/rsa_key.p8 --generate-jwt
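Because the token expires after one hour, a long-running client needs to know when to generate a new one. The sketch below assumes the generated JWT carries the standard `exp` (expiration time) claim; it reads that claim without verifying the signature, so it is suitable only for deciding when to refresh a token you created yourself. The function names and the five-minute margin are illustrative choices.

```python
import base64
import json
import time
from typing import Optional


def jwt_seconds_remaining(token: str, now: Optional[float] = None) -> float:
    """Return the seconds left until the JWT's `exp` claim (signature not verified)."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded with padding stripped; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    current = time.time() if now is None else now
    return claims["exp"] - current


def needs_refresh(token: str, margin_seconds: float = 300,
                  now: Optional[float] = None) -> bool:
    """Return True when the token expires within `margin_seconds`."""
    return jwt_seconds_remaining(token, now) <= margin_seconds
```

Refreshing a few minutes before expiry avoids sending a request whose token lapses while the response is still streaming.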
The body of the request is a JSON object that specifies the model, the prompt or conversation history, and options. See the following API Reference for details.
API Reference¶
POST /api/v2/cortex/inference:complete¶
Completes a prompt or conversation using the specified large language model. The body of the request is a JSON object containing the arguments.
This endpoint corresponds to the COMPLETE SQL function.
Required headers¶
X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT
Defines the type of authorization token.
Authorization: Bearer jwt
Authorization for the request. jwt is a valid JSON Web token.
Content-Type: application/json
Specifies that the body of the request is in JSON format.
Accept: application/json, text/event-stream
Specifies that the response will either contain JSON (error case) or server-sent events.
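As an illustration, the endpoint URL and the required headers can be assembled with Python's standard library. This is a sketch, not an official client: `build_complete_request` is a hypothetical helper, and the request body is passed in as a dictionary whose contents are described in the sections that follow.

```python
import json
import urllib.request


def build_complete_request(account_identifier: str, jwt: str,
                           body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the COMPLETE endpoint."""
    url = (
        f"https://{account_identifier}.snowflakecomputing.com"
        "/api/v2/cortex/inference:complete"
    )
    headers = {
        "X-Snowflake-Authorization-Token-Type": "KEYPAIR_JWT",
        "Authorization": f"Bearer {jwt}",
        "Content-Type": "application/json",
        # The response is JSON in the error case and server-sent events otherwise.
        "Accept": "application/json, text/event-stream",
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would send the request; the response object can then be read line by line as the stream arrives.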
Required JSON arguments¶
Argument | Type | Description |
---|---|---|
model | string | The identifier of the model to be used (see Choosing a model). Must be one of the supported model identifiers. |
messages | array | The prompt or conversation history to be used to generate a completion. An array of objects representing a conversation in chronological order. Each object must contain a content key and may also contain a role key. See the COMPLETE roles table for a more detailed description of these roles. For prompts consisting of a single user message, role may be omitted. |
Optional JSON arguments¶
Argument | Type | Default | Description |
---|---|---|---|
top_p | number | 1.0 | A value from 0 to 1 (inclusive) that controls the diversity of the language model by restricting the set of possible tokens that the model outputs. |
temperature | number | 0.0 | A value from 0 to 1 (inclusive) that controls the randomness of the output of the language model by influencing which possible token is chosen at each step. |
max_tokens | integer | 4096 | The maximum number of tokens to output. Output is truncated after this number of tokens. Note You can set |
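Checking argument ranges before sending a request can help you avoid an "invalid options object" error. The following sketch builds the JSON body and validates the optional arguments against the ranges described above. The key names `top_p`, `temperature`, and `max_tokens` are the names this sketch assumes for the three options, and `build_complete_body` is a hypothetical helper.

```python
from typing import Optional


def build_complete_body(
    model: str,
    messages: list[dict],
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
) -> dict:
    """Return a request body, raising ValueError for out-of-range options."""
    if not messages:
        raise ValueError("messages must contain at least one message")
    body = {"model": model, "messages": messages}
    for name, value in (("top_p", top_p), ("temperature", temperature)):
        if value is None:
            continue  # omitted options fall back to the server-side defaults
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1 inclusive")
        body[name] = value
    if max_tokens is not None:
        if not isinstance(max_tokens, int) or max_tokens <= 0:
            raise ValueError("max_tokens must be a positive integer")
        body["max_tokens"] = max_tokens
    return body
```

Leaving an option as `None` keeps it out of the body entirely, so the documented default applies.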
Output¶
Tokens are sent as they are generated using server-sent events (SSEs). Each SSE event uses the message type and contains a JSON object with the following structure.
Key | Value type | Description |
---|---|---|
id | string | Unique ID of the request, the same value for all events sent in response to the request. |
created | number | UNIX timestamp (seconds since midnight, January 1, 1970) when the response was generated. |
model | string | Identifier of the model. |
choices | array | The model’s responses. Each response is an object containing a delta key whose value is an object with a content key that holds the newly generated text. |
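To turn the event stream back into text, a client has to split the stream into events and join the generated fragments. The sketch below follows the general server-sent events format (`data:` lines, with a blank line ending each event) and assumes each choice carries its text under `delta` and `content`, as in the example output later in this document. The function names are illustrative, and fields other than `data:` are ignored.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON object carried by each server-sent event."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # The SSE format allows one optional space after the colon.
            value = line[len("data:"):]
            data_lines.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and data_lines:
            # A blank line ends the event; multi-line data is joined with "\n".
            yield json.loads("\n".join(data_lines))
            data_lines = []
    if data_lines:  # the stream ended without a trailing blank line
        yield json.loads("\n".join(data_lines))


def collect_text(events: Iterable[dict]) -> str:
    """Concatenate the delta content of every choice in every event."""
    parts = []
    for event in events:
        for choice in event.get("choices", []):
            parts.append(choice.get("delta", {}).get("content", ""))
    return "".join(parts)
```

For interactive use you would typically display each fragment as it arrives instead of collecting the whole response first.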
Status codes¶
The Snowflake Cortex LLM REST API uses the following HTTP status codes to indicate successful completion or various error conditions.
- 200 OK: Request completed successfully. The body of the response contains the output of the model.
- 400 invalid options object: The optional arguments have invalid values.
- 400 unknown model model_name: The specified model does not exist.
- 400 max tokens of count exceeded: The request exceeded the maximum number of tokens supported by the model (see Model restrictions).
- 400 all requests were throttled by remote service: The request has been throttled due to a high level of usage. Try again later.
- 402 budget exceeded: The model consumption budget was exceeded.
- 403 Not Authorized: Account not enabled for REST API, or the default role for the calling user does not have the snowflake.cortex_user database role.
- 429 too many requests: The request was rejected because the usage quota has been exceeded. Please try your request later.
- 503 inference timed out: The request took too long.
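Some of these errors are transient, and others will fail again no matter how often you retry. The following sketch shows one possible retry policy: it treats 429, 503, and the throttling variant of 400 as retryable and waits with exponential backoff and jitter. Retrying on 503 and the specific delays are design assumptions of this sketch, not documented Snowflake guidance.

```python
import random

RETRYABLE_STATUS = {429, 503}


def is_retryable(status: int, message: str = "") -> bool:
    """Decide whether a failed request is worth retrying."""
    if status in RETRYABLE_STATUS:
        return True
    # Throttling by the remote service is reported as a 400 with this wording.
    return status == 400 and "throttled" in message.lower()


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Return a random delay in seconds; `attempt` starts at 0."""
    # Full jitter: pick uniformly between 0 and the capped exponential bound.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Errors such as 402, 403, and the other 400 variants indicate a problem with the request or the account, so repeating the same request is unlikely to help.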
Example¶
The following example uses curl to make a COMPLETE request. Replace jwt, prompt, and account_identifier with the appropriate values in this command.
curl -X POST \
    -H 'X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT' \
    -H "Authorization: Bearer <jwt>" \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json, text/event-stream' \
    -d '{
    "model": "mistral-large",
    "messages": [
        {
            "content": "<prompt>"
        }
    ],
    "top_p": 0,
    "temperature": 0
    }' \
    https://<account_identifier>.snowflakecomputing.com/api/v2/cortex/inference:complete
Output¶
data: {
data:   "id": "65c5e2ac-529b-461e-8a8c-f80655e6bd3f",
data:   "created": 1723493954,
data:   "model": "mistral-7b",
data:   "choices": [
data:     {
data:       "delta": {
data:         "content": "Cor"
data:       }
data:     }
data:   ],
data:   "usage": {
data:     "prompt_tokens": 57,
data:     "completion_tokens": 1,
data:     "total_tokens": 58
data:   }
data: }

data: {
data:   "id": "65c5e2ac-529b-461e-8a8c-f80655e6bd3f",
data:   "created": 1723493954,
data:   "model": "mistral-7b",
data:   "choices": [
data:     {
data:       "delta": {
data:         "content": "tex"
data:       }
data:     }
data:   ],
data:   "usage": {
data:     "prompt_tokens": 57,
data:     "completion_tokens": 2,
data:     "total_tokens": 59
data:   }
data: }
Python API¶
To install the Python API, use:
pip install snowflake-ml-python
The Python API is included in the snowflake-ml-python package starting with version 1.6.1.
Example¶
To use the Python API, first create a Snowflake session (see Creating a Session for Snowpark Python). Then call the Complete API. The REST back end is used only when stream=True is specified.
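from snowflake.snowpark import Session
from snowflake.cortex import Complete

# Supply your own connection parameters in place of the ellipsis.
session = Session.builder.configs(...).create()

# stream=True selects the REST back end and returns the response incrementally.
stream = Complete(
    "mistral-7b",
    "What are unique features of the Snowflake SQL dialect?",
    session=session,
    stream=True)

# Print each update as it arrives.
for update in stream:
    print(update)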
Note
The streaming mode of the Python API currently doesn’t work in stored procedures or in Snowsight.