Cortex LLM REST API

Snowflake Cortex LLM Functions provide natural language processing features powered by a variety of large language models (LLMs) using SQL or Python. For more information, see Large Language Model (LLM) Functions (Snowflake Cortex).

The Snowflake Cortex LLM REST API gives you access to the COMPLETE function from any programming language that can make HTTP POST requests, allowing you to bring state-of-the-art AI functionality to your applications. Using this API does not require a warehouse.

The Cortex LLM REST API streams generated tokens back to the user as server-sent events.

Cost considerations

Snowflake Cortex LLM REST API requests incur compute costs based on the number of tokens processed. Refer to the Snowflake Service Consumption Table for each function’s cost in credits per million tokens. A token is the smallest unit of text processed by Snowflake Cortex LLM functions, approximately equal to four characters of text. The equivalence of raw input or output text to tokens can vary by model.

The COMPLETE function generates new text given an input prompt. Both input and output tokens incur compute cost. If you use COMPLETE to provide a conversational or chat user experience, all previous prompts and responses are processed to generate each new response, with corresponding costs.

Model availability

The following table contains the models that are available in the Cortex LLM REST API.

Function
(Model)
AWS US West 2
(Oregon)
AWS US East 1
(N. Virginia)
AWS Europe Central 1
(Frankfurt)
AWS Europe West 1
(Ireland)
AWS AP Southeast 2
(Sydney)
AWS AP Northeast 1
(Tokyo)
Azure East US 2
(Virginia)
Azure West Europe
(Netherlands)
COMPLETE
(llama3.1-8b)

COMPLETE
(llama3.1-70b)

COMPLETE
(llama3.1-405b)

COMPLETE
(mistral-7b)

COMPLETE
(mistral-large2)

Usage quotas

To ensure a high standard of performance for all Snowflake customers, Snowflake Cortex LLM REST API requests are subject to usage quotas beyond which requests may be throttled. Snowflake may adjust these quotas from time to time. The quotas in the table below are applied per account and are independently applied for each model.

Function
(Model)
Tokens Processed
per Minute (TPM)
Requests per
Minute (RPM)
COMPLETE
(llama3.1-8b)

400,000

200

COMPLETE
(llama3.1-70b)

200,000

100

COMPLETE
(llama3.1-405b)

100,000

50

COMPLETE
(mistral-7b)

400,000

200

COMPLETE
(mistral-large2)

200,000

100

COMPLETE endpoint

The /api/v2/cortex/inference:complete endpoint executes the SQL COMPLETE function. It takes the form:

POST https://<account_identifier>.snowflakecomputing.com/api/v2/cortex/inference:complete

where account_identifier is the account identifier you use to access Snowsight.

Note

Currently, only the COMPLETE function is supported. Additional functions may be supported in a future version of the Cortex LLM REST API.

Setting up authentication

Authenticating to the Cortex LLM REST API uses key-pair authentication. This requires creating an RSA key pair and assigning its public key to a user, which must be done using the SECURITYADMIN role (or another role that has had SECURITYADMIN granted, such as ACCOUNTADMIN). For step-by-step instructions, see Configuring key-pair authentication.

Tip

Consider creating a dedicated user for Cortex LLM REST API requests.

To make API requests, use the public key to create a JSON Web token (JWT) and pass it in the headers of the request.

Setting up authorization

Once you have created a key pair and assigned its public key to a user, that user’s default role needs to have the snowflake.cortex_user database role, which contains the privileges to use the LLM functions. In most cases, users already have this privilege, because it is granted to the PUBLIC role automatically, and all roles inherit PUBLIC.

If your Snowflake administrator prefers to opt in individual users, he or she might have revoked snowflake.cortex_user from PUBLIC, and must grant this role to the users who should be able to use the Cortex LLM REST API as follows.

GRANT DATABASE ROLE snowflake.cortex_user TO ROLE MY_ROLE;
GRANT ROLE MY_ROLE TO USER MY_USER;
Copy

Important

REST API requests use the user’s default role, so that role must have the necessary privileges. You can change a user’s default role with ALTER USER … SET DEFAULT ROLE.

ALTER USER MY_USER SET DEFAULT_ROLE=MY_ROLE
Copy

Submitting requests

You make a request to the Cortex LLM REST API by POSTing to the API’s REST endpoint. The Authorization header must contain a JSON Web token generated from your public key, which you can do using snowsql via the following command. The generated JWT expires after one hour.

snowsql -a <account_identifier> -u <user> --private-key-path <path>/rsa_key.p8 --generate-jwt
Copy

The body of the request is a JSON object that specifies the model, the prompt or conversation history, and options. See the following API Reference for details.

API Reference

POST /api/v2/cortex/inference:complete

Completes a prompt or conversation using the specified large language model. The body of the request is a JSON object containing the arguments.

This endpoint corresponds to the COMPLETE SQL function.

Required headers

X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT

Defines the type of authorization token.

Authorization: Bearer jwt.

Authorization for the request. jwt is a valid JSON Web token.

Content-Type: application/json

Specifies that the body of the request is in JSON format.

Accept: application/json, text/event-stream

Specifies that the response will either contain JSON (error case) or server-sent events.

Required JSON arguments

Argument

Type

Description

model

string

The identifier of the model to be used (see Choosing a model). Must be one of the following values.

  • gemma-7b

  • jamba-1.5-mini

  • jamba-1.5-large

  • jamba-instruct

  • llama2-70b-chat

  • llama3-8b

  • llama3-70b

  • llama3.1-8b

  • llama3.1-70b

  • llama3.1-405b

  • llama3.2-1b

  • llama3.2-3b

  • mistral-large

  • mistral-large2

  • mistral-7b

  • mixtral-8x7b

  • reka-core

  • reka-flash

  • snowflake-arctic

messages

array

The prompt or conversation history to be used to generate a completion. An array of objects representing a conversation in chronological order. Each object must contain a content key and may also contain a role key.

  • content: A string containing a system message, a prompt from the user, or a previous response from the model.

  • role: A string indicating the role of the message, one of 'system', 'user', or 'assistant'.

See the COMPLETE roles table for a more detailed description of these roles.

For prompts consisting of a single user message, role may be omitted; it is then assumed to be user.

Optional JSON arguments

Argument

Type

Default

Description

top_p

number

1.0

A value from 0 to 1 (inclusive) that controls the diversity of the language model by restricting the set of possible tokens that the model outputs.

temperature

number

0.0

A value from 0 to 1 (inclusive) that controls the randomness of the output of the language model by influencing which possible token is chosen at each step.

max_tokens

integer

4096

The maximum number of tokens to output. Output is truncated after this number of tokens.

Note

You can set max_tokens to a number greater than 4,096, but not greater than the model limit. See Model restrictions for each model’s token limit.

Output

Tokens are sent as they are generated using server-sent events (SSEs). Each SSE event uses the message type and contains a JSON object with the following structure.

Key

Value type

Description

'id'

string

Unique ID of the request, the same value for all events sent in response to the request.

'created'

number

UNIX timestamp (seconds since midnight, January 1, 1970) when the response was generated.

'model'

string

Identifier of the model.

'choices'

array

The model’s responses. Each response is an object containing a 'delta' key whose value is an object, whose 'content' key contains the new tokens generated by the model. Currently, only one response is provided.

Status codes

The Snowflake Cortex LLM REST API uses the following HTTP status codes to indicate successful completion or various error conditions.

200 OK

Request completed successfully. The body of the response contains the output of the model.

400 invalid options object

The optional arguments have invalid values.

400 unknown model model_name

The specified model does not exist.

400 max tokens of count exceeded

The request exceeded the maximum number of tokens supported by the model (see Model restrictions).

400 all requests were throttled by remote service

The request has been throttled due to a high level of usage. Try again later.

402 budget exceeded

The model consumption budget was exceeded.

403 Not Authorized

Account not enabled for REST API, or the default role for the calling user does not have the snowflake.cortex_user database role.

429 too many requests

The request was rejected because the usage quota has been exceeded. Please try your request later.

503 inference timed out

The request took too long.

Example

The following example uses curl to make a COMPLETE request. Replace jwt, prompt, and account_identifier with the appropriate values in this command.

curl -X POST \
    -H 'X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT' \
    -H "Authorization: Bearer <jwt>" \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json, text/event-stream' \
    -d '{
    "model": "mistral-large",
    "messages": [
        {
            "content": "<prompt>"
        }
    ],
    "top_p": 0,
    "temperature": 0
    }' \
https://<account_identifier>.snowflakecomputing.com/api/v2/cortex/inference:complete
Copy

Output

data: {
data:  "id": "65c5e2ac-529b-461e-8a8c-f80655e6bd3f",
data:  "created": 1723493954,
data:  "model": "mistral-7b",
data:  "choices": [
data:    {
data:      "delta": {
data:        "content": "Cor"
data:        }
data:      }
data:     ],
data:  "usage": {
data:    "prompt_tokens": 57,
data:    "completion_tokens": 1,
data:    "total_tokens": 58
data:  }
data: }

data: {
data:  "id": "65c5e2ac-529b-461e-8a8c-f80655e6bd3f",
data:  "created": 1723493954,
data:  "model": "mistral-7b",
data:  "choices": [
data:    {
data:      "delta": {
data:        "content": "tex"
data:        }
data:      }
data:     ],
data:  "usage": {
data:    "prompt_tokens": 57,
data:    "completion_tokens": 2,
data:    "total_tokens": 59
data:  }
data: }

Python API

To install the Python API, use:

pip install snowflake-ml-python
Copy

The Python API is included in the snowflake-ml-python package starting with version 1.6.1.

Example

To use the Python API, first create a Snowflake session (see Creating a Session for Snowpark Python). Then call the Complete API. The REST back end is used only when stream=True is specified.

from snowflake.snowpark import Session
from snowflake.cortex import Complete

session = Session.builder.configs(...).create()

stream = Complete(
  "mistral-7b",
  "What are unique features of the Snowflake SQL dialect?",
  session=session,
  stream=True)

for update in stream:
  print(update)
Copy

Note

The streaming mode of the Python API currently doesn’t work in stored procedures and in Snowsight.