Cortex REST API¶
The Cortex REST API gives you access to leading frontier models from Anthropic, OpenAI, Meta, Mistral, and more through your preferred endpoint or SDK. All inference runs within the Snowflake perimeter, so your data remains secure and within your governance boundary. See the quickstart below to get started.
Choose your API¶
Cortex REST API supports two industry-standard API specifications. Pick the one that best fits your stack:
|   | Chat Completions API | Messages API |
|---|---|---|
| Compatibility | OpenAI-compatible | Anthropic-compatible |
| Endpoint | `POST /api/v2/cortex/v1/chat/completions` | `POST /api/v2/cortex/v1/messages` |
| Supported models | All models (OpenAI, Claude, Llama, Mistral, DeepSeek, Snowflake) | Claude models only |
| SDK support | OpenAI Python and JavaScript SDKs | Anthropic Python SDK |
| Best for | Most use cases; multi-model flexibility | Existing Anthropic integrations; Anthropic API parity |
Both APIs share the same authentication, model catalog, and rate limits. The only difference is the request/response format and which models each endpoint supports. For pricing, see the Snowflake Service Consumption Table.
Quickstart¶
Prerequisites¶
Before you begin, you need:

- Your Snowflake account URL (for example, `https://<account-identifier>.snowflakecomputing.com`).
- A Snowflake Programmatic Access Token (PAT) for authentication. See Generating a programmatic access token.
- A model name to use in requests. See Model availability for available models.
Chat Completions quickstart¶
The Chat Completions API follows the OpenAI specification. You can use the OpenAI SDK directly.
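A minimal request sketch using only the Python standard library is shown below. The endpoint path is the one documented in the Chat Completions API reference later on this page; the account URL, PAT, and model name are placeholders you must replace.

```python
import json
import urllib.request

# Placeholders: substitute your own values.
ACCOUNT_URL = "https://<account-identifier>.snowflakecomputing.com"
PAT = "<SNOWFLAKE_PAT>"

def build_chat_request(model: str, messages: list, **options) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style Chat Completions request."""
    body = json.dumps({"model": model, "messages": messages, **options}).encode()
    return urllib.request.Request(
        url=f"{ACCOUNT_URL}/api/v2/cortex/v1/chat/completions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {PAT}",
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
    )

req = build_chat_request(
    "claude-sonnet-4-5",
    [{"role": "user", "content": "What is Snowflake Cortex?"}],
    max_completion_tokens=256,
)
# With valid credentials, send it:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, you can instead point the OpenAI Python SDK at `.../api/v2/cortex/v1` as its `base_url` and pass the PAT as the `api_key`.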
In the preceding examples, replace the following:
- `<account-identifier>`: Your Snowflake account identifier.
- `<SNOWFLAKE_PAT>`: Your Snowflake Programmatic Access Token (PAT).
- `model`: The model name. See Model availability for supported models.
Messages API quickstart¶
The Messages API follows the Anthropic specification and supports Claude models only.
The Anthropic SDK sends credentials via x-api-key by default, but Snowflake expects a Bearer token.
Use an httpx client to set the correct authorization header.
In the JavaScript SDK, likewise override the default auth header with a Bearer token via `defaultHeaders`.
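The same Bearer-token workaround can be sketched with only the standard library; `build_messages_request` is a helper defined here for illustration, not part of any SDK, and the account URL and PAT are placeholders.

```python
import json
import urllib.request

ACCOUNT_URL = "https://<account-identifier>.snowflakecomputing.com"
PAT = "<SNOWFLAKE_PAT>"

def build_messages_request(model: str, messages: list, max_tokens: int) -> urllib.request.Request:
    """Messages API request authenticated with a Bearer token, not x-api-key."""
    body = json.dumps(
        {"model": model, "max_tokens": max_tokens, "messages": messages}
    ).encode()
    return urllib.request.Request(
        url=f"{ACCOUNT_URL}/api/v2/cortex/v1/messages",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {PAT}",   # Snowflake expects Bearer auth
            "anthropic-version": "2023-06-01",  # required version header
            "Content-Type": "application/json",
        },
    )

req = build_messages_request(
    "claude-sonnet-4-5",
    [{"role": "user", "content": "Hello"}],
    max_tokens=256,
)
```

With the Anthropic Python SDK, the equivalent is passing an `httpx.Client` whose default headers carry the `Authorization: Bearer` value as the SDK's `http_client` argument.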
In the preceding examples, replace the following:
- `<account-identifier>`: Your Snowflake account identifier.
- `<SNOWFLAKE_PAT>`: Your Snowflake Programmatic Access Token (PAT).
- `model`: The Claude model name. See Model availability for supported models.
Setting up authentication¶
To authenticate to the Cortex REST API, you can use the methods described in Authenticating Snowflake REST APIs with Snowflake.
Set the Authorization header to include your token (for example, a JSON web token (JWT), OAuth token, or
programmatic access token).
Tip
Consider creating a dedicated user for Cortex REST API requests.
Model availability¶
The following tables show the models available in the Cortex REST API for each region:
Model
|
Cross Cloud
(Any Region)
|
AWS Global
(Cross-Region)
|
AWS US
(Cross-Region)
|
AWS EU
(Cross-Region)
|
AWS APJ
(Cross-Region)
|
Azure Global
(Cross-Region)
|
Azure US
(Cross-Region)
|
Azure EU
(Cross-Region)
|
|---|---|---|---|---|---|---|---|---|
|
✔ |
✔ |
✔ |
✔ |
||||
|
✔ |
✔ |
||||||
|
✔ |
✔ |
✔ |
✔ |
||||
|
✔ |
✔ |
||||||
|
✔ |
✔ |
✔ |
|||||
|
✔ |
✔ |
✔ |
✔ |
||||
|
✔ |
✔ |
||||||
|
✔ |
✔ |
||||||
|
✔ |
✔ |
||||||
|
✔ |
✔ |
||||||
|
* |
* |
* |
* |
||||
|
* |
* |
||||||
|
* |
* |
||||||
|
✔ |
|||||||
|
* |
|||||||
|
✔ |
✔ |
||||||
|
✔ |
✔ |
||||||
|
✔ |
✔ |
||||||
|
✔ |
✔ |
||||||
|
✔ |
✔ |
||||||
|
✔ |
✔ |
||||||
|
✔ |
✔ |
||||||
|
✔ |
✔ |
||||||
|
✔ |
✔ |
Model
|
AWS US West 2
(Oregon)
|
AWS US East 1
(N. Virginia)
|
Azure East US 2
(Virginia)
|
|---|---|---|---|
|
✔ |
✔ |
|
|
✔ |
||
|
✔ |
✔ |
✔ |
|
✔ |
✔ |
✔ |
|
✔ |
✔ |
✔ |
|
✔ |
||
|
✔ |
✔ |
✔ |
|
✔ |
✔ |
✔ |
|
✔ |
✔ |
✔ |
|
✔ |
Model
|
AWS Europe Central 1
(Frankfurt)
|
AWS Europe West 1
(Ireland)
|
Azure West Europe
(Netherlands)
|
|---|---|---|---|
|
✔ |
✔ |
|
|
✔ |
✔ |
✔ |
|
✔ |
✔ |
|
|
✔ |
✔ |
|
|
✔ |
✔ |
✔ |
Model
|
AWS AP Southeast 2
(Sydney)
|
AWS AP Northeast 1
(Tokyo)
|
|---|---|---|
|
✔ |
✔ |
|
✔ |
✔ |
|
✔ |
|
|
✔ |
|
|
✔ |
✔ |
* Indicates a preview function or model. Preview features are not suitable for production workloads.
You can also use any fine-tuned model in any supported region.
Features¶
Streaming¶
Both APIs support streaming responses using server-sent events.
Chat Completions streaming¶
Messages API streaming¶
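Both streaming variants deliver `data:`-framed events. The sketch below shows a minimal parser for that framing, using OpenAI-style chunk shapes with assumed example payloads; it is an illustration of server-sent events, not a full SSE client.

```python
import json

def iter_sse_json(lines):
    """Yield parsed JSON payloads from server-sent-event lines.

    Each event arrives as a line 'data: {...}'; an OpenAI-style stream
    ends with 'data: [DONE]'.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments and blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Canned example stream: a real response yields one chunk per event,
# each carrying a "choices[0].delta.content" text fragment.
events = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(e["choices"][0]["delta"]["content"] for e in iter_sse_json(events))
print(text)  # Hello
```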
Tool calling¶
Tool calling lets the model invoke external functions during a conversation. The flow works in steps:
You send a request with a list of available tools.
The model decides to call one or more tools and returns the tool name and arguments.
You execute the tool on your end.
You send the tool result back, and the model generates a final response.
Tool calling is supported for OpenAI and Claude models.
Chat Completions tool calling¶
Step 1 — Send the request with tools:
The model responds with a tool_calls array:
Step 2 — Execute the tool and send the result back:
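The two request bodies can be sketched as follows. The `get_weather` tool, its schema, and the canned assistant reply are illustrative assumptions, not part of the API.

```python
import json

# Step 1: a request advertising one (hypothetical) tool in OpenAI format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
request_1 = {
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": tools,
}

# The model replies with a tool_calls array, for example:
assistant_msg = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Tokyo"}'},
    }],
}

# Step 2: execute the tool yourself, then send the result back as a
# "tool" role message referencing the call id.
args = json.loads(assistant_msg["tool_calls"][0]["function"]["arguments"])
result = {"city": args["city"], "temp_c": 18}  # your tool's output
request_2 = {
    "model": "claude-sonnet-4-5",
    "messages": request_1["messages"] + [
        assistant_msg,
        {"role": "tool", "tool_call_id": "call_1", "content": json.dumps(result)},
    ],
    "tools": tools,
}
```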
Messages API tool calling¶
Step 1 — Send the request with tools:
The model responds with a tool_use content block:
Step 2 — Execute the tool and send the result back:
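The same flow in Anthropic format uses `tool_use` and `tool_result` content blocks; the tool definition and canned reply below are illustrative assumptions.

```python
import json

# Step 1: Anthropic-format tool definition (name/description/input_schema).
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]
request_1 = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": tools,
}

# The model replies with a tool_use content block, for example:
assistant_content = [
    {"type": "tool_use", "id": "toolu_1", "name": "get_weather",
     "input": {"city": "Tokyo"}},
]

# Step 2: return the tool result as a tool_result block in a user turn.
request_2 = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 256,
    "messages": request_1["messages"] + [
        {"role": "assistant", "content": assistant_content},
        {"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": "toolu_1",
             "content": json.dumps({"temp_c": 18})},
        ]},
    ],
    "tools": tools,
}
```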
Structured output¶
You can request structured JSON output that conforms to a specific schema. This is supported for OpenAI and Claude
models through the Chat Completions API. For the Messages API, use the tool_use pattern to enforce structured output.
Chat Completions structured output¶
Use the response_format field with a JSON schema to constrain the model’s output.
Note
Claude models support only json_schema as the response format type. OpenAI models support additional
response format types as documented in the OpenAI API reference.
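A `response_format` payload with a JSON schema can be sketched as below; the `city_weather` schema is an illustrative assumption.

```python
# Constrain the model's output to a JSON object matching a schema.
request = {
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "Weather in Tokyo as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "city_weather",  # hypothetical schema name
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "temp_c": {"type": "number"},
                },
                "required": ["city", "temp_c"],
            },
        },
    },
}
```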
Messages API structured output¶
The Messages API does not have a response_format field. Instead, define a tool with your desired output schema
and instruct the model to use it. The model’s tool_use response will contain structured JSON matching your schema.
Image input¶
You can include images in your requests for models that support vision. Images must be provided as base64-encoded strings. Images are limited to 20 per conversation with a 20 MiB max request size.
Image input is supported for:

- Claude models (`claude-3-7-sonnet` and newer)
- OpenAI models (`openai-gpt-4.1`, `openai-gpt-5`, `openai-gpt-5-chat`, `openai-gpt-5-mini`, `openai-gpt-5-nano`)
Chat Completions image input¶
Messages API image input¶
The Messages API uses a different image format — a source block with type, media_type, and data fields
instead of a data URL.
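The two image formats can be sketched side by side; a tiny embedded PNG stands in for a real image file.

```python
import base64

# A 1x1 PNG as placeholder image bytes.
png_bytes = base64.b64decode(
    "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJ"
    "AAAADUlEQVR42mP8z8BQDwAEhQGAhKmMIQAAAABJRU5ErkJggg=="
)
b64 = base64.b64encode(png_bytes).decode()

# Chat Completions format: base64 data URL inside an image_url part.
chat_content = [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url",
     "image_url": {"url": f"data:image/png;base64,{b64}"}},
]

# Messages API format: a source block with type/media_type/data fields.
messages_content = [
    {"type": "image",
     "source": {"type": "base64", "media_type": "image/png", "data": b64}},
    {"type": "text", "text": "Describe this image."},
]
```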
Prompt caching¶
Prompt caching lets you reuse previously processed context (such as large system prompts, documents, or conversation history) across requests, reducing latency and cost.
- OpenAI models: Caching is implicit. Prompts with 1,024+ tokens are automatically cached; no request changes are needed.
- Claude models: Caching is explicit. Add `cache_control` breakpoints to the content blocks you want cached. Only the `ephemeral` cache type is supported, with a 5-minute TTL and a maximum of 4 cache breakpoints per request.
Chat Completions prompt caching¶
For Claude models via Chat Completions, add cache_control to content blocks. OpenAI models are cached
automatically and do not require this field.
Messages API prompt caching¶
Use cache_control on system or user content blocks. Only the ephemeral cache type is supported,
with a 5-minute TTL. A maximum of 4 cache breakpoints can be set per request.
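A cached system block can be sketched as below; the contract-review prompt is a stand-in for real, large, stable context.

```python
# Mark a large, stable system prompt for caching.
request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 256,
    "system": [
        {
            "type": "text",
            "text": "You are a contract-review assistant. " + "<contract text> " * 100,
            "cache_control": {"type": "ephemeral"},  # only supported type; 5-minute TTL
        }
    ],
    "messages": [{"role": "user", "content": "Summarize clause 4."}],
}
```

Subsequent requests that repeat the same system block within the TTL read the cached prefix instead of reprocessing it.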
Note
Anthropic prompt caching has a 5-minute TTL. Cached content not accessed within 5 minutes is evicted.
OpenAI prompt caching is implicit and managed automatically — no cache_control fields needed.
Thinking and reasoning¶
Chat Completions thinking¶
For Claude models, use the reasoning object. For OpenAI reasoning models, use the reasoning_effort field
(values: minimal, low, medium, high).
Messages API thinking¶
Some Claude models support adaptive thinking, where the model adjusts how much reasoning it applies based on task complexity. The following models support adaptive thinking:

- `claude-opus-4-6`
For the Messages API, use the thinking parameter with type: "adaptive" to enable adaptive thinking. The output_config.effort parameter provides some high-level control over the thinking depth, and accepts the following values:
Effort level |
Behavior |
|---|---|
|
Always thinks with no constraints on thinking depth. Claude Opus 4.6 only. |
|
Always thinks. Provides deep reasoning on complex tasks. |
|
Moderate thinking. May skip thinking for very simple queries. |
|
Minimizes thinking. Skips thinking for simple tasks where speed matters most. |
The following examples demonstrate how to make a Messages API call with adaptive thinking enabled:
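A request body with adaptive thinking enabled can be sketched as follows. The field names come from the description above; the `"high"` effort value is an illustrative assumption.

```python
# Messages API body enabling adaptive thinking.
request = {
    "model": "claude-opus-4-6",
    "max_tokens": 2048,
    "thinking": {"type": "adaptive"},
    "output_config": {"effort": "high"},  # effort level assumed for illustration
    "messages": [
        {"role": "user", "content": "Plan a migration from Oracle to Snowflake."}
    ],
}
```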
The response includes thinking blocks with summarized thinking and thinking signatures. Pass these blocks back in multi-turn conversations to maintain reasoning context:
For a full description of the Messages API support for Adaptive Thinking, see Claude API Docs – Adaptive thinking.
Beta features (Messages API)¶
The Messages API supports Anthropic beta features via the anthropic-beta header. Pass one or more beta header
values as a comma-separated string.
Beta header value |
Feature |
|---|---|
|
Token-efficient tools |
|
Interleaved thinking |
|
Enables output tokens up to 128K |
|
Developer mode for raw thinking on Claude 4+ models |
|
1 million token context window |
|
Context management |
|
Effort parameter for thinking |
|
Tool search tool |
|
Tool use examples |
The following example enables the 1 million token context window with claude-sonnet-4-6:
You can combine multiple beta features by passing a comma-separated string:
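The header construction can be sketched as below. The beta value strings are assumptions based on Anthropic's published beta names, not values confirmed by this document; substitute the values for the features you need.

```python
# Single beta feature (assumed value for the 1M-token context window).
single = {"anthropic-beta": "context-1m-2025-08-07"}

# Multiple beta features, comma-separated in one header value
# (second value is likewise an assumed Anthropic beta name).
combined = {
    "anthropic-beta": "context-1m-2025-08-07,interleaved-thinking-2025-05-14"
}

betas = combined["anthropic-beta"].split(",")
```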
Chat Completions API reference¶
POST /api/v2/cortex/v1/chat/completions¶
Generates a chat completion using the specified model. The request and response format follows the OpenAI Chat Completions API specification.
Required headers¶
`Authorization: Bearer token`
  Authorization for the request. `token` is a JSON web token (JWT), OAuth token, or programmatic access token. For details, see Authenticating Snowflake REST APIs with Snowflake.

`Content-Type: application/json`
  Specifies that the body of the request is in JSON format.
Optional headers¶
`X-Snowflake-Authorization-Token-Type: type`
  Defines the type of authorization token. If you omit this header, Snowflake determines the token type by examining the token. You can set the header to one of the following values:

  - `KEYPAIR_JWT` (for key-pair authentication)
  - `OAUTH` (for OAuth)
  - `PROGRAMMATIC_ACCESS_TOKEN` (for programmatic access tokens)

`Accept: application/json, text/event-stream`
  Specifies that the response will contain either JSON (error case) or server-sent events.
Required JSON fields¶
Field |
Type |
Description |
|---|---|---|
|
string |
The model to use (see Model availability).
You may also use the fully-qualified name of any
fine-tuned model in the format
|
|
array |
An array of message objects representing the conversation. Each message must have a |
Commonly used optional JSON fields¶
Field |
Type |
Default |
Description |
|---|---|---|---|
|
integer |
4096 |
Maximum tokens in the response. Theoretical maximum is 131,072; each model has its own output limit. |
|
number |
Varies by model |
Controls randomness. Values from 0 to 2. |
|
number |
1.0 |
Controls diversity via nucleus sampling. |
|
boolean |
false |
Whether to stream back partial progress as server-sent events. |
|
array |
null |
A list of tools the model may call. Each tool must have |
|
string or object |
|
Controls how the model selects tools. Options: |
|
object |
null |
Constrains the output format. Use |
|
string |
null |
For OpenAI reasoning models. Values: |
|
object |
null |
For Claude models. Set |
See the detailed compatibility chart for the full list of supported fields per model family.
Status codes¶
- `200 OK`: Request completed successfully.
- `400 invalid options object`: The optional arguments have invalid values.
- `400 unknown model model_name`: The specified model does not exist.
- `400 schema validation failed`: The response schema structure is incorrect.
- `400 max tokens of count exceeded`: The request exceeded the maximum number of tokens supported by the model.
- `400 all requests were throttled by remote service`: The request has been throttled. Try again later.
- `402 budget exceeded`: The model consumption budget was exceeded.
- `403 Not Authorized`: Account not enabled for REST API, or the default role for the calling user does not have the `snowflake.cortex_user` database role.
- `429 too many requests`: The usage quota has been exceeded. Try again later.
- `503 inference timed out`: The request took too long.
Limitations¶
- If unset, `max_completion_tokens` defaults to 4096. Each model has its own output token limit.
- Tool calling is supported for OpenAI and Claude models only.
- Audio is not supported.
- Image understanding is supported for OpenAI and Claude models only. Images are limited to 20 per conversation with a 20 MiB max request size.
- Only Claude models support ephemeral cache control points for prompt caching. OpenAI models support implicit caching.
- Only Claude models support returning reasoning details in the response.
- `max_tokens` is deprecated. Use `max_completion_tokens` instead.
- Error messages are generated by Snowflake, not by the model provider.
Detailed compatibility chart¶
The following tables summarize which request and response fields are supported when using the Chat Completions API with different Snowflake-hosted model families.
Field |
OpenAI Models |
Claude Models |
Other Models |
|---|---|---|---|
|
✔ Supported |
✔ Supported |
✔ Supported |
|
See sub-fields |
See sub-fields |
See sub-fields |
|
❌ Error |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
✔ Only user/assistant/system |
✔ Only user/assistant/system |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
See sub-fields |
See sub-fields |
See sub-fields |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
✔ Supported |
✔ Supported |
❌ Error |
|
❌ Ignored |
✔ Supported (ephemeral only) |
❌ Ignored |
|
❌ Error |
❌ Error |
❌ Ignored |
|
❌ Error |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
❌ Ignored |
❌ Ignored |
|
✔ Supported (deprecated) |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
✔ Supported |
❌ Ignored |
|
✔ Supported |
✔ Only |
❌ Ignored |
|
❌ Ignored |
✔ OpenRouter format |
❌ Ignored |
|
❌ Error |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
❌ Ignored |
❌ Ignored |
|
❌ Error (deprecated) |
❌ Error (deprecated) |
❌ Error (deprecated) |
|
✔ Supported (4096 default, 131072 max) |
✔ Supported (4096 default, 131072 max) |
✔ Supported (4096 default, 131072 max) |
|
❌ Ignored |
❌ Ignored |
❌ Ignored |
|
❌ Ignored |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
❌ Ignored (use |
❌ Ignored |
|
See sub-fields |
See sub-fields |
See sub-fields |
|
✔ Supported (overrides |
✔ Converted to |
❌ Ignored |
|
❌ Ignored |
✔ Supported |
❌ Ignored |
|
✔ Supported |
✔ Only |
❌ Ignored |
|
❌ Ignored |
❌ Ignored |
❌ Ignored |
|
❌ Error |
❌ Error |
❌ Error |
|
✔ Supported |
❌ Ignored |
❌ Ignored |
|
❌ Error |
❌ Error |
❌ Error |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
See sub-fields |
See sub-fields |
See sub-fields |
|
❌ Ignored |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
✔ Supported |
✔ Only |
❌ Ignored |
|
✔ Supported |
✔ Only |
❌ Error |
|
✔ Supported |
❌ Ignored |
❌ Ignored |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
✔ Supported |
❌ Ignored |
❌ Ignored |
|
❌ Error |
❌ Ignored |
❌ Ignored |
Field |
OpenAI Models |
Claude Models |
Other Models |
|---|---|---|---|
|
✔ Supported |
✔ Supported |
✔ Supported |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
See sub-fields |
See sub-fields |
See sub-fields |
|
✔ Supported |
✔ Single choice only |
✔ Single choice only |
|
✔ Supported |
❌ Not supported |
✔ Only |
|
✔ Supported |
❌ Not supported |
❌ Not supported |
|
See sub-fields |
See sub-fields |
See sub-fields |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
✔ Supported |
❌ Not supported |
❌ Not supported |
|
❌ Not supported |
❌ Not supported |
❌ Not supported |
|
❌ Not supported |
❌ Not supported |
❌ Not supported |
|
✔ Supported |
❌ Not supported |
❌ Not supported |
|
✔ Supported |
✔ Only |
❌ Not supported |
|
❌ Not supported |
✔ OpenRouter format |
❌ Not supported |
|
See sub-fields |
See sub-fields |
See sub-fields |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
✔ Supported |
❌ Not supported |
❌ Not supported |
|
✔ Supported |
❌ Not supported |
❌ Not supported |
|
✔ Supported |
❌ Not supported |
❌ Not supported |
|
✔ Supported |
✔ Only |
❌ Not supported |
|
❌ Not supported |
✔ OpenRouter format |
❌ Not supported |
|
See sub-fields |
See sub-fields |
See sub-fields |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
✔ Supported |
✔ Supported |
✔ Supported |
|
See sub-fields |
See sub-fields |
See sub-fields |
|
❌ Not supported |
❌ Not supported |
❌ Not supported |
|
✔ Only cache reads |
✔ Cache read + write |
❌ Not supported |
|
See sub-fields |
See sub-fields |
See sub-fields |
|
✔ Supported |
❌ Not supported |
❌ Not supported |
|
❌ Not supported |
❌ Not supported |
❌ Not supported |
|
✔ Supported |
❌ Not supported |
❌ Not supported |
|
✔ Supported |
❌ Not supported |
❌ Not supported |
|
✔ Supported |
❌ Not supported |
❌ Not supported |
|
✔ Supported |
❌ Not supported |
❌ Not supported |
Header |
Support |
|---|---|
|
✔ Required |
|
✔ Supported ( |
|
✔ Supported ( |
Header |
Support |
|---|---|
|
❌ Not supported |
|
❌ Not supported |
|
❌ Not supported |
|
❌ Not supported |
|
❌ Not supported |
|
❌ Not supported |
|
❌ Not supported |
|
❌ Not supported |
|
❌ Not supported |
|
❌ Not supported |
Learn more¶
For additional usage examples, see the OpenAI Chat Completions API reference or the OpenAI Cookbook.
In addition to providing compatibility with the Chat Completions API, Snowflake supports OpenRouter-compatible features for Claude models. These features are exposed as extra fields on the request:
- For prompt caching, use the `cache_control` field. See the OpenRouter prompt caching documentation.
- For reasoning tokens, use the `reasoning` field. See the OpenRouter reasoning documentation.
Messages API reference¶
POST /api/v2/cortex/v1/messages¶
Generates a response using a Claude model. The request and response format follows the Anthropic Messages API specification.
Note
The Messages API supports Claude models only. For other models, use the Chat Completions API.
Required headers¶
`Authorization: Bearer token`
  Authorization for the request. `token` is a JSON web token (JWT), OAuth token, or programmatic access token. For details, see Authenticating Snowflake REST APIs with Snowflake.

`Content-Type: application/json`
  Specifies that the body of the request is in JSON format.

`anthropic-version: 2023-06-01`
  Required Anthropic API version header.
Optional headers¶
`X-Snowflake-Authorization-Token-Type: type`
  Defines the type of authorization token. If you omit this header, Snowflake determines the token type by examining the token. You can set the header to one of the following values:

  - `KEYPAIR_JWT` (for key-pair authentication)
  - `OAUTH` (for OAuth)
  - `PROGRAMMATIC_ACCESS_TOKEN` (for programmatic access tokens)

`anthropic-beta: feature`
  Enables beta features. Only Bedrock-compatible beta headers are supported.
Required JSON fields¶
Field |
Type |
Description |
|---|---|---|
|
string |
The Claude model to use (see Model availability). |
|
integer |
The maximum number of tokens to generate. |
|
array |
An array of message objects. Each message has a |
Supported features¶
The Messages API supports the standard Anthropic Messages API feature set for Claude models, including:
- Text generation and multi-turn conversations
- Streaming (`"stream": true`)
- System messages (via the top-level `system` field)
- Tool calling (Anthropic format with `name`, `description`, `input_schema`)
- Image input (base64 source blocks)
- Prompt caching (`cache_control` on content blocks)
- Extended thinking (`thinking` parameter with `budget_tokens`)
For full request and response schema details, see the Anthropic Messages API documentation.
Limitations¶
- Claude models only. OpenAI, Llama, Mistral, and other models are not available through this endpoint.
- No flex processing or priority tier. The `service_tier` field is not supported.
- Bedrock beta headers only. Only Bedrock-compatible `anthropic-beta` header values are supported.
- Error messages are generated by Snowflake, not by Anthropic.
Status codes¶
- `200 OK`: Request completed successfully.
- `400 invalid_request_error`: The request body is malformed or contains invalid values.
- `400 unknown model model_name`: The specified model does not exist or is not a Claude model.
- `402 budget exceeded`: The model consumption budget was exceeded.
- `403 Not Authorized`: Account not enabled for REST API, or the default role does not have the `snowflake.cortex_user` database role.
- `429 too many requests`: The usage quota has been exceeded. Try again later.
- `503 inference timed out`: The request took too long.
Rate limits¶
To ensure high performance for all Snowflake customers, Cortex REST API requests are subject to rate limits. Requests exceeding the limits may receive an HTTP 429 response. Snowflake may occasionally adjust these limits.
The default limits in the following tables are applied per account and independently for each model. Ensure your application handles 429 responses gracefully by retrying with exponential backoff.
If you need to increase the limits, contact Snowflake Support.
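The recommended retry behavior can be sketched as a small helper; `RateLimitError` is a hypothetical exception your request function raises on a 429 response.

```python
import random
import time

class RateLimitError(Exception):
    """Raised by the caller's request function on an HTTP 429 response."""

def with_backoff(call, max_attempts=5, base_delay=1.0):
    """Retry `call` on throttling, with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # delay doubles each attempt, with 50-100% jitter
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random() / 2))

# Demo with a stub that succeeds on the third try.
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RateLimitError()
    return "ok"

print(with_backoff(flaky, base_delay=0.01))  # ok
```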
| Model | Tokens Processed per Minute (TPM) | Requests per Minute (RPM) | Max output (tokens) |
|---|---|---|---|
| claude-3-5-sonnet | 300,000 | 300 | 16,384 |
| claude-3-7-sonnet | 300,000 | 300 | 16,384 |
| claude-sonnet-4-5 | 600,000 | 600 | 16,384 |
| claude-haiku-4-5 | 600,000 | 600 | 16,384 |
| claude-4-sonnet | 300,000 | 300 | 16,384 |
| claude-4-opus | 75,000 | 75 | 16,384 |
| deepseek-r1 | 100,000 | 100 | 16,384 |
| llama3.1-8b | 400,000 | 400 | 16,384 |
| llama3.1-70b | 200,000 | 200 | 16,384 |
| llama3.1-405b | 100,000 | 100 | 16,384 |
| mistral-7b | 400,000 | 400 | 16,384 |
| mistral-large2 | 200,000 | 200 | 16,384 |
| openai-gpt-4.1 | 300,000 | 300 | 16,384 |
| openai-gpt-5 | 300,000 | 300 | 16,384 |
| openai-gpt-5-chat | 300,000 | 300 | 16,384 |
| openai-gpt-5-mini | 1,000,000 | 1,000 | 16,384 |
| openai-gpt-5-nano | 5,000,000 | 5,000 | 16,384 |
Increase rate limits with cross-region inference¶
If you set up cross-region inference in your Snowflake Account, the rate limits are higher for the following models:
| Model | Tokens Processed per Minute (TPM) | Requests per Minute (RPM) | Max output (tokens) |
|---|---|---|---|
| claude-3-7-sonnet | 600,000 | 600 | 16,384 |
| claude-haiku-4-5 | 600,000 | 600 | 16,384 |
| claude-sonnet-4-5 | 600,000 | 600 | 16,384 |
| claude-4-sonnet | 1,200,000 | 1,200 | 16,384 |
| claude-4-opus | 150,000 | 150 | 16,384 |
| llama3.1-8b | 800,000 | 400 | 16,384 |
| llama3.1-70b | 400,000 | 200 | 16,384 |
| llama3.1-405b | 200,000 | 100 | 16,384 |
Troubleshooting rate limit events¶
Exceeding either the TPM or the RPM limit results in a 429 response code. If your REST API usage is below the requests-per-minute limit but you still receive a 429 response, check your token usage rate.

Cortex REST API implements rate limits using the sliding window counter pattern. The counters are stored in a highly available Redis cluster accessible only to Snowflake Cortex within Snowflake's private network.

The sliding-window counter assumes that client traffic in the previous time window was uniformly distributed. When traffic is spiky, this assumption can overestimate the request rate, but the estimate recovers quickly because the window is short. Contact Snowflake Support if you are affected by this overestimation and want to increase the limits.
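The sliding-window estimate described above can be written out as a one-line formula, which also shows where the overestimation comes from; the window size and counts here are illustrative.

```python
def sliding_window_estimate(prev_count, curr_count, elapsed_fraction):
    """Estimate the rolling per-window request rate.

    `elapsed_fraction` is how far we are into the current window (0..1).
    The previous window's count is weighted by the fraction of it that
    still overlaps the rolling window, assuming uniform traffic.
    """
    return prev_count * (1 - elapsed_fraction) + curr_count

# Spiky traffic: 300 requests in the first seconds of the previous
# minute. Halfway into the current minute the true trailing-60s count
# is 0, but the uniformity assumption still counts half of them.
print(sliding_window_estimate(300, 0, 0.5))  # 150.0
```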
Known issues¶
Session token expiration¶
We recommend authenticating with one of the three methods defined in Authenticating Snowflake REST APIs with Snowflake. However, if you choose to authenticate with a Snowflake session token, you must handle token refresh to ensure uninterrupted API access.
Session tokens expire periodically. If a request is executed with an expired session token, the REST API returns a 200 OK response that includes error code 390112. When this occurs, the operation is not performed.
To handle this behavior, your application should:
- Check each API response for error code `390112`, even when the HTTP status code is `200 OK`.
- When error code `390112` is detected, refresh the session token and retry the request.
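The steps above can be sketched as a small wrapper. The `code` field name in the response body is an assumption for illustration; check your actual error payload shape.

```python
def is_expired_session(response_json):
    """True if a 200 OK body actually signals an expired session token."""
    # Field name "code" is assumed for illustration.
    return str(response_json.get("code")) == "390112"

def call_with_session_refresh(send_request, refresh_token):
    """Send once; on error 390112, refresh the session token and retry."""
    resp = send_request()
    if is_expired_session(resp):
        refresh_token()
        resp = send_request()
    return resp
```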
Note
This behavior only affects applications using Snowflake session tokens. If you authenticate using key pair authentication, OAuth, or programmatic access tokens (PATs), you do not need to implement this error handling.
Cost considerations¶
Snowflake Cortex REST API requests incur compute costs based on the number of tokens processed. Refer to the Snowflake Service Consumption Table for each model’s cost in dollars per million tokens.
A token is the smallest unit of text processed by Snowflake Cortex LLM functions, approximately equal to four characters of text. The equivalence of raw input or output text to tokens can vary by model.
Both input and output tokens incur compute cost. If you use the API to provide a conversational or chat user experience, all previous prompts and responses are processed to generate each new response, with corresponding costs.
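The four-characters-per-token heuristic above gives a quick back-of-the-envelope estimate; the price used here is a placeholder, so look up the real per-model rate in the Snowflake Service Consumption Table.

```python
def estimate_cost(input_chars, output_chars, price_per_million_tokens):
    """Rough cost estimate using the ~4-characters-per-token heuristic.

    Both input and output tokens are billed, so both are counted.
    """
    tokens = (input_chars + output_chars) / 4
    return tokens / 1_000_000 * price_per_million_tokens

# 2,000 input chars + 6,000 output chars is roughly 2,000 tokens.
# With a placeholder price of $3.00 per million tokens:
print(estimate_cost(2_000, 6_000, 3.00))  # 0.006
```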