Categories: String & binary functions (AI Functions)
AI_EXTRACT¶
Extracts information from an input string or file.
Syntax¶
Extract information from an input string:
Extract information from a file:
Arguments¶
text
An input string for extraction.
file
A FILE for extraction.
Supported file formats:
PDF
PNG
PPTX, PPT
EML
DOC, DOCX
JPEG, JPG
HTM, HTML
TEXT, TXT
TIF, TIFF
BMP, GIF, WEBP
MD
The files must be less than 100 MB in size.
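A pre-flight check can catch unsupported files before a call is made. The following Python sketch applies the format list and size limit stated above. The function name is_supported_file is hypothetical, matching on the file extension is an assumption (the service may inspect file contents instead), and 100 MB is assumed to mean 100 * 1024 * 1024 bytes.

```python
from pathlib import PurePosixPath

# Extensions taken from the supported-format list above (lowercased).
SUPPORTED_EXTENSIONS = {
    "pdf", "png", "pptx", "ppt", "eml", "doc", "docx", "jpeg", "jpg",
    "htm", "html", "text", "txt", "tif", "tiff", "bmp", "gif", "webp", "md",
}

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_FILE_BYTES = 100 * 1024 * 1024


def is_supported_file(path: str, size_bytes: int) -> bool:
    """Return True if the extension is listed and the size is under the limit."""
    extension = PurePosixPath(path).suffix.lstrip(".").lower()
    return extension in SUPPORTED_EXTENSIONS and size_bytes < MAX_FILE_BYTES
```

A check like this only filters obvious mismatches; the function itself remains the authority on what it accepts.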
responseFormat
Information to be extracted. The format depends on the type of extraction.
Entity extraction formats
Extract single values by providing one of the following formats:
Simple object schema that maps the label and information to be extracted:
An array of strings that contain the information to be extracted:
An array of arrays that contain two strings (label and the information to be extracted):
A JSON schema with 'type': 'string' on the sub-object:
List extraction format
Extract arrays of values using a JSON schema with 'type': 'array' on the sub-object:
Table extraction format
Extract tabular data using a JSON schema with 'type': 'object' and column_ordering. Each column is defined as a nested property with 'type': 'array' and a description that matches the column name in the file:
Note
You can’t combine the JSON schema format with other response formats. If responseFormat contains the schema key, you must define all questions within the JSON schema. Additional keys are not supported.
The model only accepts certain shapes of JSON schema. The top-level type must always be an object, which contains independently extracted sub-objects. A sub-object may be a table (an object of lists of strings representing columns), a list of strings, or a string.
String is currently the only supported scalar type.
Use the description field to provide context to the model; for example, to help the model locate the right table in a document. You can enter the column header name or describe the column in another way.
Use the column_ordering field to specify the order of all columns in the extracted table. The column_ordering field is case-sensitive and must match the column names defined in the properties field. The order should reflect the order of the columns in the document.
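The response format shapes described above can be sketched as plain Python structures that serialize to JSON. All labels, questions, and descriptions here are hypothetical, and the exact nesting of the JSON schema form (sub-objects under schema, then properties) is an assumption inferred from the description rather than a confirmed layout. The helper column_ordering_errors is also hypothetical; it checks the case-sensitive match between column_ordering and properties.

```python
import json

# Entity extraction, simple object form: label -> information to extract.
entity_object = {"name": "What is the last name of the employee?"}

# Entity extraction, array-of-strings form.
entity_strings = ["person", "location"]

# Entity extraction, array of [label, question] pairs.
entity_pairs = [["name", "What is the first name of the employee?"]]

# JSON schema form combining a string, a list, and a table sub-object.
# Assumption: sub-objects sit under schema -> properties.
schema_format = {
    "schema": {
        "type": "object",
        "properties": {
            "title": {"description": "Title of the document", "type": "string"},
            "employees": {"description": "Names of employees", "type": "array"},
            "income_table": {
                "description": "Income by month",
                "type": "object",
                "column_ordering": ["month", "income"],
                "properties": {
                    "month": {"description": "Month", "type": "array"},
                    "income": {"description": "Income", "type": "array"},
                },
            },
        },
    }
}


def column_ordering_errors(table: dict) -> list[str]:
    """List case-sensitive mismatches between column_ordering and properties."""
    ordering = table.get("column_ordering", [])
    columns = table.get("properties", {})
    errors = [f"'{name}' is not defined in properties"
              for name in ordering if name not in columns]
    errors += [f"'{name}' is missing from column_ordering"
               for name in columns if name not in ordering]
    return errors


table = schema_format["schema"]["properties"]["income_table"]
print(column_ordering_errors(table))        # prints []
print(json.dumps(schema_format, indent=2))  # JSON text for responseFormat
```

Validating the schema on the client side is optional, but it surfaces a mis-cased column name such as Month versus month before a call is made.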
scores => boolean
Optional. Supported only in the named-argument syntax shown above. A BOOLEAN that controls whether the function returns scores for extracted values. The default is FALSE. When TRUE, the JSON result includes a scoring object in addition to response. For the output format, SQL examples, and limitations, see Extraction scores (preview).
config => config_object
An OBJECT value that specifies the configuration settings. You can use an OBJECT constant to specify this object.
You can specify the following key-value pairs in this object:
Key
Description
scale_factor
A numeric value from 1.0 through 4.0. Scales pages of an input file before they are processed by the underlying model, which can enhance OCR quality and improve extraction results.
Use scale_factor if you receive unexpected or unclear responses in the following scenarios:
Documents with page sizes larger than A4
Documents containing small text, detailed visual elements, or dense layouts
Extracted text contains typos or character-level OCR errors
If omitted, AI_EXTRACT uses the default value ('scale_factor': 1.0).
Returns¶
A JSON object containing the extracted information. The structure of the response depends on the type of extraction.
Entity extraction¶
Returns a JSON object with key-value pairs for each extracted entity:
List extraction¶
Returns a JSON object with arrays of extracted values:
Table extraction¶
Returns a JSON object with column arrays representing the extracted table:
Combined extraction¶
When extracting entities, lists, and tables in a single call, the response contains all extraction types:
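Because a table is returned as one array per column, downstream code often needs to pivot it into rows. The following Python sketch assumes the extracted values sit under a response key (the key named in the scoring section of this page) and uses a hypothetical sample payload; the helper names kind and table_rows are illustrative, not part of any API.

```python
from itertools import zip_longest

# Hypothetical result; the field names and values are illustrative only.
result = {
    "response": {
        "title": "Financial report",
        "employees": ["Smith", "Johnson"],
        "income_table": {"month": ["Jan", "Feb"], "income": ["$120", "$150"]},
    }
}


def kind(value) -> str:
    """Classify a response field by its shape: entity, list, or table."""
    if isinstance(value, dict):
        return "table"
    if isinstance(value, list):
        return "list"
    return "entity"


def table_rows(columns: dict) -> list[dict]:
    """Pivot column arrays into row dicts, padding short columns with None."""
    names = list(columns)
    return [dict(zip(names, cells)) for cells in zip_longest(*columns.values())]


for name, value in result["response"].items():
    print(name, kind(value))
print(table_rows(result["response"]["income_table"]))
```

Padding with None is a design choice for columns of unequal length; stricter code could raise an error instead.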
Extraction scores (preview)¶
When you use AI_EXTRACT, you can request scores that indicate the model’s certainty about each extracted value. You can use these scores to set thresholds for business logic, such as flagging low-scoring extractions for human review.
A higher score indicates a higher likelihood that the extracted value is correct. You can compare scores for extracting a given entity across different documents to identify which values are more or less reliable, and use them to build deterministic processing logic such as thresholds, fallback mechanisms, and human-in-the-loop workflows.
How scores work¶
When you set the scores parameter to TRUE, AI_EXTRACT returns a scoring object alongside the standard
response object. The scoring object contains a score for each extracted field.
The scores parameter is optional, and it is set to FALSE by default. Use the optional scores argument in the
named-argument syntax shown in Arguments.
Scoring output format¶
When scores => TRUE, the returned JSON includes a scoring object:
Each field in scoring.scores corresponds to a field in response and contains a score value between 0 and 1.
For list extraction, the scoring object returns an aggregate score for the entire list:
For table extraction, the scoring object returns an aggregate score for the entire table:
Scoring usage notes¶
Requesting scores does not incur additional cost. For general information on AI_EXTRACT costs, see Cost considerations.
Per-element scores for individual list items and table cells are not available.
Scores are supported for fine-tuned models.
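The threshold-based review workflow described in this section can be sketched in Python as follows. The sample payload, the field names, and the 0.8 threshold are hypothetical. The exact nesting inside scoring.scores is not shown on this page, so the helper accepts either a bare number or an object with a score key; both shapes are assumptions.

```python
# Hypothetical result returned with scores => TRUE; names and values are illustrative.
result = {
    "response": {"invoice_number": "INV-001", "buyers": ["Smith", "Lee"]},
    "scoring": {"scores": {"invoice_number": 0.97, "buyers": 0.62}},
}


def score_of(entry) -> float:
    """Read a score whether it is a bare number or wrapped as {'score': n}."""
    # Assumption: the exact nesting is not shown above, so accept both shapes.
    return float(entry["score"]) if isinstance(entry, dict) else float(entry)


def fields_for_review(result: dict, threshold: float) -> list[str]:
    """Return the fields whose score falls below the threshold."""
    scores = result.get("scoring", {}).get("scores", {})
    return [name for name, entry in scores.items() if score_of(entry) < threshold]


print(fields_for_review(result, threshold=0.8))  # prints ['buyers']
```

For lists and tables the score is an aggregate, so a flagged list or table would need to be reviewed as a whole.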
Examples with extraction scores¶
The following example extracts information from a file and returns scores for each extracted field:
Result:
The following example extracts a list of buyer names and returns an aggregate score:
Result:
The following example extracts a table and returns an aggregate score:
Result:
Access control requirements¶
Users must use a role that has been granted the SNOWFLAKE.CORTEX_USER database role. For information about granting this privilege, see Cortex LLM privileges.
Usage notes¶
AI_EXTRACT is optimized for both digital-born and scanned documents.
You can’t use both the text and file parameters in the same function call.
You can either ask questions in natural language or describe information to be extracted (such as city, street, ZIP code); for example:
The following languages are supported:
Arabic
Bengali
Burmese
Cebuano
Chinese
Czech
Dutch
English
French
German
Hebrew
Hindi
Indonesian
Italian
Japanese
Khmer
Korean
Lao
Malay
Persian
Polish
Portuguese
Russian
Spanish
Tagalog
Thai
Turkish
Urdu
Vietnamese
The documents must be no more than 125 pages long.
In a single AI_EXTRACT call, you can ask a maximum of 100 questions for entity extraction, and a maximum of 10 questions for table extraction.
A table extraction question is equal to 10 entity extraction questions. For example, you can ask 4 table extraction questions and 60 entity extraction questions in a single AI_EXTRACT call.
The maximum output length for entity extraction is 512 tokens per question. For table extraction, the model returns answers that are a maximum of 4096 tokens.
Client-side encrypted stages are not supported.
Optional extraction scores are available in preview when you use named arguments and pass scores => TRUE. For details, see Extraction scores (preview).
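The question limits in the notes above amount to a simple budget: each table extraction question counts as 10 entity extraction questions against a total of 100, with at most 10 table questions. A minimal Python sketch of that arithmetic follows. The function name within_question_limits is hypothetical, and list extraction questions are left out because the notes above don't state how they are counted.

```python
MAX_ENTITY_EQUIVALENTS = 100   # limit for entity extraction questions per call
MAX_TABLE_QUESTIONS = 10       # limit for table extraction questions per call
TABLE_WEIGHT = 10              # one table question counts as 10 entity questions


def within_question_limits(entity_questions: int, table_questions: int) -> bool:
    """Check a planned call against the per-call question limits above."""
    if table_questions > MAX_TABLE_QUESTIONS:
        return False
    used = entity_questions + TABLE_WEIGHT * table_questions
    return used <= MAX_ENTITY_EQUIVALENTS


print(within_question_limits(60, 4))   # prints True
print(within_question_limits(61, 4))   # prints False
```

The first call reproduces the example from the notes: 4 table questions (40) plus 60 entity questions uses exactly the budget of 100.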
Cost considerations¶
The Cortex AI_EXTRACT function incurs compute cost based on the number of pages per document, input prompt tokens, and output tokens processed.
For paged file formats (PDF, DOCX, TIF, TIFF), each page is counted as 970 tokens.
For image file formats (JPEG, JPG, PNG), each individual image file is billed as a page and counted as 970 tokens.
Using the scale_factor parameter changes how many tokens are consumed and how many pages can be processed per call:
The number of input tokens consumed increases proportionally with scale_factor.
The maximum number of pages per document that can be processed by AI_EXTRACT is divided by scale_factor and rounded down.
Relationship of scale_factor to number of tokens and pages
| scale_factor value | Token count per page | Max. number of pages per document |
|---|---|---|
| 2 | 970 * 2 = 1940 tokens | 125/2 = 62.5 (rounded down to 62) |
| 2.5 | 970 * 2.5 = 2425 tokens | 125/2.5 = 50 |
| 4 | 970 * 4 = 3880 tokens | 125/4 = 31.25 (rounded down to 31) |
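The relationship in the table can be reproduced with two small formulas. The following Python sketch uses the 970-token page cost and the 125-page limit stated on this page; the function names are hypothetical, and the range check mirrors the documented bounds of 1.0 through 4.0.

```python
import math

TOKENS_PER_PAGE = 970   # tokens counted per page at scale_factor 1.0
BASE_MAX_PAGES = 125    # page limit per document at scale_factor 1.0


def _check(scale_factor: float) -> None:
    """Reject values outside the documented range."""
    if not 1.0 <= scale_factor <= 4.0:
        raise ValueError("scale_factor must be from 1.0 through 4.0")


def tokens_per_page(scale_factor: float) -> float:
    """Token count per page, which grows proportionally with scale_factor."""
    _check(scale_factor)
    return TOKENS_PER_PAGE * scale_factor


def max_pages(scale_factor: float) -> int:
    """Page limit per document, divided by scale_factor and rounded down."""
    _check(scale_factor)
    return math.floor(BASE_MAX_PAGES / scale_factor)


for factor in (2, 2.5, 4):
    print(factor, tokens_per_page(factor), max_pages(factor))
```

Multiplying tokens_per_page by a document's page count gives a rough estimate of the page-related input tokens only; prompt and output tokens are billed in addition.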
Snowflake recommends executing queries that call the Cortex AI_EXTRACT function in a smaller warehouse (no larger than MEDIUM). Larger warehouses don’t increase performance.
Regional availability¶
AI_EXTRACT is available to accounts in the following regions:
| Cloud platform | Region name |
|---|---|
| Amazon Web Services (AWS) | |
| Microsoft Azure | |
AI_EXTRACT has cross-region support. For information on enabling Cortex AI cross-region support, see Cross-region inference.
Error conditions¶
AI_EXTRACT can produce the following error messages:
| Message | Explanation |
|---|---|
| | A system error occurred. Wait and try again. If the error persists, contact Snowflake support. |
| | The file was not found. |
| | The file was not found. |
| | The current user does not have sufficient privileges to access the file. |
| | The document is not in a supported format. |
| | The document is not stored in a stage with server-side encryption. |
| | No parameters were provided. |
| | No response format was provided. |
| | The response format is not valid JSON. |
| | The response format contains one or more duplicate feature names. |
| | The number of questions exceeds the allowed limit. |
| | The document exceeds the 125-page limit. |
| | Image input or a converted document page is larger than the supported dimensions. |
| | Page is larger than the supported dimensions. |
| | The document is larger than 100 MB. |
Examples¶
Entity extraction¶
The following example extracts entities from the input text using a simple object schema:
The following example extracts and parses entities from the input text:
The following example extracts entities from the document.pdf file:
The following example extracts entities from all files in a directory on a stage:
Note
Ensure that the directory table is enabled. For more information, see Manage directory tables.
The following example extracts the title entity from the report.pdf file using a JSON schema:
List extraction¶
The following example extracts the employees list from the report.pdf file:
Table extraction¶
The following example extracts the income_table table from the report.pdf file:
Combined extraction¶
The following example extracts a table (income_table), entity (title), and list (employees) from the report.pdf
file in a single call:
Extraction with a custom scale factor¶
The following example extracts the employees array from the report.pdf file using a scale factor of 2.0:
Extraction using a fine-tuned arctic-extract model¶
To use the fine-tuned arctic-extract model for inference with the AI_EXTRACT function,
specify the model using the model parameter as shown in the following example:
You can overwrite questions used for fine-tuning by using the responseFormat parameter as shown in the following example:
The following example extracts data from the invoice.pdf file, using a fine-tuned arctic-extract model and a scale factor of 2.0:
For more information, see Fine-tuning arctic-extract models.
Legal notices¶
Refer to Snowflake AI and ML for legal notices.
