Categories: String & binary functions (AI Functions)
AI_EXTRACT¶
Extracts information from an input string or file.
Syntax¶
Extract information from an input string:
Extract information from a file:
Arguments¶
text
An input string for extraction.
file
A FILE for extraction.
Supported file formats:
PDF
PNG
PPTX, PPT
EML
DOC, DOCX
JPEG, JPG
HTM, HTML
TEXT, TXT
TIF, TIFF
BMP, GIF, WEBP
MD
The files must be less than 100 MB in size.
responseFormat
Information to be extracted. The format depends on the type of extraction.
Entity extraction formats
Extract single values by providing one of the following formats:
Simple object schema that maps the label and information to be extracted:
An array of strings that contain the information to be extracted:
An array of arrays that contain two strings (label and the information to be extracted):
A JSON schema with 'type': 'string' on the sub-object:
List extraction format
Extract arrays of values using a JSON schema with 'type': 'array' on the sub-object:
Table extraction format
Extract tabular data using a JSON schema with 'type': 'object' and column_ordering. Each column is defined as a nested property with 'type': 'array' and a description that matches the column name in the file:
Note
You can’t combine the JSON schema format with other response formats. If responseFormat contains the schema key, you must define all questions within the JSON schema. Additional keys are not supported.
The model only accepts certain shapes of JSON schema. The top-level type must always be an object, which contains independently extracted sub-objects. A sub-object may be a table (an object of lists of strings representing columns), a list of strings, or a string.
String is currently the only supported scalar type.
Use the description field to provide context to the model; for example, to help the model locate the right table in a document. You can enter the column header name or describe the column in another way.
Use the column_ordering field to specify the order of all columns in the extracted table. The column_ordering field is case-sensitive and must match the column names defined in the properties field. The order should reflect the order of the columns in the document.
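The schema rules above can be sketched in Python as a plain dictionary that is serialized to JSON before being passed as responseFormat. This is a minimal illustration, not an official client: the helper names (`entity`, `value_list`, `table`, `check_table`), the sample questions, and the exact nesting of `properties` under the `schema` key are assumptions inferred from the description above.

```python
import json

def entity(description):
    # Scalar sub-object; string is the only supported scalar type.
    return {"description": description, "type": "string"}

def value_list(description):
    # List sub-object: an array of strings.
    return {"description": description, "type": "array"}

def table(description, columns):
    # columns: ordered mapping of column name -> column description.
    return {
        "description": description,
        "type": "object",
        "column_ordering": list(columns),
        "properties": {
            name: {"description": desc, "type": "array"}
            for name, desc in columns.items()
        },
    }

def check_table(sub_object):
    # column_ordering is case-sensitive and must match the keys of properties.
    ordering = sub_object["column_ordering"]
    props = sub_object["properties"]
    if sorted(ordering) != sorted(props):
        raise ValueError("column_ordering must list exactly the columns in properties")

# Hypothetical labels and questions, for illustration only.
response_format = {
    "schema": {
        "type": "object",  # the top-level type must be an object
        "properties": {
            "title": entity("What is the title of the document?"),
            "employees": value_list("What are the names of the employees?"),
            "income_table": table("Monthly income", {"month": "Month", "income": "Income"}),
        },
    }
}

check_table(response_format["schema"]["properties"]["income_table"])
print(json.dumps(response_format, indent=2))
```

Deriving column_ordering from the same mapping that defines the columns keeps the two in sync, which avoids case mismatches between the ordering and the property names.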
Returns¶
A JSON object containing the extracted information. The structure of the response depends on the type of extraction.
Entity extraction¶
Returns a JSON object with key-value pairs for each extracted entity:
List extraction¶
Returns a JSON object with arrays of extracted values:
Table extraction¶
Returns a JSON object with column arrays representing the extracted table:
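Because a table comes back as one array per column, client code often needs to pivot it into rows. The following Python sketch shows one way to do that, assuming the table has already been parsed into a dictionary of equal-length lists; the function name `columns_to_rows` and the sample values are hypothetical.

```python
def columns_to_rows(table):
    """Turn {"col": [v1, v2, ...], ...} into a list of row dictionaries."""
    names = list(table)
    lengths = {len(table[name]) for name in names}
    if len(lengths) > 1:
        # Ragged columns cannot be paired up into rows reliably.
        raise ValueError("columns have different lengths")
    return [dict(zip(names, values)) for values in zip(*(table[n] for n in names))]

income_table = {  # hypothetical values
    "month": ["February", "March"],
    "income": ["$8,000", "$10,000"],
}

for row in columns_to_rows(income_table):
    print(row)
```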
Combined extraction¶
When extracting entities, lists, and tables in a single call, the response contains all extraction types:
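One way to handle a combined result on the client side is to group keys by the shape of their values: a string for an entity, a list for a list extraction, and an object of column arrays for a table. The Python sketch below assumes the extracted values are already available as a dictionary (any outer envelope around them is assumed to have been unwrapped); `classify_extracted` and the sample data are hypothetical.

```python
def classify_extracted(extracted):
    """Group the keys of an extracted-values object by the shape of each value."""
    groups = {"entities": {}, "lists": {}, "tables": {}}
    for key, value in extracted.items():
        if isinstance(value, dict):
            groups["tables"][key] = value    # object of column arrays
        elif isinstance(value, list):
            groups["lists"][key] = value     # array of extracted values
        else:
            groups["entities"][key] = value  # single extracted value
    return groups

extracted = {  # hypothetical values
    "title": "Financial report",
    "employees": ["Smith", "Johnson"],
    "income_table": {"month": ["February"], "income": ["$8,000"]},
}

groups = classify_extracted(extracted)
print(sorted(groups["entities"]), sorted(groups["lists"]), sorted(groups["tables"]))
```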
Access control requirements¶
Users must use a role that has been granted the SNOWFLAKE.CORTEX_USER database role. For information about granting this privilege, see Cortex LLM privileges.
Usage notes¶
AI_EXTRACT is optimized for both digital-born and scanned documents.
You can’t use both the text and file parameters in the same function call.
You can either ask questions in natural language or describe information to be extracted (such as city, street, ZIP code); for example:
The following languages are supported:
Arabic
Bengali
Burmese
Cebuano
Chinese
Czech
Dutch
English
French
German
Hebrew
Hindi
Indonesian
Italian
Japanese
Khmer
Korean
Lao
Malay
Persian
Polish
Portuguese
Russian
Spanish
Tagalog
Thai
Turkish
Urdu
Vietnamese
The documents must be no more than 125 pages long.
In a single AI_EXTRACT call, you can ask a maximum of 100 questions for entity extraction, and a maximum of 10 questions for table extraction.
A table extraction question counts as 10 entity extraction questions toward this limit. For example, you can ask 4 table extraction questions and 60 entity extraction questions in a single AI_EXTRACT call.
The maximum output length for entity extraction is 512 tokens per question. For table extraction, the model returns answers that are a maximum of 4096 tokens.
Client-side encrypted stages are not supported.
Confidence scores are not supported.
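The question limits in the notes above can be checked before a call is made. The following Python sketch treats the limits as a shared budget of 100 entity-equivalent questions, which is an interpretation of the 4-table, 60-entity example rather than a documented formula. The function and constant names are hypothetical, and list extraction questions are not modeled because the notes do not say how they count.

```python
MAX_ENTITY_QUESTIONS = 100   # maximum entity questions per call
MAX_TABLE_QUESTIONS = 10     # maximum table questions per call
TABLE_QUESTION_WEIGHT = 10   # one table question counts as 10 entity questions

def within_question_limits(entity_questions, table_questions):
    """Return True if the mix of questions fits in a single call (assumed budget model)."""
    if table_questions > MAX_TABLE_QUESTIONS:
        return False
    weighted = entity_questions + TABLE_QUESTION_WEIGHT * table_questions
    return weighted <= MAX_ENTITY_QUESTIONS

print(within_question_limits(60, 4))  # True: 60 + 4 * 10 = 100
print(within_question_limits(61, 4))  # False: 61 + 4 * 10 = 101
```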
Cost considerations¶
The Cortex AI_EXTRACT function incurs compute cost based on the number of pages per document, input prompt tokens, and output tokens processed.
For paged file formats (PDF, DOCX, TIF, TIFF), each page is counted as 970 tokens.
For image file formats (JPEG, JPG, PNG), each individual image file is billed as a page and counted as 970 tokens.
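As a rough worked example of the page-based part of the cost, the Python sketch below multiplies the page count by 970 tokens. It covers only the page component; input prompt tokens and output tokens are billed in addition and are not estimated here. The function name and the sample batch are hypothetical.

```python
TOKENS_PER_PAGE = 970  # each page, or each image file, is counted as 970 tokens

def page_tokens(pages_per_file):
    """Estimate the page-based token count for a batch of files."""
    return TOKENS_PER_PAGE * sum(pages_per_file)

# Hypothetical batch: a 12-page PDF, a 3-page DOCX, and one PNG image (1 page).
print(page_tokens([12, 3, 1]))  # 16 pages * 970 = 15520
```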
Snowflake recommends executing queries that call the Cortex AI_EXTRACT function in a smaller warehouse (no larger than MEDIUM). Larger warehouses do not increase performance.
Regional availability¶
AI_EXTRACT is available to accounts in the following regions:
| Cloud platform | Region name |
|---|---|
| Amazon Web Services (AWS) | |
| Microsoft Azure | |
AI_EXTRACT has cross-region support. For information on enabling Cortex AI cross-region support, see Cross-region inference.
Error conditions¶
AI_EXTRACT can produce the following error messages:
| Message | Explanation |
|---|---|
| | A system error occurred. Wait and try again. If the error persists, contact Snowflake support. |
| | The file was not found. |
| | The file was not found. |
| | The current user does not have sufficient privileges to access the file. |
| | The document is not in a supported format. |
| | The document is not stored in a stage with server-side encryption. |
| | No parameters were provided. |
| | No response format was provided. |
| | The response format is not valid JSON. |
| | The response format contains one or more duplicate feature names. |
| | The number of questions exceeds the allowed limit. |
| | The document exceeds the 125-page limit. |
| | Image input or a converted document page is larger than the supported dimensions. |
| | Page is larger than the supported dimensions. |
| | The document is larger than 100 MB. |
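Several of these conditions (unsupported format, file size, page count) can be caught locally before a file is staged. The Python sketch below is a hypothetical pre-flight check built from the limits stated on this page. It does not reproduce the function's actual error messages, and it assumes that 100 MB means 100 × 1024 × 1024 bytes.

```python
import os

SUPPORTED_EXTENSIONS = {
    "pdf", "png", "pptx", "ppt", "eml", "doc", "docx", "jpeg", "jpg",
    "htm", "html", "text", "txt", "tif", "tiff", "bmp", "gif", "webp", "md",
}
MAX_BYTES = 100 * 1024 * 1024  # assumption: "100 MB" interpreted as 100 MiB
MAX_PAGES = 125

def preflight(filename, size_bytes, page_count=None):
    """Return a list of problems found; an empty list means no local issue was detected."""
    problems = []
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported format")
    if size_bytes >= MAX_BYTES:  # files must be less than 100 MB
        problems.append("file too large")
    if page_count is not None and page_count > MAX_PAGES:
        problems.append("too many pages")
    return problems

print(preflight("report.pdf", 5_000_000, 10))
print(preflight("scan.TIFF", MAX_BYTES, 126))
```

A check like this only covers limits that are visible on the client; privileges, stage encryption, and page dimensions still have to be validated by the service.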
Examples¶
Entity extraction¶
The following example extracts entities from the input text using a simple object schema:
The following example extracts and parses entities from the input text:
The following example extracts entities from the document.pdf file:
The following example extracts entities from all files in a directory on a stage:
Note
Ensure that the directory table is enabled. For more information, see Manage directory tables.
The following example extracts the title entity from the report.pdf file using a JSON schema:
List extraction¶
The following example extracts the employees list from the report.pdf file:
Table extraction¶
The following example extracts the income_table table from the report.pdf file:
Combined extraction¶
The following example extracts a table (income_table), entity (title), and list (employees) from the report.pdf file in a single call:
Extraction using a fine-tuned arctic-extract model¶
To use the fine-tuned arctic-extract model for inference with the AI_EXTRACT function,
specify the model using the model parameter as shown in the following example:
You can override the questions used for fine-tuning by using the responseFormat parameter as shown in the following example:
For more information, see Fine-tuning arctic-extract models.
Legal notices¶
Refer to Snowflake AI and ML for legal notices.