The Cortex PARSE_DOCUMENT function
The PARSE_DOCUMENT function is a Cortex AI task-specific function that extracts text or layout from documents stored in an internal or external stage. It combines Optical Character Recognition (OCR) with machine learning models to identify text content, information stored in tables, and the structural elements of PDF documents. You can use the extracted text and layout to build information retrieval systems over large archives of business documents, and to load the extracted information into structured Snowflake tables for use by your applications.
How PARSE_DOCUMENT works
The PARSE_DOCUMENT function offers two modes for processing PDF documents: OCR (the default) and LAYOUT.
OCR mode (default) is optimized for extracting text from text-heavy documents. It is the recommended option for quick, effective text extraction from documents that do not have a strong semantic structure.
LAYOUT mode (optional) is optimized for extracting both text and layout elements such as tables. It is the recommended option when you want to improve the context of a document knowledge base for information retrieval systems and Large Language Model (LLM) inference. For example, you can isolate text sections using LAYOUT elements for more targeted entity extraction tasks.
Using PARSE_DOCUMENT
The Cortex PARSE_DOCUMENT function is a SQL function. Because it is fully hosted and managed by Snowflake, using it requires no setup. Point the PARSE_DOCUMENT function to a stage that contains your PDF documents to extract text or layout data from them. The following example extracts the text and layout information from the file document_1.pdf on the documents stage in the parse_document database and demo schema.
Note
PARSE_DOCUMENT is currently incompatible with custom network policies.
SELECT
  SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    @parse_document.demo.documents,
    'document_1.pdf',
    {'mode': 'LAYOUT'}
  ) AS layout;
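If you generate these statements from application code, a small helper can keep the mode and stage name well-formed. The following is a minimal sketch, not part of any Snowflake client library: the function name build_parse_query, the stage-name pattern, and the decision to reject quotes and backslashes in the file path are all assumptions made for illustration. It only builds the SQL text in the same shape as the example above and does not connect to Snowflake.

```python
import re

# The two modes described in "How PARSE_DOCUMENT works".
VALID_MODES = {"OCR", "LAYOUT"}

# Conservative pattern (an assumption of this sketch) for an unquoted stage
# name with up to three parts, such as parse_document.demo.documents.
STAGE_PATTERN = re.compile(
    r"^[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){0,2}$"
)


def build_parse_query(stage: str, file_path: str, mode: str = "OCR") -> str:
    """Return a PARSE_DOCUMENT statement for one staged file (hypothetical helper)."""
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    if not STAGE_PATTERN.match(stage):
        raise ValueError(f"unexpected stage name: {stage!r}")
    # Reject characters that could break out of the SQL string literal.
    if "'" in file_path or "\\" in file_path:
        raise ValueError("file path must not contain quotes or backslashes")
    return (
        "SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT("
        f"@{stage}, '{file_path}', {{'mode': '{mode}'}}"
        ") AS parsed;"
    )


if __name__ == "__main__":
    print(build_parse_query("parse_document.demo.documents", "document_1.pdf", "layout"))
```

The sketch always passes the mode explicitly, even for OCR, so the generated statement documents which mode was intended.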
PARSE_DOCUMENT can process documents stored in an internal Snowflake stage or in an external stage. The stage must use server-side encryption. Otherwise, PARSE_DOCUMENT returns an error stating that the provided file isn’t in the expected format or is client-side encrypted. The following example creates an internal stage with server-side encryption and a directory table enabled:
CREATE STAGE input_stage
  DIRECTORY = ( ENABLE = true )
  ENCRYPTION = ( TYPE = 'SNOWFLAKE_SSE' );
Input requirements
The Cortex PARSE_DOCUMENT function is currently optimized for documents that were created digitally, not scanned from hardcopy. The following table lists the limitations and requirements for the input document:
Requirement | Value |
---|---|
Maximum file size | 100 MB |
Maximum pages per document | 300 pages |
Allowed file types | PDF, PPTX, DOCX |
Stage encryption | Server-side encryption |
Note
PARSE_DOCUMENT is not currently optimized for languages that use non-Latin characters, like Chinese, Japanese, and Thai. French, Portuguese, Italian, German, Spanish, Swedish, and Norwegian are supported in preview and are being further optimized.
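Checking files against these limits before you upload them can save a round trip to the stage. The following is a minimal sketch of such a pre-flight check, assuming you know each file's name and size in bytes. The helper name preflight_problems is hypothetical, and the sketch covers only the file size and file type rows of the table. Counting pages would need a document-parsing library and is left out.

```python
from pathlib import Path

# Limits copied from the input requirements table.
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024  # 100 MB, that is 104857600 bytes
ALLOWED_EXTENSIONS = {".pdf", ".pptx", ".docx"}


def preflight_problems(file_name: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file passes both checks."""
    problems = []
    extension = Path(file_name).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append(
            f"file is {size_bytes} bytes; the limit is {MAX_FILE_SIZE_BYTES}"
        )
    return problems


if __name__ == "__main__":
    for name, size in [("report.pdf", 2_000_000), ("notes.txt", 500)]:
        print(name, preflight_problems(name, size) or "ok")
```

For a local file, Path(path).stat().st_size gives the size in bytes to pass as the second argument.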
Key functionality
Function | Description |
---|---|
Page Orientation | PARSE_DOCUMENT automatically detects page orientation. |
Characters | PARSE_DOCUMENT detects the following characters: |
Languages | PARSE_DOCUMENT is optimized for English. It also supports French, Portuguese, Italian, German, Spanish, Swedish, and Norwegian in preview. |
Regional availability
Support for this feature is available to accounts in the following Snowflake regions:
AWS | Azure |
---|---|
US West 2 (Oregon) | East US 2 (Virginia) |
US East (Ohio) | West Europe (Netherlands) |
US East 1 (N. Virginia) | |
Europe (Ireland) | |
Europe Central 1 (Frankfurt) | |
Access control requirements
To use the PARSE_DOCUMENT function, a user with the ACCOUNTADMIN role must grant the SNOWFLAKE.CORTEX_USER database role to the user who will call the function. See the Required privileges topic for details.
Cost considerations
The Cortex PARSE_DOCUMENT function incurs compute costs based on the number of pages in each document that it processes.
Snowflake recommends executing queries that call the Cortex PARSE_DOCUMENT function in a smaller warehouse (no larger than MEDIUM). Larger warehouses do not increase performance.
Error conditions
Snowflake Cortex PARSE_DOCUMENT can produce the following error messages:
Message | Explanation |
---|---|
Provided file is not in expected format. Make sure the file is a PDF. | Returned when the document is not a valid PDF. |
Maximum number of 100 pages exceeded. | Returned when the PDF contains more than 100 pages. |
Maximum file size of 104857600 bytes exceeded. | Returned when the document is larger than 100 MB. |
Provided file cannot be found or accessed. | The file does not exist. |
Internal error. | A system error occurred. Wait and try again. |
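Only the last row of the table suggests waiting and trying again; the other errors describe the input and will recur until the file is fixed. The following is a minimal sketch of a retry wrapper built on that distinction. It assumes the error text from the table appears somewhere in the exception message raised by your client code, which this page does not confirm, and the names is_retryable and call_with_retry are hypothetical.

```python
import time

# Only "Internal error." is documented as worth retrying.
RETRYABLE_FRAGMENTS = ("Internal error",)


def is_retryable(error_message: str) -> bool:
    """Return True if the message matches an error the table says to retry."""
    return any(fragment in error_message for fragment in RETRYABLE_FRAGMENTS)


def call_with_retry(parse, attempts=3, base_delay_seconds=1.0, sleep=time.sleep):
    """Call parse(); retry only retryable errors, doubling the wait each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return parse()
        except Exception as error:
            is_last_attempt = attempt == attempts - 1
            if is_last_attempt or not is_retryable(str(error)):
                raise  # input errors and exhausted retries surface unchanged
            sleep(base_delay_seconds * 2 ** attempt)
```

Here parse is any zero-argument callable that runs your PARSE_DOCUMENT query. The sleep parameter exists so that the wait can be replaced in tests.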
Incorporating PARSE_DOCUMENT into RAG pipelines
Retrieval augmented generation (RAG) is a technique for retrieving data from a knowledge base to enhance the generated response of an LLM. The quality and context of the content extracted from documents is foundational to retrieval performance in a document search system. The PARSE_DOCUMENT LAYOUT mode extracts content while preserving the structural integrity of a document, so you can divide the text into concise, self-contained units. This, in turn, lets you implement semantic chunking instead of relying on arbitrary character splits, and run targeted Q&A and summarization.
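To make the contrast with arbitrary character splits concrete, the following is a minimal sketch of header-based chunking. It assumes the LAYOUT output has already been fetched into a Python string and that section titles appear as Markdown-style headers (lines that start with one to six # characters); this page does not specify the output format, so treat that as an assumption. The function name chunk_by_headers is hypothetical.

```python
import re

# Assumption: section titles look like Markdown headers, for example "## Late fees".
HEADER_PATTERN = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headers(layout_text: str) -> list[dict]:
    """Split text into one chunk per header, keeping the header as the chunk title."""
    chunks = []
    title = None
    body = []

    def flush():
        # Emit the section collected so far, skipping a completely empty preamble.
        text = "\n".join(body).strip()
        if title is not None or text:
            chunks.append({"title": title, "text": text})

    for line in layout_text.splitlines():
        match = HEADER_PATTERN.match(line)
        if match:
            flush()
            title = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "# Terms\nPayment is due in 30 days.\n## Late fees\n| Days | Fee |\n"
    for chunk in chunk_by_headers(sample):
        print(chunk)
```

Each chunk keeps its table rows together with the section title, which is the context a retriever loses when text is cut at a fixed character count. A line that starts with # inside a code sample would be treated as a header, so a production splitter would need to handle that case.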
Legal notices
The data classifications of inputs and outputs are as set forth in the following table.
Input data classification | Output data classification | Designation |
---|---|---|
Usage Data | Customer Data | Preview AI Features [1] |
For additional information, refer to Snowflake AI and ML.