The Cortex PARSE_DOCUMENT function

The PARSE_DOCUMENT function is a Cortex AI task-specific function that extracts text or layout from documents stored in an internal or external stage. It combines Optical Character Recognition (OCR) with machine learning models to identify text content, information stored in tables, and the structural elements of PDF documents. You can use the extracted text and layout to build information retrieval systems over large archives of business documents, or load it into structured Snowflake tables for use by your applications.

How PARSE_DOCUMENT works

The PARSE_DOCUMENT function offers two modes for processing PDF documents: OCR (default) and LAYOUT.

  • PARSE_DOCUMENT OCR (default) mode is optimized for text extraction from text-heavy documents. This is the recommended option for quick, easy, and effective text extraction from documents that do not have a strong semantic structure.

  • PARSE_DOCUMENT LAYOUT (optional) mode is optimized for extracting text and layout elements like tables. This is the recommended option when you want to improve the context of a document knowledge base for information retrieval systems and Large Language Model (LLM) inference. For example, you can isolate text sections using LAYOUT elements for more targeted entity extraction tasks.
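As a minimal sketch of the two modes described above, the calls differ only in the options argument. The stage and file name are borrowed from the example later on this page. Passing 'OCR' explicitly is assumed here to behave the same as omitting the option, since OCR is the default.

```sql
-- OCR mode (default): plain text extraction for text-heavy documents.
SELECT
  SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    @parse_document.demo.documents,
    'document_1.pdf',
    {'mode': 'OCR'}
  ) AS ocr_output;

-- LAYOUT mode: text plus structural elements such as tables.
SELECT
  SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    @parse_document.demo.documents,
    'document_1.pdf',
    {'mode': 'LAYOUT'}
  ) AS layout_output;
```

Running both against the same file is a quick way to judge whether the extra structure from LAYOUT mode is worth it for a given document set.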

Using PARSE_DOCUMENT

The Cortex PARSE_DOCUMENT function is a SQL function. Because it is fully hosted and managed by Snowflake, using it requires no setup. This means you can point the PARSE_DOCUMENT function to a stage where PDF documents are stored to extract text or layout data. The following example extracts the text and layout information from the file document_1.pdf on the documents stage in the parse_document database and demo schema.

Note

PARSE_DOCUMENT is currently incompatible with custom network policies.

SELECT
  SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    @parse_document.demo.documents,
    'document_1.pdf',
    {'mode': 'LAYOUT'}
  ) AS layout;

PARSE_DOCUMENT can process documents stored in an internal Snowflake stage or an external stage. The stage must be created with server-side encryption. Otherwise, PARSE_DOCUMENT returns an error stating that the provided file isn’t in the expected format or is client-side encrypted.

CREATE STAGE input_stage
    DIRECTORY = ( ENABLE = true )
    ENCRYPTION = ( TYPE = 'SNOWFLAKE_SSE' );
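Because the stage above has a directory table enabled, one possible way to parse every staged PDF and keep the results is to drive PARSE_DOCUMENT from the directory table. This is a sketch under assumptions: the table name parsed_documents is hypothetical, and it assumes that the file path argument accepts a column value such as relative_path.

```sql
-- Refresh the directory table so newly uploaded files are visible.
ALTER STAGE input_stage REFRESH;

-- Hypothetical target table: one row per staged PDF with its parsed output.
CREATE OR REPLACE TABLE parsed_documents AS
SELECT
  relative_path,
  SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    @input_stage,
    relative_path,          -- assumption: a column can supply the file path
    {'mode': 'LAYOUT'}
  ) AS parsed
FROM DIRECTORY(@input_stage)
WHERE relative_path ILIKE '%.pdf';
```

Storing the raw output next to the file path keeps the extraction step separate from any later chunking or entity extraction.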

Input requirements

The Cortex PARSE_DOCUMENT function is currently optimized for documents that were created digitally, not scanned from hardcopy. The input document must meet the following requirements:

  • Maximum file size: 100 MB

  • Maximum pages per document: 300 pages

  • Allowed file types: PDF, PPTX, DOCX

  • Stage encryption: Server-side encryption
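Some of these requirements can be checked before calling the function. The query below is a sketch that assumes the input_stage stage created earlier, with its directory table enabled. It lists staged files that exceed the size limit or have an unexpected extension. The byte value 104857600 is the limit quoted in the error message for oversized files. Page counts cannot be checked this way.

```sql
-- List staged files that would likely be rejected by PARSE_DOCUMENT.
SELECT
  relative_path,
  size
FROM DIRECTORY(@input_stage)
WHERE size > 104857600                        -- larger than 100 MB
   OR NOT (
        relative_path ILIKE '%.pdf'
     OR relative_path ILIKE '%.pptx'
     OR relative_path ILIKE '%.docx'
   );
```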

Note

PARSE_DOCUMENT is not currently optimized for languages that use non-Latin characters, like Chinese, Japanese, and Thai. French, Portuguese, Italian, German, Spanish, Swedish, and Norwegian are supported in preview and are being further optimized.

Key functionality

Page orientation: PARSE_DOCUMENT automatically detects page orientation.

Characters: PARSE_DOCUMENT detects the following characters:

  • a-z

  • A-Z

  • 0-9

  • À Á Â Ä Å Ç È É Ê Ë Ì Í Î Ï Ò Ó Ô Õ Ö Ú Ü Ý ß à á â ã ä å æ ç è é ê ë ì í î ï ñ ò ó ô õ ö ø ù ú û ü ý ą Ć ć Č č Đ đ ę ı Ł ł ō Œ œ Š š Ÿ Ž ž ʒ β δ ε з Ṡ

  • # $ % & ‘ ( ) * + , - . / : ; < = > ? @ [ ] _ ` { } ¡ £ § ª « ° ¹ ² ³ ´ µ · º » ¿ ‘ € ™

Languages: PARSE_DOCUMENT is optimized for English. It also supports French, Portuguese, Italian, German, Spanish, Swedish, and Norwegian in preview.

Regional availability

Support for this feature is available to accounts in the following Snowflake regions:

AWS

  • US West 2 (Oregon)

  • US East (Ohio)

  • US East 1 (N. Virginia)

  • Europe (Ireland)

  • Europe Central 1 (Frankfurt)

Azure

  • East US 2 (Virginia)

  • West Europe (Netherlands)

Access control requirements

To use the PARSE_DOCUMENT function, a user with the ACCOUNTADMIN role must grant the SNOWFLAKE.CORTEX_USER database role to the user who will call the function. See the Required privileges topic for details.
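In practice a database role is granted to an account role, which is then granted to the user. The following sketch shows one way to do that. The names cortex_user_role and some_user are hypothetical placeholders.

```sql
USE ROLE ACCOUNTADMIN;

-- Hypothetical account role that carries the Cortex privilege.
CREATE ROLE IF NOT EXISTS cortex_user_role;

GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE cortex_user_role;

-- Hypothetical user who will call PARSE_DOCUMENT.
GRANT ROLE cortex_user_role TO USER some_user;
```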

Cost considerations

The Cortex PARSE_DOCUMENT function incurs compute costs based on the number of pages counted in each document.

Even though there are no costs for compute during the preview, you must choose a warehouse to execute the query that calls the function. Snowflake recommends executing queries that call the Cortex PARSE_DOCUMENT function with a smaller warehouse (no larger than MEDIUM) because larger warehouses do not increase performance.
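Following the sizing recommendation above, a small dedicated warehouse could be set up as sketched below. The warehouse name parse_document_wh and the auto-suspend interval are illustrative assumptions, not recommendations from this page.

```sql
-- Hypothetical small warehouse for queries that call PARSE_DOCUMENT.
CREATE WAREHOUSE IF NOT EXISTS parse_document_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60        -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE;

USE WAREHOUSE parse_document_wh;
```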

Error conditions

Snowflake Cortex PARSE_DOCUMENT can produce the following error messages:

Message: Provided file is not in expected format. Make sure the file is a PDF.
Explanation: Returned when the document is not a valid PDF.

Message: Maximum number of 100 pages exceeded.
Explanation: Returned when the PDF contains more than 100 pages.

Message: Maximum file size of 104857600 bytes exceeded.
Explanation: Returned when the document is larger than 100 MB.

Message: Provided file cannot be found or accessed.
Explanation: Returned when the file does not exist.

Message: Internal error.
Explanation: A system error occurred. Wait and try again.

Incorporating PARSE_DOCUMENT into RAG pipelines

Retrieval augmented generation (RAG) is a technique for retrieving data from a knowledge base to enhance the generated response of an LLM. The quality and context of the content extracted from documents is foundational to retrieval performance in a document search system. The PARSE_DOCUMENT LAYOUT mode extracts content while preserving the structure of a document, which makes it easier to divide text into concise, self-contained units. This, in turn, lets you implement semantic chunking instead of relying on arbitrary character splits, and run targeted Q&A and summarization.
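The following is a sketch of structure-aware chunking, not a definitive pipeline. It assumes that the LAYOUT output exposes the extracted text as Markdown in a content field, and it uses the Cortex SPLIT_TEXT_RECURSIVE_CHARACTER function in 'markdown' mode to split it. The chunk size of 1500 characters and the overlap of 200 are arbitrary illustrative values.

```sql
WITH parsed AS (
  SELECT
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
      @parse_document.demo.documents,
      'document_1.pdf',
      {'mode': 'LAYOUT'}
    ):content::STRING AS layout_text   -- assumption: text lives in "content"
)
SELECT
  c.index         AS chunk_index,
  c.value::STRING AS chunk_text
FROM parsed,
  LATERAL FLATTEN(
    input => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
      layout_text,
      'markdown',   -- split on Markdown structure before falling back to size
      1500,         -- illustrative chunk size in characters
      200           -- illustrative overlap between chunks
    )
  ) AS c;
```

Each resulting row is one chunk that can be embedded or indexed for retrieval.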