Cortex PARSE_DOCUMENT

The PARSE_DOCUMENT function is a Cortex AI AISQL function that extracts text, data, and layout elements from documents. You can use PARSE_DOCUMENT to extract text from various documents and forms to implement the following:

RAG pipelines powering Cortex Search
LLM processing like document summarization or translation using Cortex AI Functions
Zero-shot entity extraction using Cortex AI Structured Outputs

How PARSE_DOCUMENT works

The PARSE_DOCUMENT function offers two modes for processing PDF documents:

OCR mode is the recommended option for quick, high-quality text extraction from documents such as manuals, agreement contracts, product detail pages, insurance policies and claims, and SharePoint documents.
LAYOUT mode is optimized for extracting text and layout elements like tables. This is the recommended option for improving the context of a document knowledge base, both for information retrieval systems and for Large Language Model (LLM) inference.

Note

PARSE_DOCUMENT layout mode is currently in preview. OCR mode is Generally Available in most regions.

Additionally, for PDF, PowerPoint, and Word documents, PARSE_DOCUMENT can split multi-page documents into separate pages, which are processed separately. This is useful for processing large documents or extracting information from specific pages.

Using PARSE_DOCUMENT

Cortex PARSE_DOCUMENT is a SQL function. Because it is fully hosted and managed by Snowflake, using it requires no setup. Point the PARSE_DOCUMENT function to a stage containing your PDF documents to extract data from them. The following example extracts the text and layout information from the file document_1.pdf on the documents stage in the parse_document database and demo schema.

Note

PARSE_DOCUMENT is currently incompatible with custom network policies.

SELECT
  SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    @parse_document.demo.documents,
    'document_1.pdf',
    {'mode': 'LAYOUT'}
  ) AS layout;


Create a stage for document processing

Create an internal or external stage to store the documents you want to process. When creating your stage, enable server-side encryption. Otherwise, PARSE_DOCUMENT returns an error that the provided file isn’t in the expected format or is client-side encrypted.

The SQL below creates a suitable internal stage.

CREATE OR REPLACE STAGE input_stage
    DIRECTORY = ( ENABLE = true )
    ENCRYPTION = ( TYPE = 'SNOWFLAKE_SSE' );


The following SQL creates an external stage on Amazon S3. Azure and GCP external stages are also supported.

CREATE OR REPLACE STAGE input_stage
    URL='s3://<s3-path>/'
    CREDENTIALS=(AWS_KEY_ID=<aws_key_id>
    AWS_SECRET_KEY=<aws_secret_key>)
    ENCRYPTION=( TYPE = 'AWS_SSE_S3' );


Note

Processing files from stages is currently incompatible with custom network policies.

Tip

If you run into the error message “Expiry in seconds for AWS role is invalid” from an external stage, make sure the presigned URL expiry time is set accurately. The default value of this account parameter is optimized for internal stages, but it can be adjusted to work with external stages. To change it, contact Snowflake support.

Example

This example uses PARSE_DOCUMENT’s OCR mode to extract text from the first page of a weather insurance document.

To parse this document, you would upload it to a stage named document_stage and run the following query:

SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    @document_stage,
    'weather_policy.pdf'
  ) AS weather_policy_doc;


The raw response from PARSE_DOCUMENT is as follows:

{
  "content": "SOME INSURANCE COMPANY\nWEATHER PROTECTION INSURANCE POLICY\nPolicy Number: WP-2025-789456\nEffective Date: April 1, 2025\nExpiration Date: April 1, 2026\nNAMED INSURED AND PROPERTY:\nJohn and Mary Homeowner\n123 Shelter Lane\nWeatherton, ST 12345\n

SECTION I - DEFINITIONS\nThroughout this policy, \"you\" and \"your\" refer to the Named Insured shown in the Declarations and the spouse if a\nresident of the same household. \"We,\" \"us,\" and \"our\" refer to Evergreen Insurance Company providing this\ninsurance. In addition, certain words and phrases are defined as follows:\n1. Weather Event means a natural atmospheric occurrence including but not limited to: a. Wind (including\nhurricanes, tornadoes, and straight-line winds) b. Hail c. Lightning d. Snow, ice, and freezing rain e.\nExcessive rainfall resulting in flooding f. Extreme temperatures causing damage\n2. Named Storm means any storm or weather disturbance that has been declared and named as a tropical\nstorm or hurricane by the National Weather Service or National Hurricane Center.\n3. Actual Cash Value (ACV) means the cost to repair or replace damaged property with new material of like\nkind and quality, less depreciation due to age, wear and condition.\n4. Replacement Cost means the cost to repair or replace damaged property with new material of like kind\nand quality, without deduction for depreciation.\n5. Dwelling means the building structure at the insured location including attached structures and fixtures.\n6. Other Structures means structures on the residence premises separated from the dwelling by clear\nspace or connected only by a fence, utility line, or similar connection.\n7. Personal Property means movable items owned by you and located at the insured property.\n

SECTION II - COVERAGE\nA. PROPERTY COVERAGE\nWe will pay for direct physical loss to property described in the Declarations caused by a Weather Event unless\nthe loss is excluded in Section III - Exclusions.\n1. Dwelling Protection We will cover: a. Your dwelling, including attached structures b. Materials and\nsupplies located on or adjacent to the residence premises for use in construction, alteration, or repair of\nthe dwelling or other structures c. Foundation, floor slab, and footings supporting the dwelling d.\nWall-to-wall carpeting attached to the dwelling\n2. Other Structures Protection We will cover structures on your property separated from your dwelling by\nclear space, including: a. Detached garages b. Storage sheds c. Fences d. Driveways and walkways e.\nPatios and retaining walls\n3. Personal Property Protection We will cover personal property owned or used by you while it is on the\nresidence premises. Coverage includes but is not limited to: a. Furniture b. Clothing c. Electronic\nequipment d. Appliances e. Sporting goods\n4. Loss of Use Protection If a Weather Event makes your residence uninhabitable, we will cover: a.\nAdditional living expenses incurred to maintain your normal standard of living b. Fair rental value if part of\nyour residence is rented to others c. Necessary expenses required to make the residence habitable or\nmove to temporary housing\nB. ADDITIONAL COVERAGES",
  "metadata": {
    "pageCount": 1
  }
}

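Downstream code usually needs only the extracted text and the page count. The following is a minimal Python sketch, assuming the response has already been fetched as a JSON string with the `content` and `metadata.pageCount` fields shown above. The helper name `split_response` and the shortened sample string are illustrative and not part of any Snowflake API.

```python
import json


def split_response(raw: str) -> tuple[str, int]:
    """Return (text, page_count) from a PARSE_DOCUMENT JSON response string."""
    doc = json.loads(raw)
    # "content" holds the extracted text; "metadata.pageCount" the number of pages.
    return doc["content"], doc["metadata"]["pageCount"]


# Shortened, hypothetical sample in the same shape as the response above.
sample = (
    '{"content": "SOME INSURANCE COMPANY\\nWEATHER PROTECTION INSURANCE POLICY",'
    ' "metadata": {"pageCount": 1}}'
)
text, pages = split_response(sample)
print(pages)                   # 1
print(text.splitlines()[0])    # SOME INSURANCE COMPANY
```

Because the `\n` sequences in `content` become real newlines after decoding, the text can be split into lines or sections before it is passed to another function.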

You can perform analysis on this unstructured data by processing the response further with Cortex AI LLM Functions. The following example demonstrates a simple question-answering task:

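-- Assumes the PARSE_DOCUMENT output is stored in the weather_policy_doc column of ocr_example_docs.
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'claude-3-5-sonnet',
    CONCAT('Is clothing covered as part of the weather protection insurance policy?',
           TO_VARCHAR(weather_policy_doc))
  ) FROM ocr_example_docs;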


Response:

Yes, clothing is covered under the insurance policy. According to Section II - Coverage, Part A.3 (Personal Property Protection), clothing is specifically listed as one of the covered personal property items while it is on the residence premises. The policy states: "We will cover personal property owned or used by you while it is on the residence premises. Coverage includes but is not limited to: a. Furniture b. Clothing c. Electronic equipment d. Appliances e. Sporting goods"

Input requirements

PARSE_DOCUMENT is optimized for both digital-born and scanned documents. The following table lists the limitations and requirements of input documents:

Maximum file size	100 MB
Maximum pages per document	300 pages
Allowed file types	PDF, PPTX, DOCX, JPEG, JPG, PNG, TIFF, TIF
Stage encryption	Server-side encryption
Font size	8 point or larger for best results
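The limits in this table can be checked before a file is uploaded to a stage. The following is a minimal Python sketch, assuming the caller already knows the file size and page count. The function name `check_input` and its return shape are illustrative. Stage encryption and font size are not covered by this check.

```python
from pathlib import Path

MAX_FILE_BYTES = 100 * 1024 * 1024  # 100 MB, i.e. 104857600 bytes
MAX_PAGES = 300
ALLOWED_EXTENSIONS = {".pdf", ".pptx", ".docx", ".jpeg", ".jpg", ".png", ".tiff", ".tif"}


def check_input(file_name: str, size_bytes: int, page_count: int) -> list[str]:
    """Return a list of problems; an empty list means the limits are met."""
    problems = []
    ext = Path(file_name).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_FILE_BYTES}")
    if page_count > MAX_PAGES:
        problems.append(f"document has {page_count} pages; limit is {MAX_PAGES}")
    return problems


print(check_input("weather_policy.pdf", 2_000_000, 1))   # []
print(check_input("scan.bmp", 200 * 1024 * 1024, 1))     # two problems reported
```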

Supported document features

Feature	Description
Page Orientation	PARSE_DOCUMENT automatically detects page orientation.
Characters	PARSE_DOCUMENT detects the following characters: a-z A-Z 0-9 À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ Ą ą Ć ć Č č Đ đ Ę ę ı Ł ł Ń ń ō Œ œ Ś ś Š š Ÿ Ź ź Ż ż Ž ž ʒ β δ ε з Ṡ ! “ # $ % & ‘ ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~ ¡ ¢ £ ¥ § © ª « ® ¯ ° ± ² ³ ´ µ ¶ · º » ¿ ‘ † ‡ • ‣ ⁋ ₣ ₤ ₦ ₩ € ₭ ₹ ™ ← ↑ → ↓ ↔ ↕ ↖ ↗ ↘ ↙ ↰ ↱ ↲ ↳ ↴ ↵


Note

PARSE_DOCUMENT is not trained for handwriting recognition.

Supported languages

PARSE_DOCUMENT is deliberately trained for the following languages:

OCR Mode	LAYOUT Mode
English, French, German, Italian, Norwegian, Polish, Portuguese, Spanish, Swedish	English

LAYOUT mode also supports, but is not optimized for, French, German, Italian, Norwegian, Polish, Portuguese, Spanish, and Swedish.

Regional availability

Support for this feature is available to accounts in the following Snowflake regions:

AWS	Azure	Google Cloud Platform
US West 2 (Oregon)	East US 2 (Virginia)	US Central 1 (Iowa)
US East (Ohio)	West US 2 (Washington)
US East 1 (N. Virginia)	Europe (Netherlands)
Europe (Ireland)
Europe Central 1 (Frankfurt)
Asia Pacific (Sydney)
Asia Pacific (Tokyo)

Access control requirements

To use the PARSE_DOCUMENT function, a user with the ACCOUNTADMIN role must grant the SNOWFLAKE.CORTEX_USER database role to the user who will call the function. See the Required privileges topic for details.

Cost considerations

The Cortex PARSE_DOCUMENT function incurs compute costs based on the number of pages per document processed.

For document file formats (PDF, DOCX), each page in the document is billed as a page.
For image file formats (JPEG, JPG, TIF, TIFF, PNG), each individual image file is billed as a page.
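The two billing rules above can be turned into a quick estimate of billable pages for a batch of files. The following is a minimal Python sketch. The function name `billable_pages` and the sample batch are illustrative. PPTX is not listed in either rule above, so the sketch rejects it instead of guessing how it is billed.

```python
from pathlib import Path

# Formats named in the billing rules above.
DOCUMENT_EXTENSIONS = {".pdf", ".docx"}                          # each page is billed
IMAGE_EXTENSIONS = {".jpeg", ".jpg", ".tif", ".tiff", ".png"}    # each file is one page


def billable_pages(file_name: str, page_count: int = 1) -> int:
    """Return the number of pages billed for one file."""
    ext = Path(file_name).suffix.lower()
    if ext in IMAGE_EXTENSIONS:
        return 1
    if ext in DOCUMENT_EXTENSIONS:
        return page_count
    # Billing for other formats (for example .pptx) is not stated above.
    raise ValueError(f"no billing rule known for {ext or file_name!r}")


# Hypothetical batch of (file name, page count) pairs.
batch = [("policy.pdf", 12), ("claim_form.docx", 3), ("photo.png", 1)]
print(sum(billable_pages(name, pages) for name, pages in batch))  # 16
```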

Snowflake recommends executing queries that call the Cortex PARSE_DOCUMENT function in a smaller warehouse (no larger than MEDIUM). Larger warehouses do not increase performance.

Error conditions

Snowflake Cortex PARSE_DOCUMENT can produce the following error messages:

Message	Explanation
Document contains language that is not supported.	Returned when the input document contains an unsupported language.
The provided file format {file_extension} isn’t supported. Supported formats: .[‘.docx’, ‘.pptx’, ‘.pdf’].	Returned when the document is not in a supported format.
The provided file format .bin isn’t supported. Supported formats: [‘.docx’, ‘.pptx’, ‘.pdf’]. Ensure the file is stored with server-side encryption.	Returned when the file format is not supported and the file is interpreted as a binary file. Ensure the stage uses server-side encryption.
Maximum number of 300 pages exceeded.	Returned when the PDF contains more than 300 pages.
Maximum file size of 104857600 bytes exceeded.	Returned when the document is larger than 100 MB.
Provided file cannot be found.	Returned when the file does not exist.
Provided file cannot be accessed.	Returned when the file cannot be accessed due to insufficient privileges.
The Parse Document function did not respond in the allowed time.	Timeout occurred.
Internal error.	A system error occurred. Wait and try again.
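A calling application may want to decide whether a failed call is worth retrying. The following is a minimal Python sketch that maps the messages in the table above to a suggested next step. It assumes the message reaches the caller as a plain string beginning with the table text, which may not hold if the client wraps the error. The action labels and the function name `suggest_action` are this sketch's own convention. Treating a timeout as retryable is also an assumption, since the table recommends retrying only for internal errors.

```python
# (message prefix, suggested action); prefixes are copied from the table above.
ERROR_ACTIONS = [
    ("Internal error", "retry_later"),
    ("The Parse Document function did not respond", "retry_later"),  # assumption
    ("Document contains language that is not supported", "fix_input"),
    ("The provided file format", "fix_input"),
    ("Maximum number of 300 pages exceeded", "fix_input"),
    ("Maximum file size of 104857600 bytes exceeded", "fix_input"),
    ("Provided file cannot be found", "fix_input"),
    ("Provided file cannot be accessed", "check_privileges"),
]


def suggest_action(message: str) -> str:
    """Return a suggested next step for a PARSE_DOCUMENT error message."""
    for prefix, action in ERROR_ACTIONS:
        if message.startswith(prefix):
            return action
    return "unknown"


print(suggest_action("Internal error."))                        # retry_later
print(suggest_action("Maximum number of 300 pages exceeded."))  # fix_input
```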

Legal notices

The data classification of inputs and outputs are as set forth in the following table.

Input data classification	Output data classification	Designation
Usage Data	Customer Data	Generally available functions are Covered AI Features. Preview functions are Preview AI Features. [1]

For additional information, refer to Snowflake AI and ML.