Cortex PARSE_DOCUMENT

The PARSE_DOCUMENT function is a Cortex AI task-specific function that extracts text, data, and layout elements from documents. You can use PARSE_DOCUMENT on various documents and forms to implement the following:

  • RAG pipelines powering Cortex Search

  • LLM processing like document summarization or translation using Cortex AI Functions

  • Zero-shot entity extraction using Cortex AI Structured Outputs

How PARSE_DOCUMENT works

The PARSE_DOCUMENT function offers two modes for processing PDF documents:

  • OCR mode is the recommended option for quick, high-quality text extraction from documents such as manuals, agreement contracts, product detail pages, insurance policies and claims, and SharePoint documents.

  • LAYOUT mode is optimized for extracting text and layout elements like tables. This is the recommended option for improving the context of a document knowledge base, both for information retrieval systems and for Large Language Model (LLM) inference.

Note

PARSE_DOCUMENT layout mode is currently in preview. OCR mode is Generally Available in most regions.

Using PARSE_DOCUMENT

Cortex PARSE_DOCUMENT is a SQL function. Because it is fully hosted and managed by Snowflake, using it requires no setup. Point the PARSE_DOCUMENT function to a stage containing your PDF documents to extract data from them. The following example extracts the text and layout information from the file document_1.pdf on the documents stage in the parse_document database and demo schema.

Note

PARSE_DOCUMENT is currently incompatible with custom network policies.

SELECT
  SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    @parse_document.demo.documents,
    'document_1.pdf',
    {'mode': 'LAYOUT'}
  ) AS layout;

Create a stage for document processing

Before you start processing documents, create an internal or external stage to store the relevant documents. When creating your stage, you must enable server-side encryption. Otherwise, PARSE_DOCUMENT returns an error stating that the provided file isn’t in the expected format or is client-side encrypted.

CREATE STAGE input_stage
    DIRECTORY = ( ENABLE = true )
    ENCRYPTION = ( TYPE = 'SNOWFLAKE_SSE' );

Note

Processing files from stages is currently incompatible with custom network policies.

Tip

If you run into the error message “Expiry in seconds for AWS role is invalid” when reading from an external stage, make sure the presigned URL expiry time is set appropriately. The default value of this account parameter is optimized for internal stages, but it can be adjusted to work with external stages. To change it, contact Snowflake support.
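As a rough sketch of the steps that typically follow stage creation, the commands below upload a file to the internal stage input_stage, refresh its directory table, and list what the stage contains. The local path /tmp/weather_policy.pdf is a hypothetical placeholder, and PUT is assumed to be run from a client such as SnowSQL; adjust both to your environment.

```sql
-- Upload a local file to the internal stage.
-- The local path is a hypothetical placeholder.
-- AUTO_COMPRESS = FALSE stores the PDF as-is instead of gzipping it.
PUT file:///tmp/weather_policy.pdf @input_stage AUTO_COMPRESS = FALSE;

-- Refresh the directory table so newly uploaded files become visible.
ALTER STAGE input_stage REFRESH;

-- List the staged files with their sizes in bytes.
SELECT relative_path, size, last_modified
FROM DIRECTORY(@input_stage)
ORDER BY relative_path;
```

Listing the directory table before parsing is a cheap way to confirm that the file names you pass to PARSE_DOCUMENT match what is actually on the stage.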

Example

This example uses PARSE_DOCUMENT’s OCR mode to extract text from the first page of a weather insurance document, shown below.

Example weather insurance document

To parse this document, you would upload it to a stage named document_stage and run the following query:

SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    @document_stage,
    'weather_policy.pdf'
  ) AS weather_policy_doc;

The raw response from PARSE_DOCUMENT is as follows:

{
  "content": "SOME INSURANCE COMPANY\nWEATHER PROTECTION INSURANCE POLICY\nPolicy Number: WP-2025-789456\nEffective Date: April 1, 2025\nExpiration Date: April 1, 2026\nNAMED INSURED AND PROPERTY:\nJohn and Mary Homeowner\n123 Shelter Lane\nWeatherton, ST 12345\n

SECTION I - DEFINITIONS\nThroughout this policy, \"you\" and \"your\" refer to the Named Insured shown in the Declarations and the spouse if a\nresident of the same household. \"We,\" \"us,\" and \"our\" refer to Evergreen Insurance Company providing this\ninsurance. In addition, certain words and phrases are defined as follows:\n1. Weather Event means a natural atmospheric occurrence including but not limited to: a. Wind (including\nhurricanes, tornadoes, and straight-line winds) b. Hail c. Lightning d. Snow, ice, and freezing rain e.\nExcessive rainfall resulting in flooding f. Extreme temperatures causing damage\n2. Named Storm means any storm or weather disturbance that has been declared and named as a tropical\nstorm or hurricane by the National Weather Service or National Hurricane Center.\n3. Actual Cash Value (ACV) means the cost to repair or replace damaged property with new material of like\nkind and quality, less depreciation due to age, wear and condition.\n4. Replacement Cost means the cost to repair or replace damaged property with new material of like kind\nand quality, without deduction for depreciation.\n5. Dwelling means the building structure at the insured location including attached structures and fixtures.\n6. Other Structures means structures on the residence premises separated from the dwelling by clear\nspace or connected only by a fence, utility line, or similar connection.\n7. Personal Property means movable items owned by you and located at the insured property.\n

SECTION II - COVERAGE\nA. PROPERTY COVERAGE\nWe will pay for direct physical loss to property described in the Declarations caused by a Weather Event unless\nthe loss is excluded in Section III - Exclusions.\n1. Dwelling Protection We will cover: a. Your dwelling, including attached structures b. Materials and\nsupplies located on or adjacent to the residence premises for use in construction, alteration, or repair of\nthe dwelling or other structures c. Foundation, floor slab, and footings supporting the dwelling d.\nWall-to-wall carpeting attached to the dwelling\n2. Other Structures Protection We will cover structures on your property separated from your dwelling by\nclear space, including: a. Detached garages b. Storage sheds c. Fences d. Driveways and walkways e.\nPatios and retaining walls\n3. Personal Property Protection We will cover personal property owned or used by you while it is on the\nresidence premises. Coverage includes but is not limited to: a. Furniture b. Clothing c. Electronic\nequipment d. Appliances e. Sporting goods\n4. Loss of Use Protection If a Weather Event makes your residence uninhabitable, we will cover: a.\nAdditional living expenses incurred to maintain your normal standard of living b. Fair rental value if part of\nyour residence is rented to others c. Necessary expenses required to make the residence habitable or\nmove to temporary housing\nB. ADDITIONAL COVERAGES",
  "metadata": {
    "pageCount": 1
  }
}
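The raw response has two top-level fields, content and metadata.pageCount. The following sketch shows one way to pull them out as ordinary columns. It round-trips the result through TO_VARCHAR and PARSE_JSON so that path access works whether the function returns an object or a JSON string; that round trip is a defensive assumption and may be unnecessary in your account.

```sql
-- Parse once in a CTE, then extract individual fields from the result.
WITH parsed AS (
    SELECT
        PARSE_JSON(TO_VARCHAR(
            SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
                @document_stage,
                'weather_policy.pdf'
            )
        )) AS doc
)
SELECT
    doc:content::STRING         AS content,     -- extracted text
    doc:metadata:pageCount::INT AS page_count   -- pages reported in the response
FROM parsed;
```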

By processing this response with Cortex AI LLM Functions, you can perform unstructured data analysis. The following example demonstrates a simple question-answering task. It reads the parsed output from the weather_policy_doc column of a table named ocr_example_docs:

SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'claude-3-5-sonnet',
    CONCAT('Is clothing covered as part of the weather protection insurance policy?',
           TO_VARCHAR(weather_policy_doc))
)
FROM ocr_example_docs;

Response:

Yes, clothing is covered under the insurance policy. According to Section II - Coverage, Part A.3 (Personal Property Protection), clothing is specifically listed as one of the covered personal property items while it is on the residence premises. The policy states: "We will cover personal property owned or used by you while it is on the residence premises. Coverage includes but is not limited to: a. Furniture b. Clothing c. Electronic equipment d. Appliances e. Sporting goods"
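The question-answering query reads from ocr_example_docs, but the example does not show how that table is created. One way it might be populated is sketched below. It assumes document_stage has its directory table enabled and that every PDF on the stage should be parsed in the default (OCR) mode; the relative_path column in the result is an addition for traceability, not part of the original example.

```sql
-- Parse every PDF on the stage and keep the result alongside its file name.
CREATE OR REPLACE TABLE ocr_example_docs AS
SELECT
    relative_path,
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
        @document_stage,
        relative_path
    ) AS weather_policy_doc
FROM DIRECTORY(@document_stage)
WHERE relative_path ILIKE '%.pdf';
```

Storing the parsed output once and querying the table afterwards avoids paying to parse the same pages again for every question.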

Input requirements

PARSE_DOCUMENT is optimized for both digital-born and scanned documents. The following table lists the limitations and requirements of input documents:

Maximum file size            100 MB
Maximum pages per document   300 pages
Allowed file types           PDF, PPTX, DOCX, JPEG, JPG, PNG, TIFF, TIF
Stage encryption             Server-side encryption
Font size                    8 point or larger for best results
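The size and file-type limits can be checked before any pages are processed. The following sketch filters a stage's directory table down to files that satisfy both, using the 104857600-byte limit quoted in the error messages. The stage name input_stage is taken from the earlier example. Page count and font size cannot be checked this way.

```sql
-- Keep only staged files that meet the size and file-type requirements.
SELECT relative_path, size
FROM DIRECTORY(@input_stage)
WHERE size <= 104857600  -- 100 MB, expressed in bytes
  AND LOWER(SPLIT_PART(relative_path, '.', -1)) IN (
        'pdf', 'pptx', 'docx', 'jpeg', 'jpg', 'png', 'tiff', 'tif'
      );
```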

Supported document features

Feature            Description
Page orientation   PARSE_DOCUMENT automatically detects page orientation.
Characters         PARSE_DOCUMENT detects the following characters:

  • a-z

  • A-Z

  • 0-9

  • À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ Ą ą Ć ć Č č Đ đ Ę ę ı Ł ł Ń ń ō Œ œ Ś ś Š š Ÿ Ź ź Ż ż Ž ž ʒ β δ ε з Ṡ

  • ! “ # $ % & ‘ ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~ ¡ ¢ £ ¥ § © ª « ­ ® ¯ ° ± ² ³ ´ µ ¶ · º » ¿ ‘ † ‡ • ‣ ⁋ ₣ ₤ ₦ ₩ € ₭ ₹ ™ ← ↑ → ↓ ↔ ↕ ↖ ↗ ↘ ↙ ↰ ↱ ↲ ↳ ↴ ↵

Note

PARSE_DOCUMENT is not trained for handwriting recognition.

Supported languages

PARSE_DOCUMENT is trained for the following languages:

OCR mode

  • English

  • French

  • German

  • Italian

  • Norwegian

  • Polish

  • Portuguese

  • Spanish

  • Swedish

LAYOUT mode

  • English

LAYOUT mode also supports, but is not optimized for, French, German, Italian, Norwegian, Polish, Portuguese, Spanish, and Swedish.

Regional availability

Support for this feature is available to accounts in the following Snowflake regions:

AWS

  • US West 2 (Oregon)

  • US East (Ohio)

  • US East 1 (N. Virginia)

  • Europe (Ireland)

  • Europe Central 1 (Frankfurt)

  • Asia Pacific (Sydney)

  • Asia Pacific (Tokyo)

Azure

  • East US 2 (Virginia)

  • West US 2 (Washington)

  • Europe (Netherlands)

Google Cloud Platform

  • US Central 1 (Iowa)

Access control requirements

To use the PARSE_DOCUMENT function, a user with the ACCOUNTADMIN role must grant the SNOWFLAKE.CORTEX_USER database role to the user who will call the function. See the Required privileges topic for details.
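One common pattern for making that grant is sketched below. The role name cortex_user_role and the user name some_user are hypothetical placeholders. Routing the database role through a custom account role is a design choice for easier management, not a requirement stated here.

```sql
USE ROLE ACCOUNTADMIN;

-- Hypothetical account role that carries Cortex access.
CREATE ROLE IF NOT EXISTS cortex_user_role;

-- Attach the Cortex database role to the account role.
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE cortex_user_role;

-- Give the account role to the user who will call PARSE_DOCUMENT.
GRANT ROLE cortex_user_role TO USER some_user;
```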

Cost considerations

The Cortex PARSE_DOCUMENT function incurs compute costs based on the number of pages per document processed.

  • For document file formats (PDF, DOCX), each page in the document is billed as a page.

  • For image file formats (JPEG, JPG, TIF, TIFF, PNG), each individual image file is billed as a page.

Snowflake recommends executing queries that call the Cortex PARSE_DOCUMENT function in a smaller warehouse (no larger than MEDIUM). Larger warehouses do not increase performance.
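The following sketch applies both points: it creates a small warehouse for parsing jobs and then estimates the pages processed so far by summing metadata.pageCount over stored results. The warehouse name parse_doc_wh and its AUTO_SUSPEND value are illustrative assumptions. The table and column names come from the question-answering example. The sum is an approximation for document formats, since image files are billed as one page each.

```sql
-- A small warehouse is sufficient; larger sizes do not speed up parsing.
CREATE WAREHOUSE IF NOT EXISTS parse_doc_wh
    WAREHOUSE_SIZE = 'SMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE;

USE WAREHOUSE parse_doc_wh;

-- Approximate pages processed, based on the page counts in stored responses.
SELECT
    COUNT(*) AS documents_parsed,
    SUM(PARSE_JSON(TO_VARCHAR(weather_policy_doc)):metadata:pageCount::INT) AS pages_parsed
FROM ocr_example_docs;
```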

Error conditions

Snowflake Cortex PARSE_DOCUMENT can produce the following error messages:

Message: Document contains language that is not supported.
Explanation: The input document contains an unsupported language.

Message: The provided file format {file_extension} isn’t supported. Supported formats: .[‘.docx’, ‘.pptx’, ‘.pdf’].
Explanation: Returned when the document is not in a supported format.

Message: The provided file format .bin isn’t supported. Supported formats: [‘.docx’, ‘.pptx’, ‘.pdf’]. Ensure the file is stored with server-side encryption.
Explanation: Returned when the file is not in a supported format and is interpreted as a binary file. Ensure that the stage uses server-side encryption.

Message: Maximum number of 300 pages exceeded.
Explanation: Returned when the PDF contains more than 300 pages.

Message: Maximum file size of 104857600 bytes exceeded.
Explanation: Returned when the document is larger than 100 MB.

Message: Provided file cannot be found.
Explanation: The file does not exist.

Message: Provided file cannot be accessed.
Explanation: The file cannot be accessed due to insufficient privileges.

Message: The Parse Document function did not respond in the allowed time.
Explanation: A timeout occurred.

Message: Internal error.
Explanation: A system error occurred. Wait and try again.