Categories:

String & binary functions (Large Language Model)

PARSE_DOCUMENT (SNOWFLAKE.CORTEX)

Returns the extracted content from a document on a Snowflake stage as an OBJECT that contains JSON-encoded objects as strings. This function supports two types of extraction: Optical Character Recognition (OCR) and layout. To learn more, see Cortex PARSE_DOCUMENT.

Syntax

SNOWFLAKE.CORTEX.PARSE_DOCUMENT( '@<stage>', '<path>' [ , { 'mode': '<mode>' } ] )

Arguments

Required:

stage

Name of the Snowflake stage.

path

Relative path to the document on the Snowflake stage.

Optional:

mode

An OBJECT with the key mode that selects the extraction mode, for example {'mode': 'LAYOUT'}. The function returns a value of type OBJECT in which the value for the key content contains the extracted data as a JSON-encoded string. The data is either formatted or plain text, depending on the mode specified in the call:

  • If mode is LAYOUT, the data is Markdown that preserves structural content, including tables.

  • If mode is OCR, the data is the plain text content.

Default: 'OCR'
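
Because mode is optional and defaults to 'OCR', a call with only the two required arguments should behave like an explicit OCR call. The following is a minimal sketch under that assumption; it reuses the stage and file name from the examples on this page, and the column alias parsed is only illustrative.

```sql
-- Sketch: omit the optional mode argument and rely on the documented default ('OCR').
-- The stage and file name are taken from the examples on this page.
SELECT
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
        '@PARSE_DOCUMENT.DEMO.documents',
        'document_1.pdf'
    ) AS parsed;
```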

Returns

An OBJECT that contains the extracted data. The content depends on the mode used in the call:

  • OCR mode (default): JSON (as a string) with the following keys:

    • "content": Text extracted from the document.

    • "errorInformation": Contains error information if extraction fails.

  • LAYOUT mode (preview): JSON (as a string) with the following keys:

    • "content": Markdown-formatted text with tables extracted from the document.

    • "errorInformation": Contains error information if extraction fails.
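
The keys described above can be read individually instead of converting the whole result to a string. The sketch below assumes that the returned OBJECT can be traversed with Snowflake's colon path notation. If the value arrives as a JSON-encoded string in your environment, wrapping it in PARSE_JSON first may be necessary. The aliases doc, content, and error_information are illustrative only.

```sql
-- Sketch: pull the documented keys out of the result as separate columns.
-- Assumes the returned OBJECT supports colon path notation (an assumption,
-- not something this page states explicitly).
WITH parsed AS (
    SELECT
        SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
            '@PARSE_DOCUMENT.DEMO.documents',
            'document_1.pdf',
            {'mode': 'OCR'}
        ) AS doc
)
SELECT
    doc:content::VARCHAR          AS content,            -- extracted text
    doc:errorInformation::VARCHAR AS error_information   -- NULL when the key is absent
FROM parsed;
```

Key names in a path are case-sensitive, so errorInformation has to be written exactly as it appears in the output.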

Examples

OCR mode

SELECT TO_VARCHAR(
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
        '@PARSE_DOCUMENT.DEMO.documents',
        'document_1.pdf',
        {'mode': 'OCR'})
    ) AS OCR;

Output:

{
    "content": "content of the document"
}

LAYOUT mode

This example parses a document with a table shown in the following screenshot:

[Screenshot: Example PDF content with a table]

SELECT TO_VARCHAR(
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
        '@PARSE_DOCUMENT.DEMO.documents',
        'document_1.pdf',
        {'mode': 'LAYOUT'})
    ) AS LAYOUT;

Output:

{
  "content": "# This is PARSE DOCUMENT example
     Example table:
     |Header|Second header|Third Header|
     |:---:|:---:|:---:|
     |First row header|Data in first row|Data in first row|
     |Second row header|Data in second row|Data in second row|

     Some more text."
}

Limitations

Snowflake Cortex functions don't support dynamic table incremental refresh.