Categories: String & binary functions (Large Language Model)
PARSE_DOCUMENT (SNOWFLAKE.CORTEX)¶
Returns the extracted content from a document on a Snowflake stage as an OBJECT that contains JSON-encoded objects as strings. This function supports two types of extraction: Optical Character Recognition (OCR) and layout. To learn more, see The Cortex PARSE_DOCUMENT function.
Syntax¶
SNOWFLAKE.CORTEX.PARSE_DOCUMENT( '@<stage>', '<path>' [ , { 'mode': '<mode>' } ] )
Arguments¶
Required:
stage
Name of the Snowflake stage.
path
Relative path to the document on the Snowflake stage.
Optional:
mode
Extraction mode to use. The function returns a value of type OBJECT. In the object, the value for the key content contains the extracted data as a JSON-encoded string. The data is either formatted or plain text, depending on the mode specified in the call:
If mode is LAYOUT, the data is markdown with structural content, including tables.
If mode is OCR, the data is the text content.
Default:
'OCR'
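As a rough illustration of how the arguments fit together, the following Python sketch assembles a call to the function as a SQL string. The helper name `build_parse_document_sql` is hypothetical and not part of Snowflake. Uppercasing the mode and rejecting anything other than the two documented values is a choice made by this sketch, not documented behavior of the function.

```python
# Hypothetical helper: builds the SQL text for a PARSE_DOCUMENT call.
# It only formats a string; it does not connect to Snowflake.
VALID_MODES = {"OCR", "LAYOUT"}  # the two modes described above


def build_parse_document_sql(stage: str, path: str, mode: str = "OCR") -> str:
    # Normalize to the spelling used in the documentation (sketch's choice).
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    # Double any single quotes so the values stay inside their SQL literals.
    stage_lit = stage.replace("'", "''")
    path_lit = path.replace("'", "''")
    return (
        "SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT("
        f"'@{stage_lit}', '{path_lit}', {{'mode': '{mode}'}})"
    )


# Omitting the mode falls back to 'OCR', mirroring the documented default.
print(build_parse_document_sql("PARSE_DOCUMENT.DEMO.documents", "document_1.pdf"))
```

The resulting string could then be passed to whichever SQL client is in use.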
Returns¶
An OBJECT data type that contains the extracted data. The content depends on the mode used in the call:
LAYOUT mode: JSON with key “content” containing markdown with tables extracted from the document.
OCR mode: JSON with key “content” containing the text content from the document.
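A minimal sketch of reading the returned value on the client side, assuming the client delivers the OBJECT as a JSON string (this depends on the driver in use). The function name `extract_content` and the sample payload are hypothetical and only illustrate the shape described above.

```python
import json


def extract_content(result_json: str) -> str:
    # Decode the JSON text of the returned OBJECT and read its "content" key.
    parsed = json.loads(result_json)
    return parsed["content"]


# Hypothetical payload in the documented shape: one "content" key.
sample = json.dumps({"content": "Plain text extracted from the document."})
print(extract_content(sample))
```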
Examples¶
OCR mode¶
SELECT TO_VARCHAR(
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
        '@PARSE_DOCUMENT.DEMO.documents',
        'document_1.pdf',
        {'mode': 'OCR'}
    ):content
) AS OCR;
Output:
{
"content of the document"
}
LAYOUT mode¶
This example parses a document with a table shown in the following screenshot:
SELECT TO_VARCHAR(
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
        '@PARSE_DOCUMENT.DEMO.documents',
        'document_1.pdf',
        {'mode': 'LAYOUT'}
    ):content
) AS LAYOUT;
Output:
{
"content": "# This is PARSE DOCUMENT example
Example table:
|Header|Second header|Third Header|
|:---:|:---:|:---:|
|First row header|Data in first row|Data in first row|
|Second row header|Data in second row.|Data in second row.|
Some more text."
}
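Because LAYOUT mode returns markdown, tables in the content can be recovered with ordinary text processing. The sketch below is one hedged way to do that for simple pipe tables like the one in the output above. `parse_markdown_table` is a hypothetical helper, and it assumes a single table with no escaped pipe characters inside cells.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Return the data rows of the first pipe table as dicts keyed by header."""
    header = None
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        # Table lines start and end with a pipe; skip everything else.
        if len(line) < 2 or not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # The alignment row contains only colons and dashes, e.g. |:---:|.
        if all(cell and set(cell) <= {":", "-"} for cell in cells):
            continue
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# The markdown from the LAYOUT example output above.
content = """# This is PARSE DOCUMENT example
Example table:
|Header|Second header|Third Header|
|:---:|:---:|:---:|
|First row header|Data in first row|Data in first row|
|Second row header|Data in second row.|Data in second row.|
Some more text."""

for row in parse_markdown_table(content):
    print(row)
```

Each data row becomes a dictionary keyed by the header cells, so the first row maps "Header" to "First row header".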