snowflake.snowpark.functions.ai_parse_document
- snowflake.snowpark.functions.ai_parse_document(file: Column, **kwargs) -> Column
Returns the extracted content from a document as a JSON-formatted string. This function supports two types of extraction: Optical Character Recognition (OCR), and layout.
- Parameters:
file – A FILE type column containing the document to parse. The document must be on a Snowflake stage that uses server-side encryption and is accessible to the user.
**kwargs –
Configuration settings specified as key/value pairs. Supported keys:
- mode: Specifies the parsing mode. Supported modes are:
  - 'OCR': The function extracts text only. This is the default mode.
  - 'LAYOUT': The function extracts layout as well as text, including structural content such as tables.
- page_split: If set to True, the function splits the document into pages and processes each page separately. This feature supports only PDF, PowerPoint (.pptx), and Word (.docx) documents. Documents in other formats return an error. The default is False. Tip: To process long documents that exceed the token limit, set this option to True.
- Returns:
A JSON object (as a string) that contains the extracted data and associated metadata. The keyword arguments passed in **kwargs determine the structure of the returned object.
- If page_split is set to True, the output contains:
  - pages: An array of JSON objects, each containing text extracted from the document.
  - metadata: Contains metadata about the document, such as page count.
  - errorInformation: Contains error information if the document can't be parsed (only on error).
- If page_split is False or not present, the output contains:
  - content: Plain text (in OCR mode) or Markdown-formatted text (in LAYOUT mode).
  - metadata: Contains metadata about the document, such as page count.
  - errorInformation: Contains error information if the document can't be parsed (only on error).
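Because the returned JSON has two shapes, downstream code usually has to branch on which keys are present. The following is a minimal sketch of that branching, not part of the Snowpark API: `extract_text` is a hypothetical helper, and it assumes each entry in `pages` carries `index` and `content` keys, as the examples on this page suggest. The structure of `errorInformation` is not described here, so the helper treats it as opaque.

```python
import json


def extract_text(parsed: str) -> str:
    """Hypothetical helper: return the text from an ai_parse_document result string."""
    result = json.loads(parsed)

    # errorInformation is only present when the document could not be parsed.
    if "errorInformation" in result:
        raise ValueError(f"document could not be parsed: {result['errorInformation']}")

    # With page_split=True the text is spread across the "pages" array.
    # Assumption: each page object has "index" and "content" keys.
    if "pages" in result:
        pages = sorted(result["pages"], key=lambda page: page.get("index", 0))
        return "\n\n".join(page.get("content", "") for page in pages)

    # Without page_split the whole document is under "content".
    return result.get("content", "")
```

Joining pages with a blank line is an arbitrary choice for this sketch; a caller that needs per-page text would iterate over `pages` directly instead.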
Examples:
>>> import json
>>> from snowflake.snowpark.functions import ai_parse_document, to_file
>>> # Parse a PDF document with default OCR mode
>>> _ = session.sql("CREATE OR REPLACE TEMP STAGE mystage ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')").collect()
>>> _ = session.file.put("tests/resources/doc.pdf", "@mystage", auto_compress=False)
>>> df = session.range(1).select(
...     ai_parse_document(to_file("@mystage/doc.pdf")).alias("parsed_content")
... )
>>> result = json.loads(df.collect()[0][0])
>>> "Sample PDF" in result["content"]
True
>>> result["metadata"]["pageCount"]
3
>>> # Parse with LAYOUT mode to extract tables and structure
>>> _ = session.file.put("tests/resources/invoice.pdf", "@mystage", auto_compress=False)
>>> df = session.range(1).select(
...     ai_parse_document(
...         to_file("@mystage/invoice.pdf"),
...         mode='LAYOUT'
...     ).alias("parsed_content")
... )
>>> result = json.loads(df.collect()[0][0])
>>> "| Customer Name |" in result["content"] and "| Country |" in result["content"]  # Markdown format
True
>>> # Parse with page splitting for documents
>>> df = session.range(1).select(
...     ai_parse_document(
...         to_file("@mystage/doc.pdf"),
...         page_split=True
...     ).alias("parsed_content")
... )
>>> result = json.loads(df.collect()[0][0])
>>> len(result["pages"])
3
>>> 'Sample PDF' in result["pages"][0]["content"]
True
>>> result["pages"][0]["index"]
0
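Since page_split is documented as supporting only PDF, .pptx, and .docx files, it can be convenient to catch an unsupported combination before the query runs. The sketch below shows one possible client-side check. `build_parse_options` and `PAGE_SPLIT_EXTENSIONS` are hypothetical names, not part of Snowpark; the check relies only on the file extension in the stage path, and uppercasing the mode is a convenience of this sketch rather than documented behavior.

```python
from pathlib import PurePosixPath

# Extensions this page lists as supported when page_split is True.
PAGE_SPLIT_EXTENSIONS = {".pdf", ".pptx", ".docx"}


def build_parse_options(stage_path: str, mode: str = "OCR", page_split: bool = False) -> dict:
    """Hypothetical helper: validate options before passing them as **kwargs."""
    # Normalize the mode so 'layout' and 'LAYOUT' are treated alike (sketch-only choice).
    mode = mode.upper()
    if mode not in {"OCR", "LAYOUT"}:
        raise ValueError(f"unsupported mode: {mode!r}")

    # Judge the document type purely from the path suffix.
    suffix = PurePosixPath(stage_path).suffix.lower()
    if page_split and suffix not in PAGE_SPLIT_EXTENSIONS:
        raise ValueError(f"page_split is not supported for {suffix or 'extensionless'} files")

    return {"mode": mode, "page_split": page_split}
```

The resulting dictionary could then be unpacked into the call, for example `ai_parse_document(to_file(path), **build_parse_options(path, page_split=True))`.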