snowflake.snowpark.DataFrameAIFunctions.parse_document
- DataFrameAIFunctions.parse_document(input_column: Union[snowflake.snowpark.column.Column, str], *, output_column: Optional[str] = None, **kwargs) → snowflake.snowpark.DataFrame
Extract content from a document (OCR or layout parsing) as JSON text.
- Parameters:
input_column – The column (Column object or column name as string) containing FILE references to documents or images on a stage. Use to_file to convert staged paths to FILE type.
output_column – The name of the output column to be appended. If not provided, a column named AI_PARSE_DOCUMENT_OUTPUT is appended.
**kwargs – Additional options forwarded to the underlying function, such as mode and page_split.
Examples:
>>> import json
>>> # Parse a PDF document with default OCR mode
>>> _ = session.sql("CREATE OR REPLACE TEMP STAGE mystage ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')").collect()
>>> _ = session.file.put("tests/resources/doc.pdf", "@mystage", auto_compress=False)
>>> from snowflake.snowpark.functions import col, to_file
>>> df = session.create_dataframe([["@mystage/doc.pdf"]], schema=["file_path"])  # staged file path
>>> result_df = df.ai.parse_document(
...     input_column=to_file(col("file_path")),
...     output_column="parsed",
... )
>>> result_df.columns
['FILE_PATH', 'PARSED']
>>> result = json.loads(result_df.collect()[0]["PARSED"])
>>> "Sample PDF" in result["content"] and result["metadata"]["pageCount"] == 3
True

>>> # Parse with LAYOUT mode to extract tables and structure
>>> _ = session.file.put("tests/resources/invoice.pdf", "@mystage", auto_compress=False)
>>> df = session.create_dataframe([["@mystage/invoice.pdf"]], schema=["file_path"])
>>> result_df = df.ai.parse_document(
...     input_column=to_file(col("file_path")),
...     output_column="parsed",
...     mode='LAYOUT',
... )
>>> result = json.loads(result_df.collect()[0]["PARSED"])
>>> "| Customer Name |" in result["content"] and "| Country |" in result["content"]
True

>>> # Parse with page splitting for long documents (PDF only)
>>> df = session.create_dataframe([["@mystage/doc.pdf"]], schema=["file_path"])
>>> result_df = df.ai.parse_document(
...     input_column=to_file(col("file_path")),
...     output_column="parsed",
...     page_split=True,
... )
>>> result = json.loads(result_df.collect()[0]["PARSED"])
>>> len(result["pages"]) == 3 and result["pages"][0]["index"] == 0
True
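The examples above show two output shapes: a top-level "content" key by default, and a "pages" list when page_split=True. The following is a minimal sketch of a client-side helper that normalizes both shapes into a list of page strings. The helper name pages_from_parse_output is hypothetical and not part of Snowpark, the sample payloads are hand-written rather than real service output, and the sketch assumes each entry in "pages" carries its text under a "content" key next to "index".

```python
import json
from typing import List


def pages_from_parse_output(parsed_json: str) -> List[str]:
    """Return the parsed document text as a list of page strings.

    Hypothetical helper, not part of Snowpark. It accepts the JSON text
    stored in the output column of parse_document.
    """
    result = json.loads(parsed_json)
    if "pages" in result:
        # page_split=True shape: one entry per page, ordered by "index".
        # Assumption: each page entry holds its text under "content".
        pages = sorted(result["pages"], key=lambda page: page["index"])
        return [page.get("content", "") for page in pages]
    # Default shape: the whole document under a single "content" key.
    return [result.get("content", "")]


# Hand-written sample payloads that mimic the two shapes (not real output).
whole = json.dumps({"content": "Sample PDF", "metadata": {"pageCount": 3}})
split = json.dumps(
    {
        "pages": [
            {"index": 1, "content": "second"},
            {"index": 0, "content": "first"},
        ],
        "metadata": {"pageCount": 2},
    }
)

print(pages_from_parse_output(whole))
print(pages_from_parse_output(split))
```

In practice the input would be a value such as result_df.collect()[0]["PARSED"] from the examples above. Returning a list in both cases lets downstream code iterate over pages without checking which options were passed.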
This function or method is experimental since 1.39.0.