ParsePdfDocument 2025.3.28.13-SNAPSHOT

BUNDLE

com.snowflake.openflow.runtime | runtime-pdf-nar

DESCRIPTION

Parses a PDF file and extracts its text and additional information into a structured JSON document. Any images or tables found in the document are also extracted and routed to separate relationships. Their contents can optionally be stored in the JSON document itself, as a Base64-encoded image, as text extracted from the image, or both. The processor extracts this information in a way that preserves the original document’s layout, including the hierarchy of its sections.
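
The embedding options described above can be illustrated with a minimal Python sketch. It is not the processor's implementation, and the JSON layout is an assumption: the `elements`, `type`, `base64`, and `text` field names and the `embed_image` and `decode_image` helpers are hypothetical and chosen only for illustration. The sketch shows how an image can be carried inside a JSON document as Base64 text, as extracted text, or both.

```python
import base64
import json


def embed_image(document, image_bytes, ocr_text=None, include_base64=True):
    """Return a copy of `document` with one image element appended.

    The field names used here are hypothetical, not the processor's schema.
    """
    element = {"type": "image"}
    if include_base64:
        # JSON cannot hold raw bytes, so the image is stored as Base64 text.
        element["base64"] = base64.b64encode(image_bytes).decode("ascii")
    if ocr_text is not None:
        # Text extracted from the image (for example, by an OCR service).
        element["text"] = ocr_text
    return {**document, "elements": [*document.get("elements", []), element]}


def decode_image(element):
    """Recover the original image bytes from an embedded element."""
    return base64.b64decode(element["base64"])


if __name__ == "__main__":
    png_header = b"\x89PNG\r\n\x1a\n"  # placeholder bytes, not a full image
    doc = embed_image({"elements": []}, png_header, ocr_text="Figure 1")
    print(json.dumps(doc))
```

A downstream consumer that receives Base64 content can restore the image bytes with `decode_image`, while one that only needs searchable text can read the `text` field and ignore the rest.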

TAGS

document, extract, image, ocr, openflow, parse, pdf, rag, retrieval augmented generation, table, text, unstructured

INPUT REQUIREMENT

REQUIRED

Supports Sensitive Dynamic Properties

false

PROPERTIES

NAME	DESCRIPTION
Communication Timeout	The amount of time to wait for a response from the microservices before timing out.
Custom Element Detection Service URL	The custom URL of the Openflow Document Element Detection Service.
Image Embedding Strategy	When an image is found in the document, this property specifies how the image should be embedded into the document.
OCR Service	The OCR Service used to read files and output text.
Service Location Strategy	Determines how Service Locations are configured within this processor for the Openflow Document Element Detection Service.
Table Embedding Strategy	When a table is found in the document, this property specifies how the table should be embedded into the document.

RELATIONSHIPS

NAME	DESCRIPTION
images	If an image is found in the document, the image will be routed to this relationship.
tables	If a table is found in the document, the image of the table will be routed to this relationship.
success	The text of the PDF is routed to the success relationship.
failure	If the text of a FlowFile cannot be extracted for any reason, the input FlowFile will be routed to this relationship.
comms.failure	If the processor is unable to communicate with one of the necessary services, the input FlowFile will be routed to this relationship.

WRITES ATTRIBUTES

NAME	DESCRIPTION
mime.type	The MIME type is set to ‘application/json’ for the JSON document and ‘image/png’ for any extracted images.
page.count	The number of pages in the PDF file is added to the JSON document.
container.scope	The scope of the container is set to DOCUMENT for the JSON document, TABLE for tables, and FIGURE for any figures or images identified.
document.id	A unique UUID for the document.
fragment.index	The index of the fragment.
fragment.count	The total number of fragments.
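
As a hedged illustration of how a downstream consumer might use these attributes, the Python sketch below groups outputs by `document.id` and orders them by `fragment.index`. It assumes attributes arrive as string-valued dictionaries and that `fragment.count` is the number of fragments sharing one `document.id`; whether indexing starts at 0 or 1 is not stated in this reference, so the sketch only sorts and counts. The `group_by_document` and `is_complete` helpers are hypothetical names.

```python
from collections import defaultdict


def group_by_document(flowfile_attrs):
    """Group attribute dictionaries by document.id, ordered by fragment.index."""
    groups = defaultdict(list)
    for attrs in flowfile_attrs:
        groups[attrs["document.id"]].append(attrs)
    for fragments in groups.values():
        # Attribute values are assumed to be strings, so convert before sorting.
        fragments.sort(key=lambda a: int(a["fragment.index"]))
    return dict(groups)


def is_complete(fragments):
    """Check whether as many fragments were collected as fragment.count reports."""
    expected = int(fragments[0]["fragment.count"])
    return len(fragments) == expected


if __name__ == "__main__":
    received = [
        {"document.id": "doc-a", "container.scope": "TABLE",
         "fragment.index": "2", "fragment.count": "2"},
        {"document.id": "doc-a", "container.scope": "FIGURE",
         "fragment.index": "1", "fragment.count": "2"},
    ]
    for doc_id, fragments in group_by_document(received).items():
        print(doc_id, [f["container.scope"] for f in fragments],
              is_complete(fragments))
```

Grouping on `document.id` rather than on a filename keeps images and tables associated with the JSON document they came from, even when several PDFs are processed at the same time.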
