PerformOCR 2025.3.28.13-SNAPSHOT
BUNDLE
com.snowflake.openflow.runtime | runtime-ocr-nar
DESCRIPTION
Uses the Openflow Tesseract OCR Service to extract text from a PDF or image, optionally providing metadata including the bounding box, page number, and confidence level of the OCR.
INPUT REQUIREMENT
REQUIRED
Supports Sensitive Dynamic Properties
false
PROPERTIES
Property | Description |
---|---|
Confidence Threshold | The minimum confidence level required for a text block to be included in the output. Text blocks with a confidence level below this value are excluded. |
Extract PDF Text | If true, the processor attempts to extract text directly from PDF files rather than performing OCR. This can be more efficient and give better results in many cases. If the PDF contains no extractable text, OCR is performed regardless of this setting. |
MIME Type | The MIME type of the input FlowFile, used to determine the format of the input data. |
OCR Service | The OCR Service used to read files and output text. |
Record Writer | Specifies the Controller Service to use for writing the results. If not specified, the results are written to the FlowFile as plaintext. If the Record Writer is specified, each text block is output as an individual Record. In this case, the Record contains not only the text that was found but also the bounding box in the image or PDF where the text was found, as well as the page number and the confidence level of the OCR. Each Record will have the following fields: |
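The effect of Confidence Threshold and the two output modes (plaintext versus one Record per text block) can be sketched as follows. This is a minimal illustration in Python, not the processor's implementation: the `TextBlock` type, its field names, the 0 to 100 confidence scale, the bounding box layout, and the newline-joined plaintext are all assumptions made for the example.

```python
from dataclasses import dataclass, asdict
from typing import List, Tuple, Union


@dataclass
class TextBlock:
    # Hypothetical shape: the real Record field names are not given here.
    text: str
    page: int
    confidence: float  # assumed to be on a 0-100 scale
    bounding_box: Tuple[int, int, int, int]  # assumed (left, top, width, height)


def filter_blocks(blocks: List[TextBlock], threshold: float) -> List[TextBlock]:
    """Drop blocks whose confidence is below the threshold."""
    return [b for b in blocks if b.confidence >= threshold]


def render(blocks: List[TextBlock], use_record_writer: bool) -> Union[str, List[dict]]:
    """With a writer: one record (dict) per block. Without: plain text only."""
    if use_record_writer:
        return [asdict(b) for b in blocks]
    # Joining with newlines is an assumption about the plaintext layout.
    return "\n".join(b.text for b in blocks)


blocks = [
    TextBlock("Invoice", 1, 96.0, (10, 10, 120, 30)),
    TextBlock("l1l|", 1, 31.5, (10, 50, 40, 12)),
    TextBlock("Total: 42.00", 2, 88.2, (10, 400, 160, 24)),
]
kept = filter_blocks(blocks, threshold=60.0)
print(render(kept, use_record_writer=False))
```

In this sketch the low-confidence block is dropped before either output mode is applied, so the threshold affects plaintext and Record output in the same way.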
RELATIONSHIPS
NAME | DESCRIPTION |
---|---|
failure | If the text of a FlowFile cannot be extracted for any reason, the input FlowFile is routed to this relationship. |
comms.failure | If the processor is unable to communicate with the Tesseract OCR Service, the input FlowFile is routed to this relationship. |
success | The extracted text of the PDF or image is routed to this relationship. |
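The routing rules above can be pictured with a small sketch. It is illustrative only: `OcrServiceUnavailable`, `route`, and the `extract` callable are hypothetical names, and how the processor actually tells a communication error apart from other extraction errors is not documented here.

```python
from typing import Callable, Tuple, Union


class OcrServiceUnavailable(Exception):
    """Hypothetical error meaning the OCR service could not be reached."""


def route(content: bytes, extract: Callable[[bytes], str]) -> Tuple[str, Union[str, bytes]]:
    """Return (relationship, payload) for one input FlowFile.

    On success the payload is the extracted text. On either failure
    relationship the payload is the unchanged input content.
    """
    try:
        text = extract(content)
    except OcrServiceUnavailable:
        # Communication problems get their own relationship.
        return "comms.failure", content
    except Exception:
        # Any other extraction problem.
        return "failure", content
    return "success", text
```

Keeping communication failures on a separate relationship lets a flow handle them differently from inputs that fail for other reasons, for example by retrying only the former.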
WRITES ATTRIBUTES
NAME | DESCRIPTION |
---|---|
mime.type | The MIME Type of the FlowFile. |
text.extraction.method | The method used to extract the text from the FlowFile. This will be either ‘PdfExtraction’ or ‘OCR’. |
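How the value of `text.extraction.method` relates to the Extract PDF Text property can be sketched like this. The function name `choose_method`, the `application/pdf` comparison, and the test for "text is available" (non-empty after trimming whitespace) are assumptions for illustration; only the two attribute values come from the table above.

```python
from typing import Optional


def choose_method(mime_type: str, extract_pdf_text: bool, embedded_text: Optional[str]) -> str:
    """Pick the value that would be written to text.extraction.method.

    embedded_text is whatever text could be read directly from the PDF,
    or None when nothing could be read (hypothetical input).
    """
    is_pdf = mime_type == "application/pdf"
    has_text = embedded_text is not None and embedded_text.strip() != ""
    if extract_pdf_text and is_pdf and has_text:
        return "PdfExtraction"
    # Images, scanned PDFs without a text layer, or the property set to false.
    return "OCR"


print(choose_method("application/pdf", True, "Hello"))
print(choose_method("application/pdf", True, ""))
print(choose_method("image/png", True, None))
```

A downstream step could branch on this attribute, for instance to review OCR results more carefully than text read directly from a PDF.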