Processing Unstructured Data Using Java UDFs or UDTFs¶
This topic provides examples of reading and processing unstructured data in staged files using Java UDFs (user-defined functions) or tabular Java UDFs (user-defined table functions, or UDTFs).
For more information on using Java to develop UDF handlers, including how handler code reads a staged file, see Reading a File Using IMPORTS.
Java UDF Examples¶
The example in this section processes staged unstructured files using a Java UDF that extracts and returns the text content of each file.
Prerequisites: Create Stages¶
The examples use in-line Java UDFs (as opposed to pre-compiled Java UDFs), which means that you do not need to compile, package, and upload the Java code for your UDF to a stage.
However, the examples depend on a separate library that is packaged in a JAR file. You must upload the JAR file for that library to a stage. The examples use an internal stage to store the JAR file for this library.
Although the unstructured data files processed by a Java UDF can be located in the same stage as the JAR files, in these examples, the data files are located in a separate internal stage.
Create the stages using a role that has the minimum required privileges, as described in Granting Privileges for User-Defined Functions.
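For example, assuming a hypothetical role named udf_dev and a hypothetical mydb.myschema schema, grants along the following lines typically suffice (adjust the database, schema, and role names for your environment):

-- Hypothetical role, database, and schema names; adjust for your environment.
GRANT USAGE ON DATABASE mydb TO ROLE udf_dev;
GRANT USAGE ON SCHEMA mydb.myschema TO ROLE udf_dev;
GRANT CREATE STAGE ON SCHEMA mydb.myschema TO ROLE udf_dev;
GRANT CREATE FUNCTION ON SCHEMA mydb.myschema TO ROLE udf_dev;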
The following SQL statements create separate internal stages to store the JAR files and the data files for the examples:
-- Create an internal stage to store the JAR files.
CREATE OR REPLACE STAGE jars_stage;
-- Create an internal stage to store the data files. The stage includes a directory table.
CREATE OR REPLACE STAGE data_stage DIRECTORY=(ENABLE=TRUE) ENCRYPTION = (TYPE='SNOWFLAKE_SSE');
Process PDF Files¶
This example extracts the contents of a specified PDF file using Apache PDFBox.
Complete the following steps to create the Java UDF and upload the required files:
Copy the JAR file for Apache PDFBox from the local temporary directory to the stage that stores JAR files:
- Linux/Mac
PUT file:///tmp/pdfbox-app-2.0.25.jar @jars_stage AUTO_COMPRESS=FALSE;
- Windows
PUT file://C:\temp\pdfbox-app-2.0.25.jar @jars_stage AUTO_COMPRESS=FALSE;
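To confirm that the JAR file was uploaded successfully, you can list the contents of the stage:

-- Verify that the JAR file is present on the stage.
LS @jars_stage;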
Create a Java UDF to parse PDF documents and retrieve the content from each document. You can use either the SnowflakeFile class or the InputStream class in your UDF code:
- Using SnowflakeFile Class
CREATE FUNCTION process_pdf(file STRING)
RETURNS STRING
LANGUAGE JAVA
RUNTIME_VERSION = 11
IMPORTS = ('@jars_stage/pdfbox-app-2.0.25.jar')
HANDLER = 'PdfParser.readFile'
AS
$$
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import com.snowflake.snowpark_java.types.SnowflakeFile;
import java.io.IOException;

public class PdfParser {
    public static String readFile(String fileURL) throws IOException {
        // Open the staged file by stage path or file URL.
        SnowflakeFile file = SnowflakeFile.newInstance(fileURL);
        try (PDDocument document = PDDocument.load(file.getInputStream())) {
            if (!document.isEncrypted()) {
                // Extract all text, sorted by position on the page.
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.setSortByPosition(true);
                return stripper.getText(document);
            }
        }
        return null;
    }
}
$$;
- Using InputStream Class
CREATE FUNCTION process_pdf(file STRING)
RETURNS STRING
LANGUAGE JAVA
RUNTIME_VERSION = 11
IMPORTS = ('@jars_stage/pdfbox-app-2.0.25.jar')
HANDLER = 'PdfParser.readFile'
AS
$$
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.IOException;
import java.io.InputStream;

public class PdfParser {
    public static String readFile(InputStream stream) throws IOException {
        // Snowflake opens the stream to the staged file for you.
        try (PDDocument document = PDDocument.load(stream)) {
            if (!document.isEncrypted()) {
                // Extract all text, sorted by position on the page.
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.setSortByPosition(true);
                return stripper.getText(document);
            }
        }
        return null;
    }
}
$$;
Copy the PDF file from the local temporary directory to the stage that stores data files:
- Linux/Mac
PUT file:///tmp/myfile.pdf @data_stage AUTO_COMPRESS=FALSE;
- Windows
PUT file://C:\temp\myfile.pdf @data_stage AUTO_COMPRESS=FALSE;
Refresh the directory table for the data_stage stage:
ALTER STAGE data_stage REFRESH;
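After the refresh, you can query the directory table to confirm that the staged files are registered. A minimal check:

-- Confirm that the staged files appear in the directory table.
SELECT relative_path, size, file_url FROM DIRECTORY(@data_stage);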
Call the Java UDF to read one or more staged PDF files and extract the content:
-- Input a stage name and file path.
SELECT process_pdf('@data_stage/myfile.pdf');
-- Input a file URL generated by the BUILD_STAGE_FILE_URL function.
SELECT process_pdf(build_stage_file_url('@data_stage', '/myfile.pdf'));
-- Input a literal file URL, such as one returned by the BUILD_STAGE_FILE_URL function.
SELECT process_pdf('https://myaccount.snowflakecomputing.com/api/files/mydb/myschema/data_stage/myfile.pdf');
-- Input a scoped URL.
SELECT process_pdf(build_scoped_file_url('@data_stage', '/myfile.pdf'));
-- Process all of the PDF files in a directory table serially.
SELECT process_pdf(file_url)
FROM directory(@data_stage);
-- Process all of the PDF files in a directory table in parallel.
SELECT process_pdf(file_url)
FROM (
SELECT file_url
FROM directory(@data_stage)
GROUP BY file_url
);
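If you need the extracted text for downstream queries, one option is to materialize the UDF output into a table. The following is a minimal sketch; the pdf_contents table name is hypothetical:

-- Store the extracted text for each staged file in a (hypothetical) table.
CREATE OR REPLACE TABLE pdf_contents AS
  SELECT relative_path, process_pdf(file_url) AS contents
  FROM DIRECTORY(@data_stage);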
Java UDTF Examples¶
The example in this section extracts and returns data from staged files using a Java UDTF.
Prerequisites: Create Data Stage¶
Create a stage to store your data files using a role that has the minimum required privileges, as described in Granting Privileges for User-Defined Functions.
The following SQL statement creates an internal stage to store the data files for the example:
-- Create an internal stage to store the data files. The stage includes a directory table.
CREATE OR REPLACE STAGE data_stage DIRECTORY=(ENABLE=TRUE) ENCRYPTION = (TYPE='SNOWFLAKE_SSE');
Process CSV Files¶
This example extracts the contents of a specified set of CSV files and returns the rows in a table.
Complete the following steps to create the Java UDTF and upload the required files:
Create a Java UDTF that uses the SnowflakeFile class:
CREATE OR REPLACE FUNCTION parse_csv(file STRING)
RETURNS TABLE (col1 STRING, col2 STRING, col3 STRING)
LANGUAGE JAVA
HANDLER = 'CsvParser'
AS
$$
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;
import com.snowflake.snowpark_java.types.SnowflakeFile;

public class CsvParser {

    // One output row per CSV record.
    public class Record {
        public String col1;
        public String col2;
        public String col3;

        public Record(String col1_value, String col2_value, String col3_value) {
            col1 = col1_value;
            col2 = col2_value;
            col3 = col3_value;
        }
    }

    public static Class getOutputClass() {
        return Record.class;
    }

    // Called once per input row; returns a stream of output rows.
    public Stream<Record> process(String file_url) throws IOException {
        SnowflakeFile file = SnowflakeFile.newInstance(file_url);
        List<Record> rows = new ArrayList<>();
        try (BufferedReader csvReader =
                 new BufferedReader(new InputStreamReader(file.getInputStream()))) {
            String csvRecord;
            while ((csvRecord = csvReader.readLine()) != null) {
                // Assumes each line contains at least three comma-separated values.
                String[] columns = csvRecord.split(",", 3);
                rows.add(new Record(columns[0], columns[1], columns[2]));
            }
        }
        return rows.stream();
    }
}
$$;
Copy the CSV file from the local temporary directory to the stage that stores data files:
- Linux/Mac
PUT file:///tmp/sample.csv @data_stage AUTO_COMPRESS=FALSE;
- Windows
PUT file://C:\temp\sample.csv @data_stage AUTO_COMPRESS=FALSE;
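As a quick check, you can list the stage contents with a pattern that matches the uploaded file:

-- Verify the upload; PATTERN takes a regular expression.
LIST @data_stage PATTERN = '.*sample[.]csv';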
Refresh the directory table for the data_stage stage:
ALTER STAGE data_stage REFRESH;
Call the Java UDTF to read one or more staged CSV files and extract the contents in a table format:
-- Input a file URL.
SELECT * FROM TABLE(parse_csv(build_stage_file_url('@data_stage', 'sample.csv')));
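To parse every staged CSV file rather than a single file, you can join the directory table laterally to the UDTF. The following is a sketch that assumes every file on the stage is a three-column CSV:

-- Process every file on the stage; assumes all files are 3-column CSVs.
SELECT d.relative_path, t.col1, t.col2, t.col3
  FROM DIRECTORY(@data_stage) AS d,
       TABLE(parse_csv(d.file_url)) AS t;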