Process unstructured data with UDF and procedure handlers¶
This topic provides examples of reading and processing unstructured data in staged files with handler code written for the following:
You can also read a file with handlers written in other languages:
- Python:
- Scala:
Note
To make your code resilient to file injection attacks, always use a scoped URL when passing a file’s location to a UDF, particularly when the function’s caller is not also its owner. You can create a scoped URL in SQL using the built-in function BUILD_SCOPED_FILE_URL. For more information about what the BUILD_SCOPED_FILE_URL does, see Introduction to unstructured data.
Process a PDF with a UDF and procedure¶
The examples in this section process staged unstructured files using Java handler code – first with a UDF, then with a procedure. Both handlers extract the contents of a specified PDF file using the Apache PDFBox library.
The handler code is very similar between the UDF and procedure. They differ in how the read the incoming PDF file.
In the UDF, the handler reads the file using a Java
InputStream.In the procedure, the handler reads the file using a Snowflake
SnowflakeFile.
The examples use in-line handler code (as opposed to compiled in a staged JAR), which means that you do not need to compile, package, and upload the handler code to a stage. For more information on the difference between in-line and staged handlers, see Keeping handler code in-line or on a stage.
Download the PDFBox library¶
Before you begin writing the UDF, download the PDFBox library JAR file if you don’t have it already. It will be a dependency for your handler code. You’ll later upload the library JAR file to a stage.
Download the latest released version of the library from the Apache PDFBox library download page.
Create stages¶
Create stages in which to keep your handler code’s dependency libraries and the data file the handler code will read.
Using the code below, you’ll create separate internal stages to hold:
A library JAR file that’s a dependency for your handler. You’ll reference the stage and JAR file from the UDF.
A data file that your handler code will read.
Code in the following example uses the CREATE STAGE command to create the stages you’ll need.
Upload the required library and the PDF file to read¶
Complete the following steps to upload the dependency JAR file (with the library code that processes the PDF) and the data file (the PDF file the handler code will process).
You can use the PDF file of your choosing in this example.
Copy the JAR file for Apache PDFBox from the local temporary directory to the stage that stores JAR files:
- Linux/Mac:
- Windows:
Copy the PDF file from the local temporary directory to the stage that stores data files:
- Linux/Mac:
- Windows:
Create and call the UDF¶
Complete the following steps to create a UDF that reads and processes PDF files.
Paste and run the following code to create a UDF.
This UDF’s handler parses PDF documents and retrieves their content. The handler uses the
InputStreamclass to read the file. For more on reading files withInputStream, refer to Reading a dynamically-specified file with InputStream.Refresh the directory table for the
data_stagestage with the ALTER STAGE command:Call the UDF to read the staged PDF file and extract the content.
Code in the following example calls the UDF, passing a scoped URL to make the code resilient to file injection attacks. Always use a scoped URL when the function’s caller is not also its owner. You can pass the URL argument as a scoped URL or another form when the UDF’s caller is also its owner.
Create and call the procedure¶
Complete the following steps to create a procedure that reads and processes PDF files.
Paste and run the following code to create a procedure.
This procedure’s handler parses PDF documents and retrieves their content. The handler uses the
SnowflakeFileclass to read the file. For more on reading files withSnowflakeFile, refer to Reading a dynamically-specified file with SnowflakeFile.Refresh the directory table for the
data_stagestage with the ALTER STAGE command:Call the procedure to read the staged PDF file and extract the content.
Code in the following example passes a scoped URL pointing to the PDF file on the stage you created.
Process a CSV with a UDTF¶
The example in this section extracts and returns data from staged files using Java UDTFs.
Create data stage¶
Create a stage using the CREATE STAGE command:
The following SQL statement creates an internal stages to store the data files for the example:
Upload the CSV file to read¶
Copy the CSV file from the local temporary directory to the stage that stores data files:
- Linux/Mac:
- Windows:
Create and call the UDTF¶
This example extracts the contents of a specified set of CSV files and returns the rows in a table. By processing file data as it’s read from the source, you can avoid potential out-of-memory errors that might arise when the file is very large.
Code in the following UDTF handler example uses SnowflakeFile to generate an InputStream from a file URL to read a CSV
file. (In a Java UDTF handler, row processing begins when Snowflake calls the process method you implement.) The code uses the
stream when constructing an instance of a CsvStreamingReader class defined in the handler itself.
The CsvStreamingReader class reads the contents of the received CSV file stream row by row, providing a way for other code to
retrieve each row as a record where commas delimit columns. The process method returns each record as it is read from the stream.
For more about writing tabular user-defined functions (UDTFs) with a Java handler, see Tabular Java UDFs (UDTFs).
Complete the following steps to create the Java UDTF and upload the required files:
Create a Java UDTF that uses the
SnowflakeFileclass:Refresh the directory table for the
data_stagestage:Call the Java UDTF to read one or more staged CSV files and extract the contents in a table format:
Code in the following example calls the UDF, passing a scoped URL to reduce the risk of file injection attacks. Always used a scoped URL when the function’s caller is not also its owner. You can pass the URL argument as a scoped URL or another supported form when the UDF’s caller is also its owner.