Processing Unstructured Data Using Java UDFs or UDTFs

This topic provides examples of reading and processing unstructured data in staged files using Java UDFs (user-defined functions) or tabular Java UDFs (user-defined table functions, or UDTFs).

For more information about using Java to develop UDF handlers, including reading a file with handler code, see Reading a File with a Java UDF Handler.

Java UDF Examples

The example in this section uses a Java UDF to process staged unstructured files, extracting and returning their text content.

Prerequisites: Create Stages

The examples use in-line Java UDFs (as opposed to pre-compiled Java UDFs), which means that you do not need to compile, package, and upload the Java code for your UDF to a stage.
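
By contrast, a pre-compiled Java UDF references a handler that you compile, package, and upload to a stage yourself, and its CREATE FUNCTION statement has no in-line body. The following is a minimal sketch using hypothetical names (MyHandler, my_handler.jar) rather than anything defined in this topic:

-- Hypothetical pre-compiled UDF; MyHandler is packaged in my_handler.jar,
-- which must already be uploaded to @jars_stage.
CREATE FUNCTION my_precompiled_udf(file string)
RETURNS string
LANGUAGE java
RUNTIME_VERSION = 11
IMPORTS = ('@jars_stage/my_handler.jar')
HANDLER = 'MyHandler.process';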

However, the examples depend on a separate library that is packaged in a JAR file. You must upload the JAR file for that library to a stage. The examples use an internal stage to store the JAR file for this library.

Although the unstructured data files processed by a Java UDF can be located in the same stage as the JAR files, in these examples, the data files are located in a separate internal stage.

Create the stages using a role that has the minimum required privileges, as described in Granting Privileges for User-Defined Functions.
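
The following is a minimal sketch of such grants, using a hypothetical role udf_dev and placeholder database and schema names; the linked topic lists the authoritative set of privileges:

-- Hypothetical role and object names; adjust for your environment.
GRANT USAGE ON DATABASE mydb TO ROLE udf_dev;
GRANT USAGE ON SCHEMA mydb.myschema TO ROLE udf_dev;
GRANT CREATE STAGE ON SCHEMA mydb.myschema TO ROLE udf_dev;
GRANT CREATE FUNCTION ON SCHEMA mydb.myschema TO ROLE udf_dev;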

The following SQL statements create separate internal stages to separately store the JAR files and data files for the examples:

-- Create an internal stage to store the JAR files.
CREATE OR REPLACE STAGE jars_stage;

-- Create an internal stage to store the data files. The stage includes a directory
-- table. The SNOWFLAKE_SSE encryption type stores files with server-side encryption,
-- which is required for accessing staged files by pre-signed URL.
CREATE OR REPLACE STAGE data_stage DIRECTORY=(ENABLE=TRUE) ENCRYPTION = (TYPE='SNOWFLAKE_SSE');

Process PDF Files

This example extracts the contents of a specified PDF file using Apache PDFBox.

Complete the following steps to create the Java UDF and upload the required files:

  1. Copy the JAR file for Apache PDFBox from the local temporary directory to the stage that stores JAR files:

    Linux/Mac
    PUT file:///tmp/pdfbox-app-2.0.25.jar @jars_stage AUTO_COMPRESS=FALSE;
    
    Windows
    PUT file://C:\temp\pdfbox-app-2.0.25.jar @jars_stage AUTO_COMPRESS=FALSE;
    
  2. Create a Java UDF to parse PDF documents and retrieve the content from each document. You can use either the SnowflakeFile class or the InputStream class in your UDF code; the SnowflakeFile class provides additional file-handling methods, while the InputStream version simply receives the file contents as a stream:

    Using SnowflakeFile Class
    CREATE OR REPLACE FUNCTION process_pdf(file string)
    RETURNS string
    LANGUAGE java
    RUNTIME_VERSION = 11
    IMPORTS = ('@jars_stage/pdfbox-app-2.0.25.jar')
    HANDLER = 'PdfParser.readFile'
    as
    $$
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import com.snowflake.snowpark_java.types.SnowflakeFile;
    
    import java.io.IOException;
    
    public class PdfParser {
    
        public static String readFile(String fileURL) throws IOException {
            // Open the staged file by URL and load it with PDFBox.
            SnowflakeFile file = SnowflakeFile.newInstance(fileURL);
            try (PDDocument document = PDDocument.load(file.getInputStream())) {
    
                if (!document.isEncrypted()) {
                    // Extract the text of every page.
                    PDFTextStripper stripper = new PDFTextStripper();
                    return stripper.getText(document);
                }
            }
    
            // Encrypted documents cannot be read; return NULL to the caller.
            return null;
        }
    }
    $$;
    
    Using InputStream Class
    CREATE OR REPLACE FUNCTION process_pdf(file string)
    RETURNS string
    LANGUAGE java
    RUNTIME_VERSION = 11
    IMPORTS = ('@jars_stage/pdfbox-app-2.0.25.jar')
    HANDLER = 'PdfParser.readFile'
    as
    $$
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    
    import java.io.IOException;
    import java.io.InputStream;
    
    public class PdfParser {
    
        public static String readFile(InputStream stream) throws IOException {
            // Snowflake opens the staged file and passes its contents as a stream.
            try (PDDocument document = PDDocument.load(stream)) {
    
                if (!document.isEncrypted()) {
                    // Extract the text of every page.
                    PDFTextStripper stripper = new PDFTextStripper();
                    return stripper.getText(document);
                }
            }
            // Encrypted documents cannot be read; return NULL to the caller.
            return null;
        }
    }
    $$;
    
  3. Copy the PDF file from the local temporary directory to the stage that stores data files:

    Linux/Mac
    PUT file:///tmp/myfile.pdf @data_stage AUTO_COMPRESS=FALSE;
    
    Windows
    PUT file://C:\temp\myfile.pdf @data_stage AUTO_COMPRESS=FALSE;
    
  4. Refresh the directory table for the data_stage stage so that it registers the newly uploaded file (a verification sketch follows this step):

    ALTER STAGE data_stage REFRESH;
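
After completing these steps, you can confirm that both uploads succeeded. The following statements use standard commands and directory table columns, although your file names and sizes will differ:

-- Confirm that the JAR file is on the stage.
LIST @jars_stage;

-- Confirm that the directory table registered the PDF file.
SELECT relative_path, size, file_url
  FROM directory(@data_stage);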
    

Call the Java UDF to read one or more staged PDF files and extract the content:

-- Input a stage name and file path.
SELECT process_pdf('@data_stage/myfile.pdf');

-- Input a file URL generated by the BUILD_STAGE_FILE_URL function.
SELECT process_pdf(build_stage_file_url('@data_stage', '/myfile.pdf'));

-- Input a literal file URL, like the value that the BUILD_STAGE_FILE_URL function returns.
SELECT process_pdf('https://myaccount.snowflakecomputing.com/api/files/mydb/myschema/data_stage/myfile.pdf');

-- Input a scoped URL.
SELECT process_pdf(build_scoped_file_url('@data_stage', '/myfile.pdf'));

-- Process all of the PDF files in a directory table serially.
SELECT process_pdf(file_url)
  FROM directory(@data_stage);

-- Process all of the PDF files in a directory table in parallel.
SELECT process_pdf(file_url)
FROM (
    SELECT file_url
    FROM directory(@data_stage)
    GROUP BY file_url
);
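
To keep the extracted text for later querying, you can write the UDF output to a table. The following is a minimal sketch, assuming a hypothetical table name pdf_contents:

-- Hypothetical table name; extracts text from every file in the directory table.
CREATE OR REPLACE TABLE pdf_contents AS
  SELECT relative_path, process_pdf(file_url) AS contents
    FROM directory(@data_stage);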

Java UDTF Examples

The example in this section uses a Java UDTF to extract and return data from staged files.

Prerequisites: Create Data Stage

Create a stage to store your data files using a role that has the minimum required privileges, as described in Granting Privileges for User-Defined Functions.

The following SQL statement creates an internal stage to store the data files for the example. If you already created data_stage for the Java UDF examples above, note that CREATE OR REPLACE drops the existing stage, along with any files on it:

-- Create an internal stage to store the data files. The stage includes a directory table.
CREATE OR REPLACE STAGE data_stage DIRECTORY=(ENABLE=TRUE) ENCRYPTION = (TYPE='SNOWFLAKE_SSE');

Process CSV Files

This example extracts the contents of a specified set of CSV files and returns the rows in a table. Each line is split on its first two commas into three columns.

Complete the following steps to create the Java UDTF and upload the required files:

  1. Create a Java UDTF that uses the SnowflakeFile class:

    CREATE OR REPLACE FUNCTION parse_csv(file string)
    RETURNS TABLE (col1 string, col2 string, col3 string )
    LANGUAGE JAVA
    HANDLER = 'CsvParser'
    as
    $$
    import com.snowflake.snowpark_java.types.SnowflakeFile;
    
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Stream;
    
    public class CsvParser {
        // Output row class: each public field becomes a column of the returned table.
        public static class Record {
            public String col1;
            public String col2;
            public String col3;
    
            public Record(String col1_value, String col2_value, String col3_value)
            {
                col1 = col1_value;
                col2 = col2_value;
                col3 = col3_value;
            }
        }
    
        public static Class getOutputClass() {
            return Record.class;
        }
    
        public Stream<Record> process(String file_url) throws IOException {
            // Open the staged file by URL.
            SnowflakeFile file = SnowflakeFile.newInstance(file_url);
    
            List<Record> rows = new ArrayList<>();
    
            // Read the file line by line; each line is expected to contain at least
            // three comma-separated values (the third keeps any remaining commas).
            try (BufferedReader csvReader = new BufferedReader(new InputStreamReader(file.getInputStream()))) {
                String csvRecord;
                while ((csvRecord = csvReader.readLine()) != null) {
                    String[] columns = csvRecord.split(",", 3);
                    rows.add(new Record(columns[0], columns[1], columns[2]));
                }
            } catch (IOException e) {
                throw new RuntimeException("Reading CSV failed.", e);
            }
    
            return rows.stream();
        }
    }
    $$
    ;
    
  2. Copy the CSV file from the local temporary directory to the stage that stores data files:

    Linux/Mac
    PUT file:///tmp/sample.csv @data_stage AUTO_COMPRESS=FALSE;
    
    Windows
    PUT file://C:\temp\sample.csv @data_stage AUTO_COMPRESS=FALSE;
    
    
  3. Refresh the directory table for the data_stage stage:

    ALTER STAGE data_stage REFRESH;
    

Call the Java UDTF to read one or more staged CSV files and extract the contents in a table format:

-- Input a file URL.
SELECT * FROM TABLE(PARSE_CSV(BUILD_STAGE_FILE_URL(@data_stage, 'sample.csv')));
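
To process every file registered in the directory table with the UDTF, you can pass each row's path to the function in a lateral join. The following is a sketch, assuming all files on the stage are CSV files with at least three columns:

-- Parse each file listed in the directory table.
SELECT d.relative_path, t.col1, t.col2, t.col3
  FROM directory(@data_stage) d,
       TABLE(parse_csv(build_scoped_file_url(@data_stage, d.relative_path))) t;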