Processing Unstructured Data Using Java UDFs or UDTFs

This topic describes how to read and process unstructured data in staged files using Java UDFs (user-defined functions) or tabular Java UDFs (user-defined table functions).

Reading File Contents Using Java UDFs or UDTFs

Create Java UDFs or UDTFs using the CREATE FUNCTION command. Follow the guidance provided for the specific object type.

To read the contents of staged files, your Java UDFs or UDTFs can call methods in either the SnowflakeFile class or the InputStream class:

Class         | Input                                                                                    | Notes
SnowflakeFile | Scoped URL, file URL, or string path for files located in an internal or external stage | Easily access additional file attributes, such as file size, from your UDF or UDTF.
InputStream   | Scoped URL, file URL, or string path for files located in an internal or external stage |

Using the Snowpark SnowflakeFile Class

The SnowflakeFile class (in the com.snowflake.snowpark_java.types package) is provided in the snowpark.jar file. For information about snowpark.jar, see Setting Up Your Development Environment for Snowpark Java. Note that Snowpark client applications cannot use this class; therefore, the class is not documented in the Snowpark topics.

The SnowflakeFile class has the following methods:

Method

Description

public static native SnowflakeFile newInstance(String url)

Returns a SnowflakeFile object for the file at the location specified by the url argument.

public synchronized InputStream getInputStream()

Returns a InputStream object for reading the contents of the file.

public synchronized Long getSize()

Returns the size of the file.

Include the following Java code in your function definition to access staged files.

import com.snowflake.snowpark_java.types.SnowflakeFile;
import java.io.IOException;
import java.io.InputStream;

public class <class_name> {
  public static int <handler_name>(String url) throws IOException {
    SnowflakeFile file = SnowflakeFile.newInstance(url);

    try (InputStream ins = file.getInputStream()) {
      while (ins.read() != -1) {
        ...
      }
    }
    return ...;
  }
}

Where:

  • <class_name> is the name of your class.

  • <handler_name> is the name of your Java method, or “handler method”.

For example:

public class MyUDFClass {
  // Handler method: counts and returns the number of bytes in the staged file.
  public static int myUdf(String url) throws IOException {
    SnowflakeFile file = SnowflakeFile.newInstance(url);

    int byteCount = 0;
    try (InputStream ins = file.getInputStream()) {
      while (ins.read() != -1) {
        byteCount++;
      }
    }
    return byteCount;
  }
}
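
The SnowflakeFile class also exposes file attributes. For instance, a handler could call getSize() to skip files over a size limit before reading them. The following is a minimal sketch; the SizeChecker class name and the 10 MB threshold are illustrative, not part of the Snowflake API:

import com.snowflake.snowpark_java.types.SnowflakeFile;
import java.io.IOException;
import java.io.InputStream;

public class SizeChecker {
  // Hypothetical handler: returns the file size in bytes, or -1 for files
  // larger than an illustrative 10 MB threshold (skipped without reading).
  public static long checkedSize(String url) throws IOException {
    SnowflakeFile file = SnowflakeFile.newInstance(url);
    long size = file.getSize();
    if (size > 10L * 1024 * 1024) {
      return -1;
    }
    try (InputStream ins = file.getInputStream()) {
      // Read and process the file contents here.
    }
    return size;
  }
}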

When testing a Java UDF or UDTF, you can use a mock class derived from SnowflakeFile. Override the newInstance() method and all other methods.
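
For example, a mock might serve canned bytes from memory. The following is a minimal sketch, assuming SnowflakeFile exposes a no-argument constructor that subclasses can call; the MockSnowflakeFile name and its contents are illustrative. Because newInstance() is static, a subclass can only hide it (not override it), so test code must call MockSnowflakeFile.newInstance() explicitly:

import com.snowflake.snowpark_java.types.SnowflakeFile;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical test double; not part of the Snowpark API.
public class MockSnowflakeFile extends SnowflakeFile {
  private static final byte[] CONTENTS =
      "col1,col2,col3".getBytes(StandardCharsets.UTF_8);

  // Hides SnowflakeFile.newInstance(String url).
  public static SnowflakeFile newInstance(String url) {
    return new MockSnowflakeFile();
  }

  @Override
  public InputStream getInputStream() {
    return new ByteArrayInputStream(CONTENTS);
  }

  @Override
  public Long getSize() {
    return (long) CONTENTS.length;
  }
}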

Using the InputStream Class

The public class in the Java code used to access staged files must include the following function signature:

public static String <function_name>(InputStream <input_stream_name>)

Snowflake reads the files specified in the input and passes an InputStream object for reading the file contents to your handler method.
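
For example, a handler matching this signature might return the file contents as text. The following is a minimal sketch; the StreamReader class name is illustrative:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamReader {
  // Hypothetical handler: copies the stream that Snowflake passes in and
  // returns the file contents decoded as UTF-8 text.
  public static String readFile(InputStream stream) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    int bytesRead;
    while ((bytesRead = stream.read(chunk)) != -1) {
      buffer.write(chunk, 0, bytesRead);
    }
    return buffer.toString("UTF-8");
  }
}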

Considerations

  • If a query includes a view that calls any UDF or UDTF, regardless of whether the function accesses staged files, the query currently fails with a user error.

  • UDTFs can process multiple files in parallel; however, UDFs currently process files serially. As a workaround, group rows in a subquery using the GROUP BY clause. See Java UDF Examples for an example.

  • Currently, if the staged files referenced in a query are modified or deleted while the query is running, the Java UDF returns an exception.

Access Control Privileges

This section lists the minimum privileges required on objects to perform specific SQL actions.

Creating Java UDFs or UDTFs

Creating, managing, and executing a Java UDF or UDTF requires a role with a minimum of the following privileges:

Object   | Privileges                                      | Notes
Database | USAGE                                           |
Schema   | USAGE, CREATE FUNCTION                          |
Stage    | USAGE (external stage) or READ (internal stage) | Stage where you uploaded the JAR (Java ARchive) file containing the compiled Java source code for the Java UDF (if you have written a pre-compiled UDF), as well as the JAR files for any additional libraries that your Java code depends on.

Owning Java UDFs or UDTFs

After a Java UDF or UDTF is created, the function owner (i.e. the role that has the OWNERSHIP privilege on the function) must have a minimum of the following privileges:

Object   | Privilege                                       | Notes
Database | USAGE                                           |
Schema   | USAGE                                           |
Stage    | USAGE (external stage) or READ (internal stage) | Stage where you uploaded the JAR (Java ARchive) file containing the compiled Java source code for the Java UDF (if you have written a pre-compiled UDF), as well as the JAR files for any additional libraries that your Java code depends on.

Calling Java UDFs or UDTFs

A role that calls a Java UDF or UDTF to process unstructured data must have a minimum of the following privileges:

Object   | Privilege                                       | Notes
Database | USAGE                                           |
Schema   | USAGE                                           | Schema that contains the schema-level objects in this table. If the objects are contained in multiple schemas, the USAGE privilege is required on each.
Stage    | USAGE (external stage) or READ (internal stage) | Stage where you uploaded the JAR (Java ARchive) file containing the compiled Java source code for the Java UDF (if you have written a pre-compiled UDF), the JAR files for any additional libraries that your Java code depends on, and the unstructured files to process.

If the above files are located in different stages, the appropriate privilege for each stage must be granted to the role.

Java UDF Examples

The example in this section processes staged unstructured files using a Java UDF that extracts and returns text from the files.

Prerequisites: Create Stages

The examples use in-line Java UDFs (as opposed to pre-compiled Java UDFs), which means that you do not need to compile, package, and upload the Java code for your UDF to a stage.

However, the examples depend on a separate library that is packaged in a JAR file. You must upload the JAR file for that library to a stage. The examples use an internal stage to store the JAR file for this library.

Although the unstructured data files processed by a Java UDF can be located in the same stage as the JAR files, in these examples, the data files are located in a separate internal stage.

Create the stages using a role that has the minimum required privileges, as described in Access Control Privileges (in this topic).

The following SQL statements create separate internal stages to store the JAR files and the data files for the examples:

-- Create an internal stage to store the JAR files.
CREATE OR REPLACE STAGE jars_stage;

-- Create an internal stage to store the data files. The stage includes a directory table.
CREATE OR REPLACE STAGE data_stage DIRECTORY=(ENABLE=TRUE) ENCRYPTION = (TYPE='SNOWFLAKE_SSE');

Process PDF Files

This example extracts the contents of a specified PDF file using Apache PDFBox.

Complete the following steps to create the Java UDF and upload the required files:

  1. Copy the JAR file for Apache PDFBox from the local temporary directory to the stage that stores JAR files:

    Linux/Mac
    PUT file:///tmp/pdfbox-app-2.0.25.jar @jars_stage AUTO_COMPRESS=FALSE;
    
    Windows
    PUT file://C:\temp\pdfbox-app-2.0.25.jar @jars_stage AUTO_COMPRESS=FALSE;
    
  2. Create a Java UDF to parse PDF documents and retrieve the content from each document. You can use either the SnowflakeFile class or the InputStream class in your UDF code:

    Using the SnowflakeFile Class
    CREATE FUNCTION process_pdf(file string)
    RETURNS string
    LANGUAGE java
    RUNTIME_VERSION = 11
    IMPORTS = ('@jars_stage/pdfbox-app-2.0.25.jar')
    HANDLER = 'PdfParser.readFile'
    as
    $$
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import com.snowflake.snowpark_java.types.SnowflakeFile;
    
    import java.io.IOException;
    
    public class PdfParser {
    
        // Read the staged PDF file at the given URL and return its text content.
        // Returns NULL for encrypted documents.
        public static String readFile(String fileURL) throws IOException {
            SnowflakeFile file = SnowflakeFile.newInstance(fileURL);
            try (PDDocument document = PDDocument.load(file.getInputStream())) {
                if (!document.isEncrypted()) {
                    PDFTextStripper stripper = new PDFTextStripper();
                    return stripper.getText(document);
                }
            }
            return null;
        }
    }
    $$;
    
    Using the InputStream Class
    CREATE FUNCTION process_pdf(file string)
    RETURNS string
    LANGUAGE java
    RUNTIME_VERSION = 11
    IMPORTS = ('@jars_stage/pdfbox-app-2.0.25.jar')
    HANDLER = 'PdfParser.readFile'
    as
    $$
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    
    import java.io.IOException;
    import java.io.InputStream;
    
    public class PdfParser {
    
        // Read the PDF content from the InputStream that Snowflake passes in
        // and return its text content. Returns NULL for encrypted documents.
        public static String readFile(InputStream stream) throws IOException {
            try (PDDocument document = PDDocument.load(stream)) {
                if (!document.isEncrypted()) {
                    PDFTextStripper stripper = new PDFTextStripper();
                    return stripper.getText(document);
                }
            }
            return null;
        }
    }
    $$;
    
  3. Copy the PDF file from the local temporary directory to the stage that stores data files:

    Linux/Mac
    PUT file:///tmp/myfile.pdf @data_stage AUTO_COMPRESS=FALSE;
    
    Windows
    PUT file://C:\temp\myfile.pdf @data_stage AUTO_COMPRESS=FALSE;
    
  4. Refresh the directory table for the data_stage stage:

    ALTER STAGE data_stage REFRESH;
    

Call the Java UDF to read one or more staged PDF files and extract the content:

-- Input a stage name and file path.
SELECT process_pdf('@data_stage/myfile.pdf');

-- Input a file URL generated by the BUILD_STAGE_FILE_URL function.
SELECT process_pdf(build_stage_file_url('@data_stage', '/myfile.pdf'));

-- Input the literal file URL returned by the BUILD_STAGE_FILE_URL function.
SELECT process_pdf('https://myaccount.snowflakecomputing.com/api/files/mydb/myschema/data_stage/myfile.pdf');

-- Input a scoped URL.
SELECT process_pdf(build_scoped_file_url('@data_stage', '/myfile.pdf'));

-- Process all of the PDF files in a directory table serially.
SELECT process_pdf(file_url)
  FROM directory(@data_stage);

-- Process all of the PDF files in a directory table in parallel.
SELECT process_pdf(file_url)
FROM (
    SELECT file_url
    FROM directory(@data_stage)
    GROUP BY file_url
);

Java UDTF Examples

The example in this section extracts and returns data from staged files using a Java UDTF.

Prerequisites: Create Data Stage

Create a stage to store your data files using a role that has the minimum required privileges, as described in Access Control Privileges (in this topic).

The following SQL statement creates an internal stage to store the data files for the example:

-- Create an internal stage to store the data files. The stage includes a directory table.
CREATE OR REPLACE STAGE data_stage DIRECTORY=(ENABLE=TRUE) ENCRYPTION = (TYPE='SNOWFLAKE_SSE');

Process CSV Files

This example extracts the contents of a specified set of CSV files and returns the rows in a table.

Complete the following steps to create the Java UDTF and upload the required files:

  1. Create a Java UDTF that uses the SnowflakeFile class:

    CREATE OR REPLACE FUNCTION parse_csv(file string)
    RETURNS TABLE (col1 string, col2 string, col3 string)
    LANGUAGE JAVA
    HANDLER = 'CsvParser'
    as
    $$
    import java.io.*;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Stream;
    import com.snowflake.snowpark_java.types.SnowflakeFile;
    
    public class CsvParser {
        // Output row type; each public field maps to a column in the RETURNS clause.
        public class Record {
            public String col1;
            public String col2;
            public String col3;
    
            public Record(String col1_value, String col2_value, String col3_value)
            {
                col1 = col1_value;
                col2 = col2_value;
                col3 = col3_value;
            }
        }
    
        public static Class getOutputClass() {
            return Record.class;
        }
    
        // Called for each input row; reads the staged file and returns one Record
        // per CSV line. Assumes each line has at least three comma-separated fields.
        public Stream<Record> process(String file_url) throws IOException {
            SnowflakeFile file = SnowflakeFile.newInstance(file_url);
    
            List<Record> rows = new ArrayList<>();
    
            try (BufferedReader csvReader = new BufferedReader(new InputStreamReader(file.getInputStream()))) {
                String csvRecord;
                while ((csvRecord = csvReader.readLine()) != null) {
                    String[] columns = csvRecord.split(",", 3);
                    rows.add(new Record(columns[0], columns[1], columns[2]));
                }
            } catch (IOException e) {
                throw new RuntimeException("Reading CSV failed.", e);
            }
    
            return rows.stream();
        }
    }
    $$
    ;
    
  2. Copy the CSV file from the local temporary directory to the stage that stores data files:

    Linux/Mac
    PUT file:///tmp/sample.csv @data_stage AUTO_COMPRESS=FALSE;

    Windows
    PUT file://C:\temp\sample.csv @data_stage AUTO_COMPRESS=FALSE;
    
  3. Refresh the directory table for the data_stage stage:

    ALTER STAGE data_stage REFRESH;
    

Call the Java UDTF to read one or more staged CSV files and extract the contents in a table format:

-- Input a file URL.
SELECT * FROM TABLE(PARSE_CSV(BUILD_STAGE_FILE_URL('@data_stage', 'sample.csv')));