Integrating Apache Hive Metastores with Snowflake

This topic provides instructions for using the Hive metastore connector for Snowflake to integrate Apache Hive metastores with Snowflake using external tables. The connector detects metastore events and transmits them to Snowflake to keep the external tables synchronized with the Hive metastore. This allows users to manage their schema in Hive while querying it from Snowflake.

The Hive connector currently integrates all tables across databases and schemas in a metastore with a single Snowflake database and schema.

Note

This feature is currently only supported for Hive tables that reference data files stored in Amazon S3 or Google Cloud Storage. Support for Microsoft Azure is planned.

In this Topic:

Installing and Configuring the Hive Metastore Connector

This section provides detailed instructions for installing and configuring the Hive metastore connector for Snowflake.

Prerequisites

The Hive connector for Snowflake has the following prerequisites:

Snowflake database and schema

Stores the external tables that map to the Hive tables in the metastore.

Designated Snowflake user

The connector will be configured to execute operations on the external tables as this user.

Storage integration

Storage integrations enable configuring secure access to external cloud storage without passing explicit cloud provider credentials such as secret keys or access tokens. Create a storage integration to access cloud storage locations referenced in Hive tables using CREATE STORAGE INTEGRATION.

The STORAGE_ALLOWED_LOCATIONS parameter for the storage integration must list the same storage containers (Amazon S3 or Google Cloud Storage) as the ones referenced in the Location parameter of the Hive tables in your metastore.

Role

The role must be assigned to the designated Snowflake user and include the following object privileges on the other Snowflake objects identified in this section:

Object

Privileges

Database

USAGE

Schema

USAGE , CREATE STAGE , CREATE EXTERNAL TABLE

Storage integration

USAGE

Step 1: Install the Connector

Complete the following steps to install the connector:

  1. Download the connector JAR file and configuration XML file from the Maven Central Repository:

    http://search.maven.org/ (or https://repo1.maven.org/maven2/net/snowflake/snowflake-hive-metastore-connector/)

  2. Copy the JAR file to the lib directory in the Hive classpath. The location can vary depending on the Hive installation. To determine the classpath, check the HIVE_AUX_JARS_PATH environment variable.

  3. Create a file named snowflake-config.xml file in the conf directory in the Hive classpath.

  4. Open the snowflake-config.xml file in a text editor and populate it with the following <name> properties and corresponding <values>:

    snowflake.jdbc.username

    Specifies the login name of the Snowflake user designated for refresh operations on the external tables.

    snowflake.jdbc.password

    Specifies the password for the login name. Note that you can set a placeholder for the password based on a system property or environment variable, depending on your Hadoop version. The configuration behaves like other Hadoop configurations. For more information, see the Hadoop documentation.

    snowflake.jdbc.account

    Specifies the name of your account (provided by Snowflake), e.g. xy12345.

    snowflake.jdbc.db

    Specifies an existing Snowflake database to use for the Hive metastore integration. See the Prerequisites section (in this topic) for more information.

    snowflake.jdbc.schema

    Specifies an existing Snowflake schema in the specified database. See the Prerequisites section (in this topic) for more information.

    snowflake.jdbc.role

    Specifies the access control role to use by the Hive connector. The role should be an existing role that has already been assigned to the specified user.

    If no role is specified here, then the Hive connector uses the default role for the specified user.

    snowflake.jdbc.connection

    Specifies the full name of your account (provided by Snowflake) in the following format:

    jdbc:snowflake://<account_name>.snowflakecomputing.com

    Note that your full account name may include additional segments that identify the region and cloud platform where your account is hosted.

    Account name examples by region

    If your account name is xy12345:

    Cloud Platform / Region

    Full Account Name

    AWS

    US West (Oregon)

    xy12345

    US East (Ohio)

    xy12345.us-east-2.aws

    US East (N. Virginia)

    xy12345.us-east-1

    US East (Commercial Gov - N. Virginia)

    xy12345.us-east-1-gov.aws

    Canada (Central)

    xy12345.ca-central-1.aws

    EU (Ireland)

    xy12345.eu-west-1

    EU (Frankfurt)

    xy12345.eu-central-1

    Asia Pacific (Tokyo)

    xy12345.ap-northeast-1.aws

    Asia Pacific (Mumbai)

    xy12345.ap-south-1.aws

    Asia Pacific (Singapore)

    xy12345.ap-southeast-1

    Asia Pacific (Sydney)

    xy12345.ap-southeast-2

    GCP

    US Central1 (Iowa)

    xy12345.us-central1.gcp

    Europe West2 (London)

    xy12345.europe-west2.gcp

    Europe West4 (Netherlands)

    xy12345.europe-west4.gcp

    Azure

    West US 2 (Washington)

    xy12345.west-us-2.azure

    East US 2 (Virginia)

    xy12345.east-us-2.azure

    US Gov Virginia

    xy12345.us-gov-virginia.azure

    Canada Central (Toronto)

    xy12345.canada-central.azure

    West Europe (Netherlands)

    xy12345.west-europe.azure

    Switzerland North (Zurich)

    xy12345.switzerland-north.azure

    Southeast Asia (Singapore)

    xy12345.southeast-asia.azure

    Australia East (New South Wales)

    xy12345.australia-east.azure

    Important

    If either of the following conditions is true, your account name is different than the structure described in this example:

    • If your Snowflake Edition is VPS, please contact Snowflake Support for details about your account name.

    • If AWS PrivateLink is enabled for your account, your account name requires an additional privatelink segment. For more details, see AWS PrivateLink & Snowflake.

    snowflake.hive-metastore-connector.integration

    Specifies the name of the storage integration object to use for secure access to the external storage locations referenced in Hive tables in the metastore. See the Prerequisites section (in this topic) for more information.

    Optionally add the following property:

    snowflake.hive-metastore-listener.database-filter-regex

    Specifies the names of any databases in the Hive metastore to skip with the integration. Using this property enables you to control which databases to integrate with Snowflake. This option is especially useful when multiple tables have the same name across Hive databases. Currently, in this situation, the Hive connector creates the first table with the name in the Snowflake target database but skips additional tables with the same name.

    For example, suppose databases mydb1, mydb2, and mydb3 all contain a table named table1. You could omit all databases with the naming convention mydb<number> except for mydb1 by adding the regular expression mydb[^1] as the property value.

    Sample property node

    <configuration>
      ..
      <property>
        <name>snowflake.hive-metastore-listener.database-filter-regex</name>
        <value>mydb[^1]</value>
      </property>
    </configuration>
    

    Sample :file:`snowflake-config.xml` file

    <configuration>
      <property>
        <name>snowflake.jdbc.username</name>
        <value>jsmith</value>
      </property>
      <property>
        <name>snowflake.jdbc.password</name>
        <value>mySecurePassword</value>
      </property>
      <property>
        <name>snowflake.jdbc.role</name>
        <value>custom_role1</value>
      </property>
      <property>
        <name>snowflake.jdbc.account</name>
        <value>myaccount</value>
      </property>
      <property>
        <name>snowflake.jdbc.db</name>
        <value>mydb</value>
      </property>
      <property>
        <name>snowflake.jdbc.schema</name>
        <value>myschema</value>
      </property>
      <property>
        <name>snowflake.jdbc.connection</name>
        <value>jdbc:snowflake://myaccount.snowflakecomputing.com</value>
      </property>
      <property>
        <name>snowflake.hive-metastore-listener.integration</name>
        <value>s3_int</value>
      </property>
    </configuration>
    
  5. Save the changes to the file.

  6. Open the existing Hive configuration file (hive-site.xml) in a text editor. Add the connector to the configuration file, as follows:

    <configuration>
     ...
     <property>
      <name>hive.metastore.event.listeners</name>
      <value>net.snowflake.hivemetastoreconnector.SnowflakeHiveListener</value>
     </property>
    </configuration>
    

    Note

    If there are other connectors already configured in this file, add the Hive connector for Snowflake in a comma-separated list in the <value> node.

  7. Save the changes to the file.

  8. Restart the Hive metastore service.

Step 2: Validate the Installation

  1. Create a new table in Hive.

  2. Query the list of external tables in your Snowflake database and schema using SHOW EXTERNAL TABLES:

    SHOW EXTERNAL TABLES IN <database>.<schema>;
    

    Where database and schema are the database and schema you specified in the snowflake-config.xml file in Step 1: Install the Connector (in this topic).

    The results should show an external table with the same name as the new Hive table.

Connector records are written to the Hive metastore logs. Queries executed by the connector can be viewed in the Snowflake QUERY_HISTORY view/function output similar to other queries.

Integrating Existing Hive Tables and Partitions with Snowflake

To integrate existing Hive tables and partitions with Snowflake, execute the following command in Hive for each table and partition:

ALTER TABLE <table_name> TOUCH [PARTITION partition_spec];

For more information, see the Hive documentation.

Alternatively, Snowflake provides a script for synching existing Hive tables and partitions. For information, see the GitHub project page

Features

Supported Hive Operations and Table Types

Hive Operations

The connector supports the following Hive operations:

  • Create table

  • Drop table

  • Alter table add column

  • Alter table drop column

  • Alter (i.e. touch) table

  • Add partition

  • Drop partition

  • Alter (touch) partition

Hive Table Types

The connector supports the following types of Hive tables:

  • External and managed tables

  • Partitioned and unpartitioned tables

Hive and Snowflake Data Types

The following table shows the mapping between Hive and Snowflake data types:

Hive

Snowflake

BIGINT

BIGINT

BINARY

BINARY

BOOLEAN

BOOLEAN

CHAR

CHAR

DATE

DATE

DECIMAL

DECIMAL

DOUBLE

DOUBLE

DOUBLE PRECISION

DOUBLE

FLOAT

FLOAT

INT

INT

INTEGER

INT

NUMERIC

DECIMAL

SMALLINT

SMALLINT

STRING

STRING

TIMESTAMP

TIMESTAMP

TINYINT

SMALLINT

VARCHAR

VARCHAR

All other data types

VARIANT

Note

During this preview, the data type mappings are still being tested. There may be differences in the semantics of the different types.

Supported File Formats and Options

The following data file formats and Hive file format options are supported:

  • CSV

    The following options are supported using the SerDe (Serializer/Deserializer) properties:

    • field.delim / separatorChar

    • line.delim

    • escape.delim / escapeChar

  • JSON

  • AVRO

  • ORC

  • PARQUET

    The following options are supported using the table properties:

    • parquet.compression.

Unsupported Hive Commands, Features, and Use Cases

The connector does not support the following Hive commands, features, and use cases:

  • Hive views

  • ALTER statements other than TOUCH, ADD COLUMNS, and DROP COLUMNS

  • Custom SerDe properties.

  • Modifying an existing managed Hive table to become an external Hive table, or vice versa

Refreshing External Table Metadata to Reflect Cloud Storage Events

When any of the Hive operations listed in Supported Hive Operations and Table Types (in this topic) are executed on a table, the Hive connector listens to the Hive events and subsequently refreshes the metadata for the corresponding external table in Snowflake.

However, the connector does not refresh the external table metadata based on events in cloud storage, such as adding or removing data files. To refresh the metadata for an external table to reflect events in the cloud storage, execute the respective ALTER TABLE … TOUCH command for your partitioned or unpartitioned Hive table. TOUCH reads the metadata and writes it back. For more information on the command, see the Hive documentation:

Partitioned Hive table

Execute the following command:

ALTER TABLE <table_name> TOUCH PARTITION <partition_spec>;
Unpartitioned Hive table

Execute the following command:

ALTER TABLE <table_name> TOUCH;

Differences Between Hive Tables and Snowflake External Tables

This section describes the main differences between Hive tables and Snowflake external tables.

Partitions
  • Snowflake partitions are composed of subpaths of the storage location referenced by the table, while Hive partitions do not have this constraint. If partitions are added in Hive tables that are not subpaths of the storage location, those partitions are not added to the corresponding external tables in Snowflake.

    For example, if the storage location associated with the Hive table (and corresponding Snowflake external table) is s3://path/, then all partition locations in the Hive table must also be prefixed by s3://path/.

  • Two Snowflake partitions in a single external table cannot point to the exact same storage location. For example, the following partitions conflict with each other:

    ALTER TABLE exttable ADD PARTITION (partcol='1') LOCATION 's3:///files/2019/05/12';
    
    ALTER TABLE exttable ADD PARTITION (partcol='2') LOCATION 's3:///files/2019/05/12';
    
Column names

Hive column names are case-insensitive, but Snowflake virtual columns derived from VALUES are case-sensitive. If Hive tables contain columns with mixed-case names, the data in those columns may be NULL in the corresponding columns in the Snowflake external tables.