Query Apache Iceberg™ tables with an external engine through Snowflake Horizon Catalog¶
This preview introduces support for querying Snowflake-managed Apache Iceberg™ tables by using an external query engine through Snowflake Horizon Catalog. To provide this interoperability with external engines, Apache Polaris™ (incubating) is integrated into Horizon Catalog. In addition, Horizon Catalog exposes Apache Iceberg™ REST APIs, which let you read the tables by using external query engines.
To query Snowflake-managed Iceberg tables with an external query engine, you can use this feature instead of syncing Snowflake-managed Iceberg tables with Snowflake Open Catalog. For more information about Open Catalog, see Snowflake Open Catalog overview.
By connecting an external query engine to Iceberg tables through Horizon Catalog, you can perform the following tasks:
Use any external query engine that supports the open Iceberg REST protocol to query these tables, such as Apache Spark™.
Query any existing and new Snowflake-managed Iceberg tables in a new or existing Snowflake account by using a single Horizon Catalog endpoint.
Query the tables by using your existing users, roles, policies, and authentication in Snowflake.
Use vended credentials.
For more information about Snowflake Horizon Catalog, see Snowflake Horizon Catalog.
The following diagram shows external query engines reading Snowflake-managed Iceberg tables through Horizon Catalog and Snowflake reading and writing to these tables:
Billing¶
The Horizon Iceberg REST Catalog API is available in all Snowflake editions.
The API requests are billed as 0.5 credit per million calls and charged as Cloud Services.
For cross-region data access, standard cross-region data egress charges as stated in the Snowflake Service Consumption Table are applicable.
Note
Customers won’t be billed until this feature becomes generally available.
Before you begin¶
Retrieve the account identifier for your Snowflake account that contains the Iceberg tables that you want to query. For instructions, see Account identifiers. You specify this identifier when you connect an external query engine to your Iceberg tables.
Tip
To get your account identifier by using SQL, you can run the following command:
SELECT CURRENT_ORGANIZATION_NAME() || '-' || CURRENT_ACCOUNT_NAME();
Private connectivity (Optional)¶
For secure connectivity, consider configuring Inbound and Outbound private connectivity for your Snowflake account while you access the Horizon Catalog endpoint.
Note
Private connectivity is only supported for Snowflake-managed Iceberg tables stored on Amazon S3 or Azure Storage (ADLS).
Workflow for querying Iceberg tables by using an external query engine¶
To query Iceberg tables by using an external query engine, complete the following steps:
Create Iceberg tables.
Configure access control.
Obtain an access token for authentication.
Verify access token permissions.
Connect an external query engine to Iceberg tables through Horizon Catalog.
Query Iceberg tables.
Step 1: Create Iceberg tables¶
Important
If you already have Snowflake-managed Iceberg tables you want to query, you can skip this step.
In this step, you create Snowflake-managed Iceberg tables that use Snowflake as the catalog, so you can query them with an external query engine. For instructions, see the following topics:
Tutorial: Create your first Apache Iceberg™ table: A tutorial that shows how to create a database, create a Snowflake-managed Iceberg table, and load data into the table.
Create a Snowflake-managed Iceberg table: Example code for creating a Snowflake-managed Iceberg table.
Step 2: Configure access control¶
Important
Snowflake roles that include the hyphen character (-) in the role name aren’t supported when you access Iceberg tables through the Horizon Catalog endpoint.
If you already have roles that are configured with access to the Iceberg tables that you want to query, you can skip this step.
In this step, you configure access control for the Snowflake-managed Iceberg tables that you want to query with an external query engine. For example, you can set up the following roles in Snowflake:
DATA_ENGINEER role, which has access to all schemas and all Snowflake-managed Iceberg tables in a database.
DATA_ANALYST role, which has access to one schema in the database and only access to two Snowflake-managed Iceberg tables within that schema.
For instructions, see Configuring access control. For more information about access control in Snowflake, see Overview of Access Control.
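As a minimal sketch of the example roles described above, the following Python code uses the snowflake-connector-python package to create the two roles and grant them access. The database, schema, and table names (ANALYTICS, SALES, ORDERS, CUSTOMERS), the connection parameters, and the use of SECURITYADMIN are hypothetical placeholders for illustration, not the only supported setup.
# Minimal sketch: create the example roles with snowflake-connector-python.
# All object names and connection parameters below are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<admin_user>",
    password="<password>",
    role="SECURITYADMIN",
)
cur = conn.cursor()

statements = [
    # DATA_ENGINEER: access to all schemas and tables in the database
    "CREATE ROLE IF NOT EXISTS DATA_ENGINEER",
    "GRANT USAGE ON DATABASE ANALYTICS TO ROLE DATA_ENGINEER",
    "GRANT USAGE ON ALL SCHEMAS IN DATABASE ANALYTICS TO ROLE DATA_ENGINEER",
    "GRANT SELECT ON ALL TABLES IN DATABASE ANALYTICS TO ROLE DATA_ENGINEER",
    # DATA_ANALYST: access to one schema and only two tables within it
    "CREATE ROLE IF NOT EXISTS DATA_ANALYST",
    "GRANT USAGE ON DATABASE ANALYTICS TO ROLE DATA_ANALYST",
    "GRANT USAGE ON SCHEMA ANALYTICS.SALES TO ROLE DATA_ANALYST",
    "GRANT SELECT ON TABLE ANALYTICS.SALES.ORDERS TO ROLE DATA_ANALYST",
    "GRANT SELECT ON TABLE ANALYTICS.SALES.CUSTOMERS TO ROLE DATA_ANALYST",
]
for stmt in statements:
    cur.execute(stmt)
cur.close()
conn.close()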
Step 3: Obtain an access token for authentication¶
In this step, you obtain an access token, which you must have to authenticate to the Horizon Catalog endpoint for your Snowflake account. You need an access token for each user (service or human) and role that is configured with access to Snowflake-managed Iceberg tables. For example, you need one access token for a user with the DATA_ENGINEER role and another access token for a user with the DATA_ANALYST role.
You specify this access token later when you connect an external query engine to Iceberg tables through Horizon Catalog.
You can obtain an access token by using one of the following authentication options:
External OAuth¶
If you’re using External OAuth, generate an access token from your identity provider. For instructions, see External OAuth overview.
Note
For External OAuth, alternatively, you can configure your connection to the engine with automatic token refresh instead of specifying an access token.
Key-pair authentication¶
If you use key-pair authentication, to obtain an access token, you sign a JSON web token (JWT) with your private key.
The following steps cover how to generate an access token for key-pair authentication:
Step 1: Configure key-pair authentication¶
In this step, you perform the following tasks:
Generate a private key
Generate a public key
Store the private and public keys securely
Grant the privilege to assign a public key to a Snowflake user
Assign the public key to a Snowflake user
Verify the user’s public key fingerprint
For instructions, see Configuring key-pair authentication.
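Key generation is typically done with OpenSSL, as described in the linked instructions. As an alternative, the following is a hedged Python sketch that uses the cryptography package to generate an unencrypted PKCS#8 private key and the matching public key, and prints the public-key body that you would assign to the user with ALTER USER ... SET RSA_PUBLIC_KEY. The file names and user name are placeholders.
# Minimal sketch: generate an RSA key pair for Snowflake key-pair authentication.
# Assumes the cryptography package is installed (pip install cryptography).
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa

# Generate a 2048-bit RSA private key
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# Write the private key in unencrypted PKCS#8 PEM format (for example, rsa_key.p8)
with open("rsa_key.p8", "wb") as f:
    f.write(
        private_key.private_bytes(
            encoding=serialization.Encoding.PEM,
            format=serialization.PrivateFormat.PKCS8,
            encryption_algorithm=serialization.NoEncryption(),
        )
    )

# Write the matching public key in PEM format (for example, rsa_key.pub)
public_pem = private_key.public_key().public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
)
with open("rsa_key.pub", "wb") as f:
    f.write(public_pem)

# Print the key body (without the BEGIN/END lines) for:
#   ALTER USER my_service_user SET RSA_PUBLIC_KEY='<key_body>';
key_body = "".join(public_pem.decode().splitlines()[1:-1])
print(key_body)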
Step 2: Grant a role to the user¶
Run the GRANT ROLE command to grant the Snowflake role that has privileges on the tables you want to query to the key-pair authentication user. For example, to grant the ENGINEER role to the my_service_user user, run the following command:
GRANT ROLE ENGINEER TO USER my_service_user;
Step 3: Generate a JSON Web Token (JWT)¶
In this step, you use SnowSQL to generate a JSON Web Token (JWT) for key-pair authentication.
Note
You must have SnowSQL installed on your machine.
Alternatively, you can use Python, Snowflake CLI, Java, or Node.js to generate a JWT. A hedged Python sketch is shown after the SnowSQL example below.
Use SnowSQL to generate a JWT:
snowsql --private-key-path "<private_key_file>" \
--generate-jwt \
-h "<account_identifier>.snowflakecomputing.com" \
-a "<account_locator>" \
-u "<user_name>"
Where:
<private_key_file> is the path to your private key file that corresponds to the public key assigned to your Snowflake user. For example: /Users/jsmith/.ssh/rsa_key.p8.
<account_identifier> is the account identifier for your Snowflake account, in the format <organization_name>-<account_name>. To find the account identifier, see Before you begin. An example of an account identifier is myorg-myaccount.
<account_locator> is the account locator for your Snowflake account. To find your account locator, see Locate your Snowflake account information in Snowsight and view the Account locator in the Account Details dialog.
<user_name> is the user name for a Snowflake user with the public key assigned to the user.
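As mentioned above, you can also generate the JWT in Python. The following is a minimal sketch that assumes the PyJWT and cryptography packages and follows Snowflake's key-pair JWT convention (issuer of the form <account>.<user>.SHA256:<public_key_fingerprint>, subject <account>.<user>, both uppercase). The account, user, and file path are placeholders; verify the exact claim values against the SnowSQL output for your account.
# Minimal sketch: generate a Snowflake key-pair JWT in Python.
# Assumes the PyJWT and cryptography packages (pip install pyjwt cryptography).
from datetime import datetime, timedelta, timezone
import base64
import hashlib
import jwt  # PyJWT
from cryptography.hazmat.primitives import serialization

ACCOUNT = "<ACCOUNT_LOCATOR>"                   # uppercase account locator (the -a value above)
USER = "<USER_NAME>"                            # uppercase Snowflake user name
PRIVATE_KEY_FILE = "/Users/jsmith/.ssh/rsa_key.p8"

with open(PRIVATE_KEY_FILE, "rb") as f:
    private_key = serialization.load_pem_private_key(f.read(), password=None)

# SHA-256 fingerprint of the public key that is assigned to the Snowflake user
public_key_der = private_key.public_key().public_bytes(
    serialization.Encoding.DER, serialization.PublicFormat.SubjectPublicKeyInfo
)
fingerprint = "SHA256:" + base64.b64encode(hashlib.sha256(public_key_der).digest()).decode()

qualified_user = f"{ACCOUNT}.{USER}"
now = datetime.now(timezone.utc)
payload = {
    "iss": f"{qualified_user}.{fingerprint}",   # issuer includes the public key fingerprint
    "sub": qualified_user,
    "iat": now,
    "exp": now + timedelta(minutes=59),         # JWT lifetime of up to one hour
}
print(jwt.encode(payload, private_key, algorithm="RS256"))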
Step 4: Generate an access token¶
Important
To generate an access token, you must first generate a JWT. You pass the JWT as the client_secret value when you request the access token.
Use a curl command to generate an access token:
curl -i --fail -X POST "https://<account_identifier>.snowflakecomputing.com/polaris/api/catalog/v1/oauth/tokens" \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'grant_type=client_credentials' \
--data-urlencode 'scope=session:role:<role>' \
--data-urlencode 'client_secret=<JWT_token>'
Where:
<account_identifier> is the account identifier for your Snowflake account, in the format <organization_name>-<account_name>. To find the account identifier, see Before you begin. An example of an account identifier is myorg-myaccount.
<role> is the Snowflake role that is granted access to Iceberg tables, such as ENGINEER.
<JWT_token> is the JWT that you generated in the previous step.
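If you prefer to script this token request instead of using curl, the following is a minimal sketch that uses the requests package. The placeholders match the curl example above, and the access_token field name follows the standard OAuth token response.
# Minimal sketch: exchange the JWT for an access token with the requests package.
import requests

ACCOUNT_IDENTIFIER = "<account_identifier>"   # for example: myorg-myaccount
ROLE = "<role>"                               # for example: ENGINEER
JWT_TOKEN = "<JWT_token>"                     # the JWT from the previous step

response = requests.post(
    f"https://{ACCOUNT_IDENTIFIER}.snowflakecomputing.com/polaris/api/catalog/v1/oauth/tokens",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    data={
        "grant_type": "client_credentials",
        "scope": f"session:role:{ROLE}",
        "client_secret": JWT_TOKEN,
    },
)
response.raise_for_status()
access_token = response.json()["access_token"]  # standard OAuth token response field
print(access_token)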
Programmatic access token (PAT)¶
If you use PATs, generate a PAT for authentication.
First, you generate a PAT, which you use to connect an external query engine to Iceberg tables. Then, you generate an access token, which you only use to verify the permissions for your PAT.
Step 1: Generate a PAT¶
For instructions on how to configure and generate a PAT, see Using programmatic access tokens for authentication.
Step 2: Generate an access token for your PAT¶
In this step, you generate an access token for your PAT.
Attention
The access token that you generate in this step is used only to verify the permissions for your PAT. When you connect an external query engine to Iceberg tables, you must specify the PAT that you generated in the previous step, not the access token that you generate in this step.
Use a curl command to generate an access token for your PAT:
curl -i --fail -X POST "https://<account_identifier>.snowflakecomputing.com/polaris/api/catalog/v1/oauth/tokens" \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'grant_type=client_credentials' \
--data-urlencode 'scope=session:role:<role>' \
--data-urlencode 'client_secret=<PAT_token>'
Where:
<account_identifier> is the account identifier for your Snowflake account, in the format <organization_name>-<account_name>. To find the account identifier, see Before you begin. An example of an account identifier is myorg-myaccount.
<role> is the Snowflake role that is granted to your PAT and has access to the Iceberg tables you want to query, such as ENGINEER.
<PAT_token> is the value of the PAT that you generated in the previous step.
Step 4: Verify access token permissions¶
In this step, you verify the permissions for the access token that you obtained in the previous step.
Verify access to the Horizon IRC endpoint¶
Use a curl command to verify that you have permission to access your Horizon IRC endpoint:
curl -i --fail -X GET "https://<account_identifier>.snowflakecomputing.com/polaris/api/catalog/v1/config?warehouse=<database_name>" \
-H "Authorization: Bearer <access_token>" \
-H "Content-Type: application/json"
Where:
<account_identifier> is the account identifier for your Snowflake account, in the format <organization_name>-<account_name>. To find the account identifier, see Before you begin. An example of an account identifier is myorg-myaccount.
<access_token> is the access token that you generated. If you’re using a PAT, this value is the access token you generated, not the personal access token (PAT) you generated.
<database_name> is the name of the database you want to query.
Important
You must specify the database name in all capital letters, even if it was created with lowercase letters.
Example return value:
{
"defaults": {
"default-base-location": ""
},
"overrides": {
"prefix": "MY-DATABASE"
}
}
Retrieve the metadata for a table¶
You can also make a GET request to retrieve the metadata for a table. This request uses the Iceberg REST loadTable operation to load the table metadata from Horizon Catalog.
curl -i --fail -X GET "https://<account_identifier>.snowflakecomputing.com/polaris/api/catalog/v1/<database_name>/namespaces/<namespace_name>/tables/<table_name>" \
-H "Authorization: Bearer <access_token>" \
-H "Content-Type: application/json"
Where:
<account_identifier> is the account identifier for your Snowflake account, in the format <organization_name>-<account_name>. To find the account identifier, see Before you begin. An example of an account identifier is myorg-myaccount.
<database_name> is the database of the table whose metadata you want to retrieve.
<namespace_name> is the namespace of the table whose metadata you want to retrieve.
<table_name> is the table whose metadata you want to retrieve.
<access_token> is the access token that you generated. If you’re using a PAT, this value is the access token you generated, not the personal access token (PAT) you generated.
Important
You must specify the database, namespaces, and table names in all capital letters, even if the object was created with lowercase letters.
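Both verification calls can also be scripted. The following is a minimal sketch that uses the requests package to call the config endpoint and then load a table, printing the metadata-location field from the Iceberg REST LoadTableResult. All names are placeholders and must be uppercase as noted above.
# Minimal sketch: verify access and load table metadata through the Horizon IRC endpoint.
import requests

ACCOUNT_IDENTIFIER = "<account_identifier>"
DATABASE = "<DATABASE_NAME>"      # all capital letters
NAMESPACE = "<NAMESPACE_NAME>"    # all capital letters
TABLE = "<TABLE_NAME>"            # all capital letters
ACCESS_TOKEN = "<access_token>"

base = f"https://{ACCOUNT_IDENTIFIER}.snowflakecomputing.com/polaris/api/catalog"
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}", "Content-Type": "application/json"}

# Verify access to the Horizon IRC endpoint
config = requests.get(f"{base}/v1/config", params={"warehouse": DATABASE}, headers=headers)
config.raise_for_status()
print(config.json())

# Retrieve the metadata for a table (Iceberg REST loadTable)
table = requests.get(
    f"{base}/v1/{DATABASE}/namespaces/{NAMESPACE}/tables/{TABLE}", headers=headers
)
table.raise_for_status()
print(table.json().get("metadata-location"))  # location of the current table metadata file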
Step 5: Connect an external query engine to Iceberg tables through Horizon Catalog¶
In this step, you connect an external query engine to Iceberg tables through Horizon Catalog. This connection allows you to query the tables by using the external query engine.
The external engines use the Apache Iceberg™ REST endpoint exposed by Snowflake. For your Snowflake account, this endpoint is in the following format:
https://<account_identifier>.snowflakecomputing.com/polaris/api/catalog
The example code in this step shows how to set up a connection in Spark by using PySpark. Choose the option that matches your authentication method:
Connect by using External OAuth or key-pair authentication¶
Use the following example code to connect the external query engine to Iceberg tables by using External OAuth or key-pair authentication:
from pyspark.sql import SparkSession

# Snowflake Horizon Catalog configuration; change it as needed for your environment
CATALOG_URI = "https://<account_identifier>.snowflakecomputing.com/polaris/api/catalog"
HORIZON_SESSION_ROLE = "session:role:<role>"
CATALOG_NAME = "<database_name>"  # Provide in UPPER CASE
# Cloud service provider region configuration (where the Iceberg data is stored)
REGION = "eastus2"
# Paste the access token that you generated in Snowflake here
ACCESS_TOKEN = "<your_access_token>"
# Iceberg version
ICEBERG_VERSION = "1.9.1"
def create_spark_session():
"""Create and configure Spark session for Snowflake Iceberg access."""
spark = (
SparkSession.builder
.appName("SnowflakeIcebergReader")
.master("local[*]")
# JAR dependencies for Iceberg and cloud storage (AWS bundle by default)
.config(
"spark.jars.packages",
f"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{ICEBERG_VERSION},"
f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VERSION}"
# For Azure storage, uncomment the package below and comment out the AWS bundle above
# f"org.apache.iceberg:iceberg-azure-bundle:{ICEBERG_VERSION}"
)
# Iceberg SQL Extensions
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
.config("spark.sql.defaultCatalog", CATALOG_NAME)
# Horizon REST Catalog Configuration
.config(f"spark.sql.catalog.{CATALOG_NAME}", "org.apache.iceberg.spark.SparkCatalog")
.config(f"spark.sql.catalog.{CATALOG_NAME}.type", "rest")
.config(f"spark.sql.catalog.{CATALOG_NAME}.uri", CATALOG_URI)
.config(f"spark.sql.catalog.{CATALOG_NAME}.warehouse", CATALOG_NAME)
.config(f"spark.sql.catalog.{CATALOG_NAME}.token", ACCESS_TOKEN)
.config(f"spark.sql.catalog.{CATALOG_NAME}.scope", HORIZON_SESSION_ROLE)
.config(f"spark.sql.catalog.{CATALOG_NAME}.client.region", REGION)
# Required for vended credentials
.config(f"spark.sql.catalog.{CATALOG_NAME}.header.X-Iceberg-Access-Delegation", "vended-credentials")
.config("spark.sql.iceberg.vectorization.enabled", "false")
.getOrCreate()
)
spark.sparkContext.setLogLevel("ERROR")
return spark
Where:
<account_identifier> is the account identifier for the Snowflake account that contains the Iceberg tables that you want to query. To find this identifier, see Before you begin.
<your_access_token> is the access token that you obtained. To obtain it, see Step 3: Obtain an access token for authentication.
Note
For External OAuth, alternatively, you can configure your connection to the engine with automatic token refresh instead of specifying an access token.
<database_name> is the name of the database in your Snowflake account that contains the Snowflake-managed Iceberg tables that you want to query.
Note
The .warehouse property in Spark expects your Snowflake database name, not your Snowflake warehouse name.
<role> is the role in Snowflake that is configured with access to the Iceberg tables that you want to query. For example: DATA_ENGINEER.
Important
By default, the code example is set up for Apache Iceberg™ tables stored on Amazon S3. If your Iceberg tables are stored on Azure Storage (ADLS), perform the following steps:
Comment out the following line:
f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VERSION}"Uncomment the following line:
# f"org.apache.iceberg:iceberg-azure-bundle:{ICEBERG_VERSION}"
Connect by using a programmatic access token (PAT)¶
Use the following example code to connect the external query engine to Iceberg tables by using a programmatic access token (PAT):
from pyspark.sql import SparkSession

# Snowflake Horizon Catalog configuration; change it as needed for your environment
CATALOG_URI = "https://<account_identifier>.snowflakecomputing.com/polaris/api/catalog"
HORIZON_SESSION_ROLE = "session:role:<role>"
CATALOG_NAME = "<database_name>"  # Provide in UPPER CASE
# Cloud service provider region configuration (where the Iceberg data is stored)
REGION = "eastus2"
# Paste the PAT that you generated in Snowflake here
PAT_TOKEN = "<your_PAT_token>"
# Iceberg version
ICEBERG_VERSION = "1.9.1"
def create_spark_session():
"""Create and configure Spark session for Snowflake Iceberg access."""
spark = (
SparkSession.builder
.appName("SnowflakeIcebergReader")
.master("local[*]")
# JAR dependencies for Iceberg and cloud storage (AWS bundle by default)
.config(
"spark.jars.packages",
f"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{ICEBERG_VERSION},"
f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VERSION}"
# For Azure storage, uncomment the package below and comment out the AWS bundle above
# f"org.apache.iceberg:iceberg-azure-bundle:{ICEBERG_VERSION}"
)
# Iceberg SQL Extensions
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
.config("spark.sql.defaultCatalog", CATALOG_NAME)
# Horizon REST Catalog Configuration
.config(f"spark.sql.catalog.{CATALOG_NAME}", "org.apache.iceberg.spark.SparkCatalog")
.config(f"spark.sql.catalog.{CATALOG_NAME}.type", "rest")
.config(f"spark.sql.catalog.{CATALOG_NAME}.uri", CATALOG_URI)
.config(f"spark.sql.catalog.{CATALOG_NAME}.warehouse", CATALOG_NAME)
.config(f"spark.sql.catalog.{CATALOG_NAME}.credential", PAT_TOKEN)
.config(f"spark.sql.catalog.{CATALOG_NAME}.scope", HORIZON_SESSION_ROLE)
.config(f"spark.sql.catalog.{CATALOG_NAME}.client.region", REGION)
# Required for vended credentials
.config(f"spark.sql.catalog.{CATALOG_NAME}.header.X-Iceberg-Access-Delegation", "vended-credentials")
.config("spark.sql.iceberg.vectorization.enabled", "false")
.getOrCreate()
)
spark.sparkContext.setLogLevel("ERROR")
return spark
Where:
<account_identifier> is the account identifier for the Snowflake account that contains the Iceberg tables that you want to query. To find this identifier, see Before you begin.
<your_PAT_token> is the PAT that you obtained. To obtain it, see Step 3: Obtain an access token for authentication.
<role> is the role in Snowflake that is configured with access to the Iceberg tables that you want to query. For example: DATA_ENGINEER.
<database_name> is the name of the database in your Snowflake account that contains the Snowflake-managed Iceberg tables that you want to query.
Note
The .warehouse property in Spark expects your Snowflake database name, not your Snowflake warehouse name.
Important
By default, the code example is set up for Apache Iceberg™ tables stored on Amazon S3. If your Iceberg tables are stored on Azure Storage (ADLS), perform the following steps:
Comment out the following line:
f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VERSION}"Uncomment the following line:
# f"org.apache.iceberg:iceberg-azure-bundle:{ICEBERG_VERSION}"
Step 6: Query Iceberg tables¶
This step provides the following code examples for using Apache Spark™ to query Iceberg tables:
Show namespaces
Use a namespace
Show tables
Query a table
Show namespaces¶
spark.sql("show namespaces").show()
Use namespace¶
spark.sql("use namespace <your_schema_name_in_snowflake>")
Show tables¶
spark.sql("show tables").show()
Query a table¶
spark.sql("use namespace spark_demo")
spark.sql("select * from <your_table_name_in_snowflake>").show()
Considerations for querying Iceberg tables with an external query engine¶
Consider the following items when you query Iceberg tables with an external query engine:
For tables in Snowflake:
Only Snowflake-managed Iceberg tables are supported.
Querying remote or externally managed Iceberg tables (including Delta Direct and Parquet Direct tables) and Snowflake native tables isn’t supported.
You can query but can’t write to Iceberg tables.
External reads are supported only for Iceberg format version 2 or earlier.
This feature is only supported for Snowflake-managed Iceberg tables stored on Amazon S3, Google Cloud Storage, or Azure Storage, in all public cloud regions. S3-compatible non-AWS storage isn’t yet supported.
You can’t query an Iceberg table through the Horizon Iceberg REST API if the following fine-grained access control (FGAC) policies are defined on the table:
Row access policies
Column-level security
Snowflake roles that include the hyphen character (-) in the role name aren’t supported when you access Iceberg tables through the Horizon Catalog endpoint.
Explicitly granting the Horizon Catalog endpoint access to your storage accounts isn’t supported. We recommend that you use private connectivity for secure connectivity from external engines to Horizon Catalog and from Horizon Catalog to your storage account.