Query Apache Iceberg™ tables with an external engine through Snowflake Horizon Catalog

This preview introduces support for querying Snowflake-managed Apache Iceberg™ tables by using an external query engine through Snowflake Horizon Catalog. To ensure this interoperability with external engines, Apache Polaris™ (incubating) is integrated into Horizon Catalog. In addition, Horizon Catalog exposes Apache Iceberg™ REST APIs, which lets you to read the tables by using external query engines.

To query Snowflake-managed Iceberg tables with an external query engine, you can use this feature instead of syncing Snowflake-managed Iceberg tables with Snowflake Open Catalog. For more information about Open Catalog, see Snowflake Open Catalog overview.

By connecting an external query engine to Iceberg tables through Horizon Catalog, you can perform the following tasks:

  • Use any external query engine that supports the open Iceberg REST protocol to query these tables, such as Apache Spark™.

  • Query any existing and new Snowflake-managed Iceberg tables in a new or existing Snowflake account by using a single Horizon Catalog endpoint.

  • Query the tables by using your existing users, roles, policies, and authentication in Snowflake.

  • Use vended credentials.

For more information about Snowflake Horizon Catalog, see Snowflake Horizon Catalog.

The following diagram shows external query engines reading Snowflake-managed Iceberg tables through Horizon Catalog and Snowflake reading and writing to these tables:

Diagram that shows external query engines reading Snowflake-managed Iceberg tables through Horizon Catalog and Snowflake reading and writing to these tables.

Billing

  • The Horizon Iceberg REST Catalog API is available in all Snowflake editions.

  • The API requests are billed as 0.5 credit per million calls and charged as Cloud Services.

  • For cross-region data access, standard cross-region data egress charges as stated in the Snowflake Service Consumption Table are applicable.

Note

Customers won’t be billed until this feature becomes generally available.

Before you begin

Retrieve the account identifier for your Snowflake account that contains the Iceberg tables that you want to query. For instructions, see Account identifiers. You specify this identifier when you connect an external query engine to your Iceberg tables.

Tip

To get your account identifier by using SQL, you can run the following command:

SELECT CURRENT_ORGANIZATION_NAME() || '-' || CURRENT_ACCOUNT_NAME();
Copy

Private connectivity (Optional)

For secure connectivity, consider configuring Inbound and Outbound private connectivity for your Snowflake account while you access the Horizon Catalog endpoint.

Note

Private connectivity is only supported for Snowflake-managed Iceberg tables stored on Amazon S3 or Azure Storage (ADLS).

Workflow for querying Iceberg tables by using an external query engine

To query Iceberg tables by using an external query engine, complete the following steps:

  1. Create Iceberg tables.

  2. Configure access control.

  3. Obtain an access token for authentication.

  4. Connect an external query engine to Iceberg tables through Horizon Catalog.

  5. Query Iceberg tables.

Step 1: Create Iceberg tables

Important

If you already have Snowflake-managed Iceberg tables you want to query, you can skip this step.

In this step, you create Snowflake-managed Iceberg tables that use Snowflake as the catalog, so you can query them with an external query engine. For instructions, see the following topics:

Step 2: Configure access control

Important

  • Snowflake roles that include the hyphen character (-) in the role name aren’t supported when you access Iceberg tables through the Horizon Catalog endpoint.

  • If you already have roles that are configured with access to the Iceberg tables that you want to query, you can skip this step.

In this step, you configure access control for the Snowflake-managed Iceberg tables that you want to query with an external query engine. For example, you can set up the following roles in Snowflake:

  • DATA_ENGINEER role, which has access to all schemas and all Snowflake-managed Iceberg tables in a database.

  • DATA_ANALYST role, which has access to one schema in the database and only access to two Snowflake-managed Iceberg tables within that schema.

For instructions, see Configuring access control. For more information about access control in Snowflake, see Overview of Access Control.

Step 3: Obtain an access token for authentication

In this step, you obtain an access token, which you must have to authenticate to the Horizon Catalog endpoint for your Snowflake account. You need to obtain an access token for each user — service or human — and role that is configured with access to Snowflake-managed Iceberg tables. For example, you need to obtain one access token for a user with DATA_ENGINEER role and another user with a DATA_ANALYST role.

You specify this access token later when you connect an external query engine to Iceberg tables through Horizon Catalog,

You can obtain an access token by using one of the following authentication options:

  • External OAuth: If you’re using External OAuth, generate an access token for your identity provider. For instructions, see External OAuth overview.

    Note

    For External OAuth, alternatively, you can configure your connection to the engine with automatic token refresh instead of specifying an access token.

  • Key-pair authentication: If you use key-pair authentication, to obtain an access token, you sign a JSON web token (JWT) with your private key. For instructions, see Key-pair authentication.

  • Programmatic access token (PAT): If you use PATs, generate a PAT for authentication. For instructions, see Using programmatic access tokens for authentication

Step 4: Connect an external query engine to Iceberg tables through Horizon Catalog

In this step, you connect an external query engine to Iceberg tables through Horizon Catalog. This connection allows you to query the tables by using the external query engine.

The external engines use the Apache Iceberg™ REST endpoint exposed by Snowflake. For your Snowflake account, this endpoint is in the following format:

https://<accountidentifier>.snowflakecomputing.com/polaris/api/catalog
Copy

The example code in this step shows how to set up a connection in Spark, and the example code is in PySpark. For more information, see the following topics:

Connect by using External OAuth or key pair authentication

Use the following example code to connect the external query engine to Iceberg tables by using External OAuth or key pair authentication:

# Snowflake Horizon Catalog Configuration, change as per your environment

CATALOG_URI = "https://<accountidentifier>.snowflakecomputing.com/polaris/api/catalog"
HORIZON_SESSION_ROLE = f"session:role:<role>"
CATALOG_NAME = "<database_name>" #provide in UPPER CASE

# Cloud Service Provider Region Configuration (where the Iceberg data is stored)
REGION = "eastus2"

# Paste the External Oauth Access token that you generated in Snowflake here
ACCESS_TOKEN = "<your_access_token>"

# Iceberg Version
ICEBERG_VERSION = "1.9.1"

def create_spark_session():
  """Create and configure Spark session for Snowflake Iceberg access."""
  spark = (
      SparkSession.builder
      .appName("SnowflakeIcebergReader")
      .master("local[*]")

# JAR Dependencies for Iceberg and Azure
      .config(
          "spark.jars.packages",
          f"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{ICEBERG_VERSION},"
          f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VERSION}"
          # for Azure storage, use the below package and comment above azure bundle
          # f"org.apache.iceberg:iceberg-azure-bundle:{ICEBERG_VERSION}"
      )

      # Iceberg SQL Extensions
      .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.defaultCatalog", CATALOG_NAME)

      # Horizon REST Catalog Configuration
      .config(f"spark.sql.catalog.{CATALOG_NAME}", "org.apache.iceberg.spark.SparkCatalog")
      .config(f"spark.sql.catalog.{CATALOG_NAME}.type", "rest")
      .config(f"spark.sql.catalog.{CATALOG_NAME}.uri", CATALOG_URI)
      .config(f"spark.sql.catalog.{CATALOG_NAME}.warehouse", CATALOG_NAME)
      .config(f"spark.sql.catalog.{CATALOG_NAME}.token", ACCESS_TOKEN)
      .config(f"spark.sql.catalog.{CATALOG_NAME}.scope", HORIZON_SESSION_ROLE)
      .config(f"spark.sql.catalog.{CATALOG_NAME}.client.region", REGION)

      # Required for vended credentials
      .config(f"spark.sql.catalog.{CATALOG_NAME}.header.X-Iceberg-Access-Delegation", "vended-credentials")
      .config("spark.sql.iceberg.vectorization.enabled", "false")
      .getOrCreate()
  )
  spark.sparkContext.setLogLevel("ERROR")
  return spark
Copy

Where:

  • <accountidentifier> is your Snowflake account identifier for the Snowflake account that contains the Iceberg tables that you want to query. To find this identifier, see Before you begin.

  • <your_access_token> is your access token that you obtained. To obtain it, see Step 3: Obtain an access token for authentication.

    Note

    For External OAuth, alternatively, you can configure your connection to the engine with automatic token refresh instead of specifying an access token.

  • <database_name> is the name of the database in your Snowflake account that contains Snowflake-managed Iceberg tables that you want to query.

    Note

    The .warehouse property in Spark expects your Snowflake database name, not your Snowflake warehouse name.

  • <role> is the role in Snowflake that is configured with access to the Iceberg tables that you want to query. For example: DATA_ENGINEER.

Important

By default, the code example is set up for Apache Iceberg™ tables stored on Amazon S3. If your Iceberg tables are stored on Azure Storage (ADLS), perform the following steps:

  1. Comment out the following line: f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VERSION}"

  2. Uncomment the following line: # f"org.apache.iceberg:iceberg-azure-bundle:{ICEBERG_VERSION}"

Connect by using a programmatic access token (PAT)

Use the following example code to connect the external query engine to Iceberg tables by using a programmatic access token (PAT):

# Snowflake Horizon Catalog Configuration, change as per your environment

CATALOG_URI = "https://<accountidentifier>.snowflakecomputing.com/polaris/api/catalog"
HORIZON_SESSION_ROLE = f"session:role:<role>"
CATALOG_NAME = "<database_name>" #provide in UPPER CASE

# Cloud Service Provider Region Configuration (where the Iceberg data is stored)
REGION = "eastus2"

# Paste the PAT you generated in Snowflake here
PAT_TOKEN = "<your_PAT_token>"

# Iceberg Version
ICEBERG_VERSION = "1.9.1"

def create_spark_session():
  """Create and configure Spark session for Snowflake Iceberg access."""
  spark = (
      SparkSession.builder
      .appName("SnowflakeIcebergReader")
      .master("local[*]")

# JAR Dependencies for Iceberg and Azure
      .config(
          "spark.jars.packages",
          f"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{ICEBERG_VERSION},"
          f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VERSION}"
          # for Azure storage, use the below package and comment above azure bundle
          # f"org.apache.iceberg:iceberg-azure-bundle:{ICEBERG_VERSION}"
      )

      # Iceberg SQL Extensions
      .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.defaultCatalog", CATALOG_NAME)

      # Horizon REST Catalog Configuration
      .config(f"spark.sql.catalog.{CATALOG_NAME}", "org.apache.iceberg.spark.SparkCatalog")
      .config(f"spark.sql.catalog.{CATALOG_NAME}.type", "rest")
      .config(f"spark.sql.catalog.{CATALOG_NAME}.uri", CATALOG_URI)
      .config(f"spark.sql.catalog.{CATALOG_NAME}.warehouse", CATALOG_NAME)
      .config(f"spark.sql.catalog.{CATALOG_NAME}.credential", ACCESS_TOKEN)
      .config(f"spark.sql.catalog.{CATALOG_NAME}.scope", HORIZON_SESSION_ROLE)
      .config(f"spark.sql.catalog.{CATALOG_NAME}.client.region", REGION)

      # Required for vended credentials
      .config(f"spark.sql.catalog.{CATALOG_NAME}.header.X-Iceberg-Access-Delegation", "vended-credentials")
      .config("spark.sql.iceberg.vectorization.enabled", "false")
      .getOrCreate()
  )
  spark.sparkContext.setLogLevel("ERROR")
  return spark
Copy

Where:

  • <accountidentifier> is your Snowflake account identifier for the Snowflake account that contains the Iceberg tables that you want to query. To find this identifier, see Before you begin.

  • <your_PAT_token> is your PAT that you obtained. To obtain it, see Step 3: Obtain an access token for authentication.

  • <role> is the role in Snowflake that is configured with access to the Iceberg tables that you want to query. For example: DATA_ENGINEER.

  • <database_name> is the name of the database in your Snowflake account that contains Snowflake-managed Iceberg tables that you want to query.

    Note

    The .warehouse property in Spark expects your Snowflake database name, not your Snowflake warehouse name.

Important

By default, the code example is set up for Apache Iceberg™ tables stored on Amazon S3. If your Iceberg tables are stored on Azure Storage (ADLS), perform the following steps:

  1. Comment out the following line: f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VERSION}"

  2. Uncomment the following line: # f"org.apache.iceberg:iceberg-azure-bundle:{ICEBERG_VERSION}"

Step 5: Query Iceberg tables

This step provides the following code examples for using Apache Spark™ to query Iceberg tables:

  • Show namespaces

  • Use namespaces

  • Show tables

  • Query a table

Show namespaces

spark.sql("show namespaces").show()
Copy

Use namespace

spark.sql("use namespace <your_schema_name_in_snowflake>")
Copy

Show tables

spark.sql("show tables").show()
Copy

Query a table

spark.sql("use namespace spark_demo")
spark.sql("select * from <your_table_name_in_snowflake>").show()
Copy

Considerations for querying Iceberg tables with an external query engine

Consider the following items when you query Iceberg tables with an external query engine:

  • For tables in Snowflake:

    • Only Snowflake-managed Iceberg tables are supported.

    • Querying remote or externally managed Iceberg tables including Delta Direct and Parquet Direct tables and Snowflake native tables, aren’t supported.

  • You can query but can’t write to Iceberg tables.

  • The external reads are supported only on Iceberg version 2 or earlier.

  • This feature is only supported for Snowflake-managed Iceberg tables stored on Amazon S3, Google Cloud, or Azure for all public cloud regions. S3-compatible non-AWS storage is not yet supported.

  • You can’t query an Iceberg table through the Horizon Iceberg REST API if the following fine-grained access control (FGAC) policies are defined on the table:

    • Row access policies

    • Column-level security

  • Snowflake roles that include the hyphen character (-) in the role name aren’t supported when you access Iceberg tables through the Horizon Catalog endpoint.

  • Explicitly granting the Horizon Catalog endpoint access to your storage accounts isn’t supported. We recommend that you use private connectivity for secure connectivity from external engines to Horizon Catalog and from Horizon Catalog to storage account.