Snowflake Iceberg Catalog SDK¶
This topic covers how to configure and use the Snowflake Iceberg Catalog SDK. The SDK is available for Apache Iceberg version 1.2.0 and later.
With the Snowflake Iceberg Catalog SDK, you can use the Apache Spark engine to query Iceberg tables.
Supported catalog operations¶
The SDK supports commands for browsing Iceberg metadata in Snowflake, such as listing namespaces and tables.
The SDK currently supports read operations (SELECT statements) only.
Install and connect¶
To install the Snowflake Iceberg Catalog SDK, download the latest version of the Apache Iceberg libraries (version 1.2.0 or later).
Before you can use the Snowflake Iceberg Catalog SDK, you need a Snowflake database with one or more Iceberg tables. To create an Iceberg table, see Create an Iceberg table.
After you establish a connection and the SDK confirms that Iceberg metadata is present, Snowflake accesses your Parquet data using the external volume associated with your Iceberg tables.
To read table data with the SDK, start by configuring the following properties for your Spark cluster:
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.13:1.2.0,net.snowflake:snowflake-jdbc:3.13.28 \
# Configure a catalog named "snowflake_catalog" using the standard Iceberg SparkCatalog adapter
--conf spark.sql.catalog.snowflake_catalog=org.apache.iceberg.spark.SparkCatalog \
# Specify the implementation of the named catalog to be Snowflake's Catalog implementation
--conf spark.sql.catalog.snowflake_catalog.catalog-impl=org.apache.iceberg.snowflake.SnowflakeCatalog \
# Provide a Snowflake JDBC URI with which the Snowflake Catalog will perform low-level communication with Snowflake services
--conf spark.sql.catalog.snowflake_catalog.uri=jdbc:snowflake://<account_identifier>.snowflakecomputing.com \
# Configure the Snowflake user on whose behalf to perform Iceberg metadata lookups
--conf spark.sql.catalog.snowflake_catalog.jdbc.user=<user_name> \
# Provide the user password. To configure the credentials, you can provide either password or private_key_file.
--conf spark.sql.catalog.snowflake_catalog.jdbc.password=<password> \
# Configure the private_key_file to use when connecting to Snowflake services; additional connection options can be found at https://docs.snowflake.com/en/user-guide/jdbc-configure.html
--conf spark.sql.catalog.snowflake_catalog.jdbc.private_key_file=<location of the private key>
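If you use PySpark instead of spark-shell, you can set the same properties when building the Spark session. The following Python sketch mirrors the configuration above; the account identifier, user name, and password values are placeholders to replace with your own:

from pyspark.sql import SparkSession

# Minimal PySpark equivalent of the spark-shell configuration above.
# <account_identifier>, <user_name>, and <password> are placeholders.
spark = (
    SparkSession.builder
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.3_2.13:1.2.0,"
        "net.snowflake:snowflake-jdbc:3.13.28",
    )
    .config("spark.sql.catalog.snowflake_catalog",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.snowflake_catalog.catalog-impl",
            "org.apache.iceberg.snowflake.SnowflakeCatalog")
    .config("spark.sql.catalog.snowflake_catalog.uri",
            "jdbc:snowflake://<account_identifier>.snowflakecomputing.com")
    .config("spark.sql.catalog.snowflake_catalog.jdbc.user", "<user_name>")
    .config("spark.sql.catalog.snowflake_catalog.jdbc.password", "<password>")
    .getOrCreate()
)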
You can use any Snowflake-supported JDBC driver connection parameter in your configuration by using the following syntax:

--conf spark.sql.catalog.<catalog_name>.jdbc.<parameter_name>=<value>
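For example, to pass the Snowflake JDBC driver's role parameter (the role name shown here is hypothetical):

--conf spark.sql.catalog.snowflake_catalog.jdbc.role=MY_READ_ONLY_ROLE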
After you configure your Spark cluster, you can check which tables are available to query. For example:
spark.sql("SHOW NAMESPACES IN my_database").show()
Then you can select a table to query:

spark.sql("SELECT * FROM my_database.my_schema.my_table").show()
You can use the DataFrame structure with languages like Python and Scala to query data:
df = spark.table("my_database.my_schema.my_table")
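Standard DataFrame operations then apply. For example, the following sketch filters and projects the table data; the column names region and amount are hypothetical:

# Filter rows and project columns with the DataFrame API
# (the "region" and "amount" column names are hypothetical)
df.filter(df["amount"] > 100).select("region", "amount").show()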
If you receive vectorized read errors while running queries, you can disable vectorized reads for your session by setting
spark.sql.iceberg.vectorization.enabled=false. To keep using vectorized reads,
work with your account team representative to set an account parameter.
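For example, you might append the following option when launching your Spark shell (a sketch; the same property can also be set through the Spark session builder):

--conf spark.sql.iceberg.vectorization.enabled=false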
When you issue a query, Snowflake caches the result within a certain time frame (90 seconds by default),
so you might experience latency up to that duration. If you plan to access data programmatically for comparison purposes,
you can set the spark.sql.catalog.<catalog_name>.cache-enabled property to false to disable caching.
If your application is designed to tolerate a specific amount of latency, you can use the following property
to specify the latency period in milliseconds:

spark.sql.catalog.<catalog_name>.cache.expiration-interval-ms=<value>
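For example, the following settings (a sketch that reuses the snowflake_catalog name from the earlier configuration) disable caching entirely or shorten the expiration interval to 30 seconds:

--conf spark.sql.catalog.snowflake_catalog.cache-enabled=false
--conf spark.sql.catalog.snowflake_catalog.cache.expiration-interval-ms=30000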
Limitations¶
The following limitations apply to the Snowflake Iceberg Catalog SDK and are subject to change:
Only Apache Spark is supported for reading Iceberg tables.
Time travel queries are not supported.
You cannot use the SDK to access non-Iceberg Snowflake tables.