Apache Iceberg tables with Snowpark Connect for Spark¶
Snowpark Connect for Spark supports reading and writing Apache Iceberg™ tables through the standard Spark DataFrame read/write APIs. The same APIs work for both Snowflake-managed and externally managed Iceberg tables. For general information about Iceberg tables in Snowflake, see Apache Iceberg™ tables.
Setup¶
Before you can create or write to Iceberg tables, configure an external volume and link it to your session in one of the following ways:
Set the EXTERNAL_VOLUME property on the database.
Set snowpark.connect.iceberg.external_volume in your Spark configuration (both approaches are sketched below).
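For example, a minimal sketch of both approaches, assuming an existing external volume named my_external_volume and a database named my_database (both placeholder names) and that your session can run the SQL statement:

```python
# Option 1: set the default external volume on the database (Snowflake SQL).
spark.sql("ALTER DATABASE my_database SET EXTERNAL_VOLUME = 'my_external_volume'")

# Option 2: set the Snowpark Connect configuration property for the session.
spark.conf.set("snowpark.connect.iceberg.external_volume", "my_external_volume")
```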
Reading Iceberg tables¶
Use the standard Spark table reader. The argument to load() is a Snowflake table identifier (for example, my_database.my_schema.my_table), not a file path or catalog URI.
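For example, a minimal read sketch; my_database.my_schema.my_table is a placeholder identifier, and using the iceberg format name is an assumption here:

```python
# Load an Iceberg table by Snowflake identifier (not a file path or catalog URI).
df = spark.read.format("iceberg").load("my_database.my_schema.my_table")
df.show()
```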
For externally managed tables, you must first create a Snowflake table entity that points to the external catalog before you can read from it. See Create an Apache Iceberg™ table in Snowflake.
Writing with the V1 API (DataFrameWriter)¶
The V1 DataFrameWriter API is the recommended way to write Iceberg tables. For the full list of modes, see the Spark DataFrameWriter documentation.
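For example, a minimal append with the V1 API; the table identifier is a placeholder, and the iceberg format name and saveAsTable() call are assumptions:

```python
# Append the DataFrame to an Iceberg table with the V1 DataFrameWriter API.
df.write.format("iceberg").mode("append").saveAsTable("my_database.my_schema.my_table")
```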
For dynamic partition overwrite, combine mode("overwrite") with option("overwrite-mode", "dynamic") and partitionBy().
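A sketch of dynamic partition overwrite, assuming a hypothetical partition column named region (identifier and format name are placeholders/assumptions):

```python
# Overwrite only the partitions that appear in the incoming DataFrame;
# other partitions in the table are left untouched.
(df.write.format("iceberg")
    .mode("overwrite")
    .option("overwrite-mode", "dynamic")
    .partitionBy("region")
    .saveAsTable("my_database.my_schema.my_table"))
```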
Standard SQL statements such as INSERT INTO and INSERT OVERWRITE also work.
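For example, hedged SQL equivalents (table and source names are placeholders):

```python
# Append rows with standard SQL.
spark.sql("INSERT INTO my_database.my_schema.my_table SELECT * FROM my_source_view")

# Replace the table contents with standard SQL.
spark.sql("INSERT OVERWRITE my_database.my_schema.my_table SELECT * FROM my_source_view")
```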
Write options¶
| Option | Description |
| --- | --- |
|  | Base storage location for table data. Aliases: |
|  | Catalog to use (defaults to |
|  | Controls how data files are serialized. Alias: |
|  | Target size for written data files. Valid values: |
|  | Set to |
|  | Set to |
Writing with the V2 API (DataFrameWriterV2)¶
The V2 DataFrameWriterV2 API is partially supported. The following operations work: create, append, replace, createOrReplace, overwrite, and overwritePartitions. However, several methods behave differently from open-source Spark.
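For example, a sketch of some of the supported V2 calls; the identifier is a placeholder and using("iceberg") is an assumption:

```python
# Create a new Iceberg table from the DataFrame's schema and data.
df.writeTo("my_database.my_schema.my_table").using("iceberg").create()

# Append to, or overwrite matching partitions of, an existing table.
df.writeTo("my_database.my_schema.my_table").append()
df.writeTo("my_database.my_schema.my_table").overwritePartitions()
```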
Tip
For the most predictable behavior, use the V1 DataFrameWriter API for Iceberg writes. The V1 API has full support for write modes, schema evolution, dynamic partition overwrite, and write options.
Snowflake-managed vs. externally managed tables¶
The read and write APIs are the same for both table types. The differences are in table creation and overwrite behavior:
Snowflake-managed tables: You can create tables through Snowpark Connect for Spark using the DataFrame write APIs. See Create an Apache Iceberg™ table in Snowflake.
Externally managed tables: An externally managed Iceberg table is one whose metadata lives in an external catalog (for example, AWS Glue or Polaris) rather than in Snowflake’s own catalog. The following applies only to those tables:
Create the table outside of Spark. Snowpark Connect for Spark can’t create externally managed tables on your behalf. Create the table once with SQL before any Spark code reads from or writes to it. See Create an Apache Iceberg™ table in Snowflake.
Overwrite uses an automatic fallback. A normal Spark overwrite would issue table recreation, which Snowflake rejects on externally managed tables. Snowpark Connect for Spark instead truncates the table and inserts new data. The table definition and its catalog integration are preserved.
Schema evolution (mergeSchema):
Append: Supported. New columns in your DataFrame are added to the table with ALTER ICEBERG TABLE … ADD COLUMN, provided Snowflake has permission to alter the table through the external catalog. See the sketch after this list.
Overwrite: Not supported. The DataFrame schema must already match the table. If it doesn’t, evolve the schema in the external catalog first or use a Snowflake-managed Iceberg table.
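A sketch of schema evolution on append, assuming the DataFrame contains a new top-level column and, for an externally managed table, that Snowflake can alter the table through the external catalog; the identifier, format name, and saveAsTable() call are placeholders/assumptions:

```python
# Append with schema evolution: new top-level columns in the DataFrame are added
# to the table (via ALTER ICEBERG TABLE ... ADD COLUMN) before the rows are written.
(df.write.format("iceberg")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("my_database.my_schema.my_table"))
```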
Known limitations¶
Using Spark SQL to create Iceberg tables isn’t supported. Use the DataFrame write APIs instead.
Schema evolution within STRUCT columns isn’t supported yet. mergeSchema can add new top-level columns but can’t add or modify nested fields inside an existing STRUCT column.
Time travel isn’t supported for either table type, including historical snapshot reads, branch reads, and incremental reads.