Add Parquet files in-place to a Snowflake Managed Apache Iceberg™ table¶
Bemerkung
This feature is in private preview. To request access, contact your Snowflake account team.
Zero-Copy Parquet to Managed Iceberg lets you register existing Apache Parquet files in-place with a
Snowflake-managed Apache Iceberg™ table without copying or moving the data. This private preview feature
uses the LOAD_MODE = ADD_FILES_REFERENCE option with the COPY INTO <Tabelle>
command or Snowpipe to detect new Parquet files in-place and generate corresponding Iceberg metadata.
Overview¶
When to use this feature¶
Use Zero-Copy Parquet to Managed Iceberg when:
You’re migrating from a Hive or Spark-based data lake: You have existing Parquet files in cloud storage and want to bring them into a fully managed Snowflake Iceberg table—with DML, optimization, and platform integrations—without reorganizing or copying your data.
External systems produce Parquet files: Your data pipelines (such as Apache Spark™, Apache Flink, or custom ETL jobs) write Parquet files to cloud storage in an append-only manner, and you want Snowflake to automatically track and query these files.
You want to minimize ingestion costs: Avoid duplicating large datasets by referencing files in-place instead of scanning and copying them into a table.
You need interoperable access: Perform read and write operations on Iceberg tables with Snowflake while still making the tables accessible to other engines.
You want Snowflake to fully manage and optimize storage: You want Snowflake to automatically perform table optimization – compaction and snapshot expiration.
When to use other ingestion methods¶
This feature isn’t the right choice for every scenario. Consider other ingestion methods when:
Your source files aren’t Parquet: Use the COPY INTO command with
LOAD_MODE = FULL_INGESTto load data from CSV, JSON, Avro, ORC, or other formats. See Übersicht zum Laden von Daten for supported formats.You need to transform data during ingestion: Use the COPY INTO command with
LOAD_MODE = FULL_INGESTand a SELECT statement to apply transformations as you load.Your data sources are streaming: If writers produce data from streaming sources like Apache Kafka instead of writing Parquet files, use Snowpipe Streaming for high-performance, low-latency ingestion into Iceberg tables.
External engines need to modify or delete data: If external engines need to directly update or delete Parquet files and not just append, consider Parquet Direct tables.
Benefits¶
When you use Zero-Copy Parquet to Managed Iceberg:
First-class Snowflake experience: Your registered Parquet files become a fully managed Iceberg table with DML support, automatic table optimization (compaction and snapshot expiration), and access to platform features like replication.
Instant Iceberg interoperability: Turn a bucket of Parquet files into an Iceberg table immediately. Snowflake generates the Iceberg metadata, making your data accessible to any Iceberg-compatible engine through the Horizon Catalog.
No data movement: Files remain in their original cloud storage location.
Significantly lower ingestion cost: Skip full ingestion processing.
No additional storage costs: Avoid duplicating data and associated storage charges.
Continuous sync: Use Snowpipe to automatically add new files as they arrive.
Requirements¶
When you use this option, the source Parquet files must meet the following conditions:
The files must be located under the base location of the table that you want to register them to.
The files must be Iceberg-compatible Apache Parquet files that aren’t registered with an Iceberg catalog.
The files must be in Apache Parquet format. No other file formats are supported.
Wichtig
External Parquet writers must be append-only. If external systems modify or delete files that have been registered with the Iceberg table, you might experience data inconsistencies or query failures.
How it works¶
Zero-Copy Parquet to Managed Iceberg supports use cases where external systems write Parquet files to cloud storage and Snowflake manages the Iceberg table metadata.
The following diagram illustrates the file layout in cloud storage:
When you create a Snowflake-managed Iceberg table with a BASE_LOCATION (the „User Path“), Snowflake
creates a sibling directory called the „Computed Path“ using the same path prefix plus a random
suffix (for example, my-table-1.abc123xyz). Snowflake manages the data/ and metadata/
subdirectories within the computed path. Your existing data lake Parquet files remain in the user
path (your BASE_LOCATION). When you use ADD_FILES_REFERENCE, Snowflake updates the Iceberg
metadata in the computed path to reference those existing Parquet files without moving them.
Any DML operations (INSERT, UPDATE, DELETE, MERGE) and table maintenance operations (compaction, snapshot expiration) write new data files to the computed path, keeping your original data lake files unchanged.
Access patterns¶
This feature is designed for the following access patterns:
System |
Access type |
Description |
|---|---|---|
External Parquet writers |
Append-only |
External engines (Spark, Flink, custom ETL) write new Parquet files to the table’s base location. These systems should only append new files and must not modify or delete existing files. |
Snowflake |
Read/write |
Full read/write access. You can query registered files, run DML operations (INSERT, UPDATE, DELETE, MERGE), and use table optimization features like compaction and clustering. |
External Iceberg readers |
Read-only |
External engines can read the Iceberg table through the Snowflake Horizon Catalog Iceberg REST API. These readers must use their own storage credentials; vended credentials are not yet supported for this use case. |
Syntax¶
Use the following syntax to register Parquet files with a Snowflake-managed Iceberg table using the
ADD_FILES_REFERENCE copy option.
Where:
Wichtig
The external stage must point to a path that is a subpath of, or equal to, the base location for your Iceberg table.
In your COPY INTO statement, you must specify a relative path to the Parquet files that you want to register. The relative path must be located under the base location of the table.
You must set
MATCH_BY_COLUMN_NAMEtoCASE_SENSITIVE.Registering Parquet files by using ADD_FILES_REFERENCE isn’t recommended if those files are already part of another Iceberg table. The best practice for converting externally managed Iceberg tables to Snowflake-managed Iceberg tables without rewriting files is to use the ALTER ICEBERG TABLE … CONVERT TO MANAGED command.
To view the full COPY INTO syntax with parameter descriptions, see COPY INTO <Tabelle>.
End-to-end example¶
This example demonstrates a complete workflow for registering existing Parquet files with a Snowflake-managed Iceberg table and configuring automatic ingestion for new files.
The example uses Amazon S3, but the same concepts apply to other supported cloud storage providers.
Prerequisites¶
Before you begin, ensure you have:
An existing data lake with Parquet files (for example, a Hive-partitioned dataset in Amazon S3).
Appropriate IAM permissions to create storage integrations and external volumes.
The Parquet files are Iceberg-compatible (standard Parquet format without Iceberg-specific metadata).
Step 1: Create an external volume¶
Create an external volume that points to the S3 location containing your Parquet files. The external volume must have write access enabled.
For detailed instructions, see Externes Volume konfigurieren.
Step 2: Create a storage integration and external stage¶
Create a storage integration and external stage that point to the same S3 location. The stage is used for the COPY INTO command.
For detailed instructions, see Option 1: Configure a Snowflake storage integration to access Amazon S3.
Step 3: Create an Iceberg table¶
You can create the Iceberg table with either a declarative schema or an inferred schema.
Option A: Create a table with a declarative schema
If you know the schema of your Parquet files, define it explicitly:
Option B: Create a table with an inferred schema
Use the INFER_SCHEMA function to detect the schema from your Parquet files:
Bemerkung
Using * for ARRAY_AGG(OBJECT_CONSTRUCT()) might result in an error if the
returned result is larger than 16 MB. For larger result sets, specify only the required columns:
COLUMN_NAME, TYPE, and NULLABLE.
Step 4: Register existing Parquet files¶
Use the COPY INTO command with LOAD_MODE = ADD_FILES_REFERENCE to register your existing
Parquet files with the Iceberg table:
Output:
Step 5: Continuously add new files with Snowpipe (optional)¶
To automatically register new Parquet files as they arrive, create a Snowpipe with auto-ingest enabled:
For detailed instructions on configuring event notifications for auto-ingest, see Automatisieren von Snowpipe für Amazon S3.
Step 6: Query the table¶
Query the table to verify that the data was registered successfully:
Monitoring¶
Use the following queries to monitor file registration and pipe status.
View file registration history¶
Use the COPY_HISTORY table function to see which files were successfully registered and which failed:
The status column indicates whether each file was successfully registered:
Loaded: The file was successfully registered with the table.Load failed: The file registration failed. Checkfirst_error_messagefor details.Partially loaded: Some rows were registered, but others failed.
List pipes using ADD_FILES_REFERENCE¶
Query the PIPES Account Usage view to find all pipes
configured to use LOAD_MODE = ADD_FILES_REFERENCE:
To check the current status of a specific pipe, use the PIPE_USAGE_HISTORY view or the SYSTEM$PIPE_STATUS function:
Supported features¶
The following features are supported with Zero-Copy Parquet to Managed Iceberg tables:
Clustering: Define clustering keys to optimize query performance.
Table optimization (compaction, snapshot expiration): Automatic and manual compaction are supported. See testing best practices for important considerations.
Snowpipe with auto-ingest: Use
LOAD_MODE = ADD_FILES_REFERENCEin the pipe definition.Dynamic Tables: The Iceberg table can be used as the source for dynamic tables or dynamic Iceberg tables.
Streams: Standard and append-only streams are supported.
Replication: The Iceberg table can be replicated along with the stage, pipe, storage integration, and external volume.
Horizon Catalog Iceberg REST API: External engines can read the table. Vended credentials are not supported; readers must use their own storage credentials.
Listing Auto Fulfillment: The Iceberg table can be used with Listing Auto Fulfillment.
Iceberg v3: The target Iceberg table can be created as or upgraded to Iceberg v3.
Schema evolution: The Iceberg table supports
ENABLE_SCHEMA_EVOLUTION.Partitioning: Single-argument Iceberg partitioning (
PARTITION BY) is supported for Snowflake DML operations.Materialized views: You can create materialized views on the Iceberg table.
Secure views: You can create secure views on the Iceberg table.
Sharing: The Iceberg table can be shared.
Data protection policies: Masking, row access, projection, aggregation, privacy, and join policies are supported, including enforcement when accessing from Apache Spark.
Snowflake government regions: This feature is available in Snowflake government regions.
Limitations¶
Consider the following limitations when using Zero-Copy Parquet to Managed Iceberg:
File format limitations¶
Parquet only: Only Apache Parquet files are supported. No other file formats (ORC, Avro, CSV, JSON) can be registered using this feature.
Iceberg-compatible Parquet: The Parquet files must be compatible with Iceberg. Files that contain Iceberg-specific row lineage metadata (
_row_idand_last_updated_sequence_number) require theFULL_INGESToption instead.Vectorized scanner required: You must use
FILE_FORMAT = (TYPE = PARQUET USE_VECTORIZED_SCANNER = TRUE).
Table type limitations¶
Snowflake-managed Iceberg tables only: This feature works only with Snowflake-managed Iceberg tables (
CATALOG = 'SNOWFLAKE'). Externally managed Iceberg tables are not supported.Files must be under BASE_LOCATION: All Parquet files must be located under the table’s BASE_LOCATION path in cloud storage.
COPY INTO options not supported¶
The following COPY INTO options are not supported with LOAD_MODE = ADD_FILES_REFERENCE:
Copying unstaged data by specifying a cloud storage location and a storage integration directly.
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE | NONE(must useCASE_SENSITIVE).ON_ERROR = CONTINUE | SKIP_FILE_N | SKIP_FILE_X%.PURGE = TRUE(source files cannot be purged).Transforming or filtering the data before loading. To transform data, use the
FULL_INGESToption instead.
Testing best practices¶
Follow these guidelines when testing Zero-Copy Parquet to Managed Iceberg to avoid data loss or inconsistencies.
Warnung
Do not test ADD_FILES_REFERENCE on files that are also used by Parquet Direct tables
or External Tables. After files are registered and table optimization runs, the files might be
moved or modified, causing issues with other tables referencing the same files.
Why this matters¶
When you use ADD_FILES_REFERENCE, Snowflake registers your Parquet files in the Iceberg
table metadata. If table optimization (compaction) runs, Snowflake might:
Combine multiple small files into larger files
Move files from their original location to the Snowflake-managed
data/subdirectoryRemove the original files after compaction
This behavior is expected for production use but can cause problems during testing if you’re comparing this feature against other approaches using the same source files.
Safe testing approaches¶
Option 1: Copy files to a separate location before testing
Create a copy of your Parquet files in a separate directory before testing:
AWS S3:
Azure Blob Storage:
Google Cloud Storage:
Option 2: Disable compaction during testing
If copying files is not practical, disable automatic compaction on the test table:
Additionally, avoid running any DML operations (INSERT, UPDATE, DELETE, MERGE) during testing, as these can also trigger file modifications.
Bemerkung
Remember to re-enable compaction after testing is complete:
Considerations¶
Consider the following when using ADD_FILES_REFERENCE:
You must fulfill the following prerequisites when using the
LOAD_MODE = ADD_FILES_REFERENCEoption:The target table must be a Snowflake-managed Iceberg table with column data types that are compatible with the source Parquet file data types. For more information, see Datentypen für Apache Iceberg™-Tabellen.
The source file format type must be Iceberg-compatible Parquet, and you must use a vectorized scanner:
FILE_FORMAT = (TYPE = PARQUET USE_VECTORIZED_SCANNER = TRUE).You must set the
MATCH_BY_COLUMN_NAMEoption toCASE_SENSITIVE.
Parquet files that are already part of another Iceberg table should not be registered using
ADD_FILES_REFERENCE. For best practices on converting externally managed Iceberg tables to Snowflake-managed tables without rewriting files, see ALTER ICEBERG TABLE … CONVERT TO MANAGED.For Parquet files that contain row-lineage metadata columns (
_row_idand_last_updated_sequence_number), you must use theFULL_INGESToption. These files are likely already part of an Iceberg v3 table. For more information, see the documentation for loading data into Iceberg tables.