Add Parquet files in-place to a Snowflake Managed Apache Iceberg™ table¶

Note

This feature is in private preview. To request access, contact your Snowflake account team.

Zero-Copy Parquet to Managed Iceberg lets you register existing Apache Parquet files in-place with a Snowflake-managed Apache Iceberg™ table without copying or moving the data. This private preview feature uses the LOAD_MODE = ADD_FILES_REFERENCE option with the COPY INTO <table> command or Snowpipe to detect new Parquet files in-place and generate corresponding Iceberg metadata.

Overview¶

When to use this feature¶

Use Zero-Copy Parquet to Managed Iceberg when:

You’re migrating from a Hive or Spark-based data lake: You have existing Parquet files in cloud storage and want to bring them into a fully managed Snowflake Iceberg table—with DML, optimization, and platform integrations—without reorganizing or copying your data.
External systems produce Parquet files: Your data pipelines (such as Apache Spark™, Apache Flink, or custom ETL jobs) write Parquet files to cloud storage in an append-only manner, and you want Snowflake to automatically track and query these files.
You want to minimize ingestion costs: Avoid duplicating large datasets by referencing files in-place instead of scanning and copying them into a table.
You need interoperable access: Perform read and write operations on Iceberg tables with Snowflake while still making the tables accessible to other engines.
You want Snowflake to fully manage and optimize storage: You want Snowflake to automatically perform table optimization – compaction and snapshot expiration.

When to use other ingestion methods¶

This feature isn’t the right choice for every scenario. Consider other ingestion methods when:

Your source files aren’t Parquet: Use the COPY INTO command with LOAD_MODE = FULL_INGEST to load data from CSV, JSON, Avro, ORC, or other formats. See Vue d’ensemble du chargement de données for supported formats.
You need to transform data during ingestion: Use the COPY INTO command with LOAD_MODE = FULL_INGEST and a SELECT statement to apply transformations as you load.
Your data sources are streaming: If writers produce data from streaming sources like Apache Kafka instead of writing Parquet files, use Snowpipe Streaming for high-performance, low-latency ingestion into Iceberg tables.
External engines need to modify or delete data: If external engines need to directly update or delete Parquet files and not just append, consider Parquet Direct tables.

Benefits¶

When you use Zero-Copy Parquet to Managed Iceberg:

First-class Snowflake experience: Your registered Parquet files become a fully managed Iceberg table with DML support, automatic table optimization (compaction and snapshot expiration), and access to platform features like replication.
Instant Iceberg interoperability: Turn a bucket of Parquet files into an Iceberg table immediately. Snowflake generates the Iceberg metadata, making your data accessible to any Iceberg-compatible engine through the Horizon Catalog.
No data movement: Files remain in their original cloud storage location.
Significantly lower ingestion cost: Skip full ingestion processing.
No additional storage costs: Avoid duplicating data and associated storage charges.
Continuous sync: Use Snowpipe to automatically add new files as they arrive.

Requirements¶

When you use this option, the source Parquet files must meet the following conditions:

The files must be located under the base location of the table that you want to register them to.
The files must be Iceberg-compatible Apache Parquet files that aren’t registered with an Iceberg catalog.
The files must be in Apache Parquet format. No other file formats are supported.

Important

External Parquet writers must be append-only. If external systems modify or delete files that have been registered with the Iceberg table, you might experience data inconsistencies or query failures.

How it works¶

Zero-Copy Parquet to Managed Iceberg supports use cases where external systems write Parquet files to cloud storage and Snowflake manages the Iceberg table metadata.

The following diagram illustrates the file layout in cloud storage:

s3://my-bucket/my-data-lake/
├── my-table-1/                                    <-- User Path (BASE_LOCATION you specify)
│   ├── year=2024/                                 <-- Existing Parquet from your data lake
│   │   ├── month=01/
│   │   │   └── data_001.parquet
│   │   └── month=02/
│   │       └── data_002.parquet
│   └── year=2025/
│       └── month=01/
│           └── data_003.parquet
└── my-table-1.abc123xyz/                                <-- Computed Path (Snowflake creates this)
    ├── data/                                      <-- Snowflake DML writes here
    │   └── ...
    └── metadata/                                  <-- Iceberg metadata files
        ├── v1.metadata.json
        └── ...

When you create a Snowflake-managed Iceberg table with a BASE_LOCATION (the « User Path »), Snowflake creates a sibling directory called the « Computed Path » using the same path prefix plus a random suffix (for example, my-table-1.abc123xyz). Snowflake manages the data/ and metadata/ subdirectories within the computed path. Your existing data lake Parquet files remain in the user path (your BASE_LOCATION). When you use ADD_FILES_REFERENCE, Snowflake updates the Iceberg metadata in the computed path to reference those existing Parquet files without moving them.

Any DML operations (INSERT, UPDATE, DELETE, MERGE) and table maintenance operations (compaction, snapshot expiration) write new data files to the computed path, keeping your original data lake files unchanged.

Access patterns¶

This feature is designed for the following access patterns:


System	Access type	Description
External Parquet writers	Append-only	External engines (Spark, Flink, custom ETL) write new Parquet files to the table’s base location. These systems should only append new files and must not modify or delete existing files.
Snowflake	Read/write	Full read/write access. You can query registered files, run DML operations (INSERT, UPDATE, DELETE, MERGE), and use table optimization features like compaction and clustering.
External Iceberg readers	Read-only	External engines can read the Iceberg table through the Snowflake Horizon Catalog Iceberg REST API. These readers must use their own storage credentials; vended credentials are not yet supported for this use case.

Syntax¶

Use the following syntax to register Parquet files with a Snowflake-managed Iceberg table using the ADD_FILES_REFERENCE copy option.

COPY INTO [<namespace>.]<table_name>
  FROM externalStage
  FILE_FORMAT = ( { FORMAT_NAME = '[<namespace>.]<file_format_name>'
    | TYPE = PARQUET  USE_VECTORIZED_SCANNER = TRUE } )
  LOAD_MODE = ADD_FILES_REFERENCE
  MATCH_BY_COLUMN_NAME = CASE_SENSITIVE

Where:

externalStage ::=
  @[<namespace>.]<ext_stage_name>/<relative-path-from-table-base-location>

Important

The external stage must point to a path that is a subpath of, or equal to, the base location for your Iceberg table.
In your COPY INTO statement, you must specify a relative path to the Parquet files that you want to register. The relative path must be located under the base location of the table.
You must set MATCH_BY_COLUMN_NAME to CASE_SENSITIVE.
Registering Parquet files by using ADD_FILES_REFERENCE isn’t recommended if those files are already part of another Iceberg table. The best practice for converting externally managed Iceberg tables to Snowflake-managed Iceberg tables without rewriting files is to use the ALTER ICEBERG TABLE … CONVERT TO MANAGED command.

To view the full COPY INTO syntax with parameter descriptions, see COPY INTO <table>.

End-to-end example¶

This example demonstrates a complete workflow for registering existing Parquet files with a Snowflake-managed Iceberg table and configuring automatic ingestion for new files.

The example uses Amazon S3, but the same concepts apply to other supported cloud storage providers.

Prerequisites¶

Before you begin, ensure you have:

An existing data lake with Parquet files (for example, a Hive-partitioned dataset in Amazon S3).
Appropriate IAM permissions to create storage integrations and external volumes.
The Parquet files are Iceberg-compatible (standard Parquet format without Iceberg-specific metadata).

Step 1: Create an external volume¶

Create an external volume that points to the S3 location containing your Parquet files. The external volume must have write access enabled.

CREATE OR REPLACE EXTERNAL VOLUME my_external_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-location'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://my-bucket/my-data-lake/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-role'
    )
  )
  ALLOW_WRITES = TRUE;

For detailed instructions, see Configurer un volume externe.

Step 2: Create a storage integration and external stage¶

Create a storage integration and external stage that point to the same S3 location. The stage is used for the COPY INTO command.

CREATE OR REPLACE STORAGE INTEGRATION my_storage_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/my-data-lake/')
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-role'
  ENABLED = TRUE;

CREATE OR REPLACE FILE FORMAT my_parquet_format
  TYPE = PARQUET
  USE_VECTORIZED_SCANNER = TRUE;

CREATE OR REPLACE STAGE my_parquet_stage
  STORAGE_INTEGRATION = my_storage_integration
  URL = 's3://my-bucket/my-data-lake/'
  FILE_FORMAT = my_parquet_format;

For detailed instructions, see Option 1: Configure a Snowflake storage integration to access Amazon S3.

Step 3: Create an Iceberg table¶

You can create the Iceberg table with either a declarative schema or an inferred schema.

Option A: Create a table with a declarative schema

If you know the schema of your Parquet files, define it explicitly:

CREATE OR REPLACE ICEBERG TABLE my_table (
  id INTEGER,
  name STRING,
  created_date DATE,
  amount DOUBLE
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_external_volume'
  BASE_LOCATION = 'my-table-1/';

Option B: Create a table with an inferred schema

Use the INFER_SCHEMA function to detect the schema from your Parquet files:

-- First, preview the inferred schema
SELECT *
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION => '@my_parquet_stage/my-table-1/',
      FILE_FORMAT => 'my_parquet_format',
      KIND => 'ICEBERG'
    )
  );

-- Create the table using the inferred schema
CREATE OR REPLACE ICEBERG TABLE my_table
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
      WITHIN GROUP (ORDER BY order_id)
      FROM TABLE(
        INFER_SCHEMA(
          LOCATION => '@my_parquet_stage/my-table-1/',
          FILE_FORMAT => 'my_parquet_format',
          KIND => 'ICEBERG'
        )
      )
  )
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_external_volume'
  BASE_LOCATION = 'my-table-1/';

Note

Using * for ARRAY_AGG(OBJECT_CONSTRUCT()) might result in an error if the returned result is larger than 16 MB. For larger result sets, specify only the required columns: COLUMN_NAME, TYPE, and NULLABLE.

Step 4: Register existing Parquet files¶

Use the COPY INTO command with LOAD_MODE = ADD_FILES_REFERENCE to register your existing Parquet files with the Iceberg table:

COPY INTO my_table
  FROM @my_parquet_stage/my-table-1/
  FILE_FORMAT = (TYPE = PARQUET USE_VECTORIZED_SCANNER = TRUE)
  LOAD_MODE = ADD_FILES_REFERENCE
  MATCH_BY_COLUMN_NAME = CASE_SENSITIVE;

Output:

+-----------------------------------------------+--------+-------------+-------------+
| file                                          | status | rows_parsed | rows_loaded |
|-----------------------------------------------+--------+-------------+-------------|
| my-table-1/year=2024/month=01/data_001.parquet | LOADED |       10000 |       10000 |
| my-table-1/year=2024/month=02/data_002.parquet | LOADED |       15000 |       15000 |
+-----------------------------------------------+--------+-------------+-------------+

Step 5: Continuously add new files with Snowpipe (optional)¶

To automatically register new Parquet files as they arrive, create a Snowpipe with auto-ingest enabled:

CREATE OR REPLACE PIPE my_table_pipe
  AUTO_INGEST = TRUE
  AS
    COPY INTO my_table
      FROM @my_parquet_stage/my-table-1/
      FILE_FORMAT = (TYPE = PARQUET USE_VECTORIZED_SCANNER = TRUE)
      LOAD_MODE = ADD_FILES_REFERENCE
      MATCH_BY_COLUMN_NAME = CASE_SENSITIVE;

For detailed instructions on configuring event notifications for auto-ingest, see Automatisation de Snowpipe pour Amazon S3.

Step 6: Query the table¶

Query the table to verify that the data was registered successfully:

SELECT * FROM my_table LIMIT 10;

Monitoring¶

Use the following queries to monitor file registration and pipe status.

View file registration history¶

Use the COPY_HISTORY table function to see which files were successfully registered and which failed:

-- Show recent file registration history for a table
SELECT
    file_name,
    status,
    row_count,
    first_error_message,
    last_load_time
  FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'my_table',
    START_TIME => DATEADD(hours, -24, CURRENT_TIMESTAMP())
  ))
  ORDER BY last_load_time DESC;

The status column indicates whether each file was successfully registered:

Loaded: The file was successfully registered with the table.
Load failed: The file registration failed. Check first_error_message for details.
Partially loaded: Some rows were registered, but others failed.

List pipes using ADD_FILES_REFERENCE¶

Query the PIPES Account Usage view to find all pipes configured to use LOAD_MODE = ADD_FILES_REFERENCE:

-- List pipes using ADD_FILES_REFERENCE with their target tables and status
SELECT
    pipe_catalog AS database_name,
    pipe_schema AS schema_name,
    pipe_name,
    definition,
    is_autoingest_enabled,
    notification_channel_name
  FROM SNOWFLAKE.ACCOUNT_USAGE.PIPES
  WHERE UPPER(definition) LIKE '%ADD_FILES_REFERENCE%'
  ORDER BY created;

To check the current status of a specific pipe, use the PIPE_USAGE_HISTORY view or the SYSTEM$PIPE_STATUS function:

-- Check the status of a specific pipe
SELECT SYSTEM$PIPE_STATUS('my_table_pipe');

Supported features¶

The following features are supported with Zero-Copy Parquet to Managed Iceberg tables:

Clustering: Define clustering keys to optimize query performance.
Table optimization (compaction, snapshot expiration): Automatic and manual compaction are supported. See testing best practices for important considerations.
Snowpipe with auto-ingest: Use LOAD_MODE = ADD_FILES_REFERENCE in the pipe definition.
Dynamic Tables: The Iceberg table can be used as the source for dynamic tables or dynamic Iceberg tables.
Streams: Standard and append-only streams are supported.
Replication: The Iceberg table can be replicated along with the stage, pipe, storage integration, and external volume.
Horizon Catalog Iceberg REST API: External engines can read the table. Vended credentials are not supported; readers must use their own storage credentials.
Listing Auto Fulfillment: The Iceberg table can be used with Listing Auto Fulfillment.
Iceberg v3: The target Iceberg table can be created as or upgraded to Iceberg v3.
Schema evolution: The Iceberg table supports ENABLE_SCHEMA_EVOLUTION.
Partitioning: Single-argument Iceberg partitioning (PARTITION BY) is supported for Snowflake DML operations.
Materialized views: You can create materialized views on the Iceberg table.
Secure views: You can create secure views on the Iceberg table.
Sharing: The Iceberg table can be shared.
Data protection policies: Masking, row access, projection, aggregation, privacy, and join policies are supported, including enforcement when accessing from Apache Spark.
Snowflake government regions: This feature is available in Snowflake government regions.

Limitations¶

Consider the following limitations when using Zero-Copy Parquet to Managed Iceberg:

File format limitations¶

Parquet only: Only Apache Parquet files are supported. No other file formats (ORC, Avro, CSV, JSON) can be registered using this feature.
Iceberg-compatible Parquet: The Parquet files must be compatible with Iceberg. Files that contain Iceberg-specific row lineage metadata (_row_id and _last_updated_sequence_number) require the FULL_INGEST option instead.
Vectorized scanner required: You must use FILE_FORMAT = (TYPE = PARQUET USE_VECTORIZED_SCANNER = TRUE).

Table type limitations¶

Snowflake-managed Iceberg tables only: This feature works only with Snowflake-managed Iceberg tables (CATALOG = 'SNOWFLAKE'). Externally managed Iceberg tables are not supported.
Files must be under BASE_LOCATION: All Parquet files must be located under the table’s BASE_LOCATION path in cloud storage.

COPY INTO options not supported¶

The following COPY INTO options are not supported with LOAD_MODE = ADD_FILES_REFERENCE:

Copying unstaged data by specifying a cloud storage location and a storage integration directly.
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE | NONE (must use CASE_SENSITIVE).
ON_ERROR = CONTINUE | SKIP_FILE_N | SKIP_FILE_X%.
PURGE = TRUE (source files cannot be purged).
Transforming or filtering the data before loading. To transform data, use the FULL_INGEST option instead.

Testing best practices¶

Follow these guidelines when testing Zero-Copy Parquet to Managed Iceberg to avoid data loss or inconsistencies.

Avertissement

Do not test ADD_FILES_REFERENCE on files that are also used by Parquet Direct tables or External Tables. After files are registered and table optimization runs, the files might be moved or modified, causing issues with other tables referencing the same files.

Why this matters¶

When you use ADD_FILES_REFERENCE, Snowflake registers your Parquet files in the Iceberg table metadata. If table optimization (compaction) runs, Snowflake might:

Combine multiple small files into larger files
Move files from their original location to the Snowflake-managed data/ subdirectory
Remove the original files after compaction

This behavior is expected for production use but can cause problems during testing if you’re comparing this feature against other approaches using the same source files.

Safe testing approaches¶

Option 1: Copy files to a separate location before testing

Create a copy of your Parquet files in a separate directory before testing:

AWS S3:

aws s3 cp s3://my-bucket/production-data/ s3://my-bucket/test-data/ --recursive

Azure Blob Storage:

az storage blob copy start-batch \
  --source-container production-data \
  --destination-container test-data \
  --account-name mystorageaccount

Google Cloud Storage:

gsutil -m cp -r gs://my-bucket/production-data/* gs://my-bucket/test-data/

Option 2: Disable compaction during testing

If copying files is not practical, disable automatic compaction on the test table:

ALTER ICEBERG TABLE my_test_table
  SET ENABLE_DATA_COMPACTION = FALSE;

Additionally, avoid running any DML operations (INSERT, UPDATE, DELETE, MERGE) during testing, as these can also trigger file modifications.

Note

Remember to re-enable compaction after testing is complete:

ALTER ICEBERG TABLE my_test_table
  SET ENABLE_DATA_COMPACTION = TRUE;

Considerations¶

Consider the following when using ADD_FILES_REFERENCE:

You must fulfill the following prerequisites when using the LOAD_MODE = ADD_FILES_REFERENCE option:
- The target table must be a Snowflake-managed Iceberg table with column data types that are compatible with the source Parquet file data types. For more information, see Types de données des tables Apache Iceberg™.
- The source file format type must be Iceberg-compatible Parquet, and you must use a vectorized scanner: FILE_FORMAT = (TYPE = PARQUET USE_VECTORIZED_SCANNER = TRUE).
- You must set the MATCH_BY_COLUMN_NAME option to CASE_SENSITIVE.
Parquet files that are already part of another Iceberg table should not be registered using ADD_FILES_REFERENCE. For best practices on converting externally managed Iceberg tables to Snowflake-managed tables without rewriting files, see ALTER ICEBERG TABLE … CONVERT TO MANAGED.
For Parquet files that contain row-lineage metadata columns (_row_id and _last_updated_sequence_number), you must use the FULL_INGEST option. These files are likely already part of an Iceberg v3 table. For more information, see the documentation for loading data into Iceberg tables.