Optimize input data for dynamic tables¶
This page explains how primary keys and clustering help Snowflake process fewer rows during each refresh. Pipeline developers should read this page before building pipelines over base tables that use INSERT OVERWRITE or other full-replacement loading patterns.
Add a primary key with the RELY property to your base table. This lets Snowflake detect which rows actually changed after an INSERT OVERWRITE, rather than treating every row as new:
How primary keys reduce refresh work¶
During each refresh, Snowflake must determine which rows changed in the base tables. Without a primary key, an INSERT OVERWRITE replaces all change-tracking metadata, and Snowflake treats every row as new. This forces a full refresh even when only a small fraction of the data actually changed.
When a base table has a PRIMARY KEY constraint with the RELY property, Snowflake uses those key values as stable row identifiers across updates. It compares primary key values before and after the overwrite, identifies only the rows that actually changed, and processes those rows through downstream dynamic tables.
RELY does not enforce uniqueness
RELY does not prevent future duplicates. Validate uniqueness in your upstream ETL before setting RELY, and repeat the check after every load cycle.
Types of primary keys for dynamic tables¶
Snowflake recognizes two kinds of primary key for change tracking. The following table summarizes when each applies.
| Type | How it is created | When to use |
|---|---|---|
| Base table PRIMARY KEY with RELY | You declare the constraint on the base table and set RELY. | Base tables loaded by INSERT OVERWRITE, COPY INTO, or external ETL. |
| System-derived unique key | Snowflake infers it from the dynamic table’s definition (GROUP BY or QUALIFY ROW_NUMBER() = 1). | Dynamic tables that produce one row per key by construction. |
Base table primary key with RELY¶
When a base table has a PRIMARY KEY constraint with the RELY property, Snowflake uses that key for row-level change tracking in all downstream dynamic tables. This is the primary mechanism for optimizing INSERT OVERWRITE workloads.
System-derived unique key¶
Snowflake can automatically derive a unique key from a dynamic table’s definition. The following constructs produce a system-derived unique key:
- GROUP BY: The grouping columns form the key because each group produces exactly one output row.
- QUALIFY ROW_NUMBER() = 1: The partition-by columns form the key because the filter keeps exactly one row per partition.
- Base table primary key passthrough: If the dynamic table definition passes through a RELY primary key column without applying functions, casts, or expressions, the system-derived unique key propagates to the dynamic table.
To check whether a dynamic table has a system-derived unique key, run SHOW UNIQUE KEYS:
If the query returns no rows, the dynamic table doesn’t have a system-derived unique key. Downstream dynamic tables can use incremental refresh against a full-refresh upstream when at least one of these conditions is met:
- The upstream has a system-derived unique key.
- The upstream has an IMMUTABLE WHERE constraint.
Otherwise, creating a downstream table with REFRESH_MODE = INCREMENTAL fails, and REFRESH_MODE = AUTO resolves to FULL.
Optimize INSERT OVERWRITE workloads¶
INSERT OVERWRITE is the most common loading pattern that benefits from primary keys. Without a primary key, every INSERT OVERWRITE looks like a full delete-and-reinsert to Snowflake, even when only a few rows changed.
The following example shows the recommended pattern. The base table declares a PRIMARY KEY with RELY, and the downstream dynamic table uses incremental refresh:
When the external ETL process runs INSERT OVERWRITE on dim_customers, Snowflake compares primary key
values before and after the overwrite, identifies only the changed dimension rows, and refreshes only the
fact rows that join to those changed dimensions.
Note
If adding a primary key to your base table isn’t feasible, consider using ADAPTIVE refresh as an alternative. ADAPTIVE can handle INSERT OVERWRITE workloads without requiring primary keys on the base table.
How base table replacement methods affect primary keys
Not all table-replacement patterns preserve the primary key constraint and change-tracking history.
| Method | Primary key preserved? | Change history preserved? | Atomic? | Effect on downstream dynamic tables |
|---|---|---|---|---|
| INSERT … OVERWRITE (recommended) | Yes | Yes | Yes | Only changed rows are processed. |
| TRUNCATE + INSERT/COPY INTO | Yes | Yes | No (table is empty between the two statements). | Same as INSERT OVERWRITE once the load completes, but a refresh that runs mid-load sees an empty table. |
| CREATE OR REPLACE TABLE | No (primary key is dropped unless redeclared). | No (all streams become stale). | Yes | Full reinitialization required. Even if you redeclare the primary key, there is no before-state to compare against on the initial refresh. |
Use INSERT OVERWRITE for base tables that feed dynamic table pipelines. Avoid CREATE OR REPLACE on tables that already have downstream dynamic tables.
Enable incremental refresh downstream of full refresh¶
Normally, a dynamic table with REFRESH_MODE = INCREMENTAL can’t read from a dynamic table with
REFRESH_MODE = FULL, because the full refresh discards change-tracking metadata. When the upstream full
refresh dynamic table has a system-derived unique key, Snowflake can compute the differences between full
refreshes, and downstream tables can refresh incrementally.
Set REFRESH_MODE = INCREMENTAL explicitly
To use incremental refresh downstream of a full refresh dynamic table with a system-derived unique key,
you must explicitly set REFRESH_MODE = INCREMENTAL on the downstream table. Setting REFRESH_MODE = AUTO
still resolves to FULL in this scenario.
Example: GROUP BY produces a system-derived unique key¶
Verify the system-derived unique key¶
Before creating a downstream incremental dynamic table, verify that the upstream table has a system-derived unique key:
If SHOW UNIQUE KEYS returns no rows, the downstream CREATE with REFRESH_MODE = INCREMENTAL fails:
Pair primary keys with clustering¶
Primary keys tell Snowflake which rows changed. Clustering determines how efficiently Snowflake can locate those rows. When the base table is clustered on or near the primary key columns, Snowflake can typically prune more micro-partitions during refresh and read less data.
For the best results:
- Cluster base tables on the primary key column or the column most frequently used in join conditions with downstream dynamic tables.
- If the base table already has a clustering key for query performance, check whether it overlaps with the primary key. Overlapping keys benefit both query pruning and refresh pruning.
- Be aware that background reclustering creates new micro-partitions, which can cause temporary spikes in refresh duration. These spikes are transient and resolve after reclustering completes.
For general clustering guidance, see Micro-partitions & Data Clustering.
Detect primary-key tracking degradation¶
Primary key-based change tracking can degrade or stop working without producing an error. Snowflake doesn’t raise an exception in these cases. Instead, it falls back to standard change-tracking columns, which eliminates the performance benefit of primary key-based tracking.
| Scenario | What happens | How to detect |
|---|---|---|
| Duplicate primary key values in the base table | Change tracking may produce unexpected results (rows miscounted or missed). | Run a GROUP BY / HAVING query on the primary key column before setting RELY. |
| Masking policy on a primary key column | Snowflake can’t read the key values and falls back to standard change-tracking columns. | Check whether masking policies are applied to any primary key column. |
| ALTER TABLE drops the primary key constraint | The system-derived unique key disappears from downstream dynamic tables. | Run SHOW UNIQUE KEYS after schema changes. |
Masking policy on a primary key column¶
When a masking policy obfuscates a primary key column, Snowflake can’t use that column for change tracking. The dynamic table continues to refresh, but Snowflake falls back to standard change-tracking columns. For INSERT OVERWRITE workloads, this fallback means every refresh processes all rows, because INSERT OVERWRITE resets the standard tracking columns. For normal DML operations (INSERT, UPDATE, DELETE), the standard change-tracking columns can still detect row-level changes.
To detect this, check whether any masking policies are applied to primary key columns:
If a masking policy is applied to a primary key column, primary key-based change tracking is disabled regardless of SHOW UNIQUE KEYS output. Remove the masking policy from the primary key column or move it to a non-key column.
Try it: Compare refresh with and without primary keys¶
This tutorial builds two pipelines with the same data and the same join query. The only difference is whether the dimension table has a primary key. You then simulate a dimension table rewrite with INSERT OVERWRITE and compare the refresh performance.
Prerequisites¶
- A Snowflake account with privileges to create databases, schemas, and dynamic tables.
- An X-Small warehouse. The tutorial uses
transform_whin all examples.
Step 1: Create the source data¶
Insert 100,000 products into both dimension tables and 10 million orders into the fact table:
Step 2: Create two pipelines and run the initial refresh¶
Create one pipeline with the primary key dimension table and one without. Both use the same join query.
Step 3: Simulate INSERT OVERWRITE and compare¶
Rewrite both dimension tables, changing the price for 10% of products (Category 01) while keeping all other rows identical:
Refresh both pipelines and compare their performance:
Check the refresh history for each:
The pipeline with the primary key processed only the 1 million fact rows that reference the changed 10% of dimension rows. The pipeline without a primary key refreshed all 10 million rows because Snowflake couldn’t distinguish changed rows from unchanged rows after the INSERT OVERWRITE.
Tip
The performance gap widens as fact tables grow and the fraction of changed dimension rows shrinks. In production pipelines with millions of dimension rows and single-digit percentage changes per load cycle, primary key-based change tracking can reduce refresh times by an order of magnitude.
Clean up¶
What’s next¶
- Optimize queries for incremental refresh for query-level tuning of incremental refresh.
- Immutability constraints and backfill to skip unchanged historical partitions.
- Monitor dynamic tables to track refresh duration and detect regressions.
- Understanding costs for dynamic tables to understand the credit impact of refresh modes.
- Quick-start best practices for dynamic tables
