How refresh mode affects dynamic table performance¶

A dynamic table’s actual refresh mode is determined at creation time and is immutable afterward. If you don’t specify a mode explicitly, it defaults to AUTO, which selects a refresh mode based on factors such as query complexity and the presence of constructs, operators, or functions that incremental refresh doesn’t support.

Tip

To determine the best mode for your use case, experiment with both refresh modes and compare them against what AUTO recommends. For consistent behavior across Snowflake releases, explicitly set the refresh mode on all dynamic tables.
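For example, you can pin the mode when you create the table and then check which mode Snowflake actually chose. The names below (`orders_summary`, `my_wh`, `orders`) are illustrative:

```sql
-- Set the refresh mode explicitly at creation time; it can't be changed later.
CREATE OR REPLACE DYNAMIC TABLE orders_summary
  TARGET_LAG = '5 minutes'
  WAREHOUSE = my_wh
  REFRESH_MODE = INCREMENTAL   -- or FULL, or the default AUTO
  AS
    SELECT order_date, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date;

-- Check the mode actually in use; the output includes a refresh_mode column
-- (and a reason column when AUTO made the choice for you).
SHOW DYNAMIC TABLES LIKE 'orders_summary';
```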

Full refresh mode performance¶

A full refresh runs the dynamic table’s query and overwrites the table with the results. You can optimize full refresh performance like any other Snowflake query. Note that the cost includes both running the query and inserting the results, so don’t compare it directly with the cost of running the query alone.

Incremental refresh mode performance¶

An incremental refresh computes the changes in the query’s result and merges them into the dynamic table. It’s a powerful feature, but it works best under certain conditions.

The following sections explain what makes a workload suitable for incremental refresh. If your workload doesn’t fit the conditions they describe, try full refresh mode for potentially better efficiency.

Note

The recommendations in this documentation might change over time as support for incremental refresh queries and their performance improves.

Understanding incremental refresh performance¶

In an incremental refresh, most of the effort usually goes into computing the changes in the query’s result, which depends on the query and can be quite complex. A common misunderstanding is that an incremental refresh scans only the changes in the source tables, never the source tables themselves. This leads to the misconception that an incremental refresh should do work proportional only to the amount of source data that changed, which isn’t true: incremental refreshes often need to scan the source tables directly.

For example, imagine a query that performs an inner join between tables A and B. If a row is inserted into table A, it must be joined with table B to compute the change in the query’s result. That single row in A can join with many rows in B, causing a lot of extra work relative to the size of the change in the sources.
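Conceptually, the change in a join’s result is assembled from the changes to each input, along the lines of the following sketch. This is an illustration of the idea, not Snowflake’s actual plan; `a_changes` and `b_changes` stand for hypothetical change sets, and `a_before` for table A’s pre-change state:

```sql
-- Illustrative only: one way to express the change in (A JOIN B).
-- Each changed row on one side must be joined against the whole other side,
-- which is why a small source change can still cause significant work.
SELECT c.id, c.a_val, b.b_val
FROM a_changes AS c
JOIN b ON c.id = b.id                -- changes to A, joined with all of B
UNION ALL
SELECT a.id, a.a_val, c.b_val
FROM a_before AS a
JOIN b_changes AS c ON a.id = c.id;  -- pre-change A, joined with changes to B
```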

This extra work can happen with various operators. So, instead of redoing work, an incremental refresh skips work that’s already been done. Deciding what to skip takes effort, especially for complex queries, and different operators skip work in different ways.

Typically, the size of changes and locality impact how much work can be skipped.

How size affects incremental refresh performance¶

The most significant factor that affects incremental refresh performance is the size of the changes in the source data.

As an extreme example, consider deleting all of the data from a source table: an incremental refresh has to read every deleted row, making it slower, whereas a full refresh just sees an empty table, so it’s faster.

Similar slowdowns can occur in less extreme cases too. A good guideline for a favorable workload is to keep changes in the source or target to less than 5% of the rows.
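To gauge how large your changes are relative to the table, you can inspect the refresh history. A sketch, assuming the `INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY` table function and the row-count fields of its `STATISTICS` column; verify the field names against the output in your account:

```sql
-- How many rows did recent refreshes insert and delete?
SELECT
  name,
  refresh_start_time,
  statistics:numInsertedRows::NUMBER AS inserted_rows,
  statistics:numDeletedRows::NUMBER  AS deleted_rows
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(
  NAME => 'ORDERS_SUMMARY'))
ORDER BY refresh_start_time DESC
LIMIT 10;
```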

How locality affects incremental refresh performance¶

The second most significant factor that affects incremental refresh is locality, which refers to how closely related data or actions are along different dimensions.

For instance, if you have a table with a timestamp column and you always insert rows with the current time in that column, then your workload has strong locality between insertion order and the timestamp column.
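A minimal illustration of such a schema (names are hypothetical):

```sql
-- Each insert stamps the current time, so rows arrive in roughly
-- increasing event_ts order: strong locality between insertion
-- order and the timestamp column.
CREATE OR REPLACE TABLE events (
  event_id NUMBER,
  payload  VARCHAR,
  event_ts TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
```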

Locality can show up in various forms, but a few are especially important for incremental refresh. Improving locality in any of the following areas boosts the performance of incremental refreshes, although it’s not always possible to have strong locality in all three categories.

| Locality area | Description |
| --- | --- |
| Locality between clustering and partitioning keys | When the dynamic table definition performs a partitioning operation, it’s beneficial if the underlying sources are clustered on those partitioning keys. For example, if you’re joining two tables using an ID, incremental refresh performs better when the tables are clustered by their ID columns. |
| Locality between partitioning or grouping keys and source changes | Ideally, changes to the sources should mostly involve rows that group with nearby rows. For instance, if you’re inserting rows with the current timestamp, grouping by hour works well because of the strong locality between the keys and the source changes. However, if you’re inserting rows with a column value that appears in many other rows across the table, grouping by that column results in poor incremental refresh performance. |
| Locality between target table changes and clustering | When an incremental refresh applies changes to a dynamic table, updates and deletions are joined against the current state of the table. This join performs better if the changes align with the dynamic table’s clustering. For instance, if refreshes only update recently inserted rows, they align well with the table’s clustering. |

For information on how Snowflake tables are stored, see Understanding Snowflake Table Structures. To manage clustering on a table, use Automatic Clustering.
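For example, to strengthen the first two kinds of locality, you might cluster the join inputs on the join key and group on a time bucket that tracks insertion order. A sketch with illustrative names (`orders`, `customers`, and the `events` table from the earlier example):

```sql
-- Cluster both join inputs by the join key so an incremental refresh
-- can prune micro-partitions on either side of the join.
ALTER TABLE orders    CLUSTER BY (customer_id);
ALTER TABLE customers CLUSTER BY (customer_id);

-- Group by a time bucket that aligns with insertion order, rather than
-- by a key whose values are scattered across the whole table.
CREATE OR REPLACE DYNAMIC TABLE hourly_event_counts
  TARGET_LAG = '1 hour'
  WAREHOUSE = my_wh
  REFRESH_MODE = INCREMENTAL
  AS
    SELECT DATE_TRUNC('hour', event_ts) AS event_hour, COUNT(*) AS event_count
    FROM events
    GROUP BY event_hour;
```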

Performance expectations for incremental refreshes of individual operators¶

The following table shows the approximate performance expectations for incremental refreshes of individual operators. Performance is measured relative to a full refresh with the assumption that only 5% of rows have changed.

| Operator | Performance increase |
| --- | --- |
| SELECT | 10x |
| WHERE | 10x |
| FROM | 10x |
| UNION ALL | 10x |
| Scalar aggregates | 10x |

For operators that are affected by locality, the table shows performance expectations with good and poor locality. Note that, for some operators, poor locality can lead to worse performance than full refreshes.

| Operator | Locality | Performance increase |
| --- | --- | --- |
| GROUP BY | Good | 5x |
| GROUP BY | Poor | 1/3x |
| DISTINCT | Good | 5x |
| DISTINCT | Poor | 1/4x |
| OVER | Good | 2-5x |
| OVER | Poor | 1/5x |
| INNER JOIN | Good | 10x |
| INNER JOIN | Poor | 2x |
| OUTER JOIN | Good | 3x |
| OUTER JOIN | Poor | 0.1x |

For more information, see How operators incrementally refresh.

Optimizing incremental refresh mode performance for complex dynamic tables¶

Dynamic tables typically contain several operators, making it harder to predict their performance with incremental refreshes. This section explores how to handle this challenge and gives some tips for enhancing the performance of complex dynamic tables.

When there are multiple operators involved, incremental refreshes compute changes by working on each operator separately, turning it into a fragment of a query plan that can compute changes based on its inputs. For each input, this new fragment can request the input before changes, after changes, or just the changes themselves. By applying this process to every operator in the original query, you get a new query plan that computes changes using a mix of change-scans and full table scans. This plan is optimized by Snowflake’s query optimizer and executed like any other query.
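The change-scans in such a plan are loosely analogous to Snowflake’s user-facing CHANGES clause, which reads a table’s change set over an interval instead of the whole table. A rough analogy only (assumes change tracking is enabled on `orders`):

```sql
-- Loose analogy to a change-scan: read only what changed in the last hour,
-- rather than scanning the full table.
SELECT *
FROM orders
  CHANGES (INFORMATION => DEFAULT)
  AT (OFFSET => -3600);
```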

When a complex incremental refresh doesn’t perform well, it’s usually because there are too many changes or the locality is poor.

In complex queries, each operator’s performance depends on the amount of change in its inputs and on the locality of those inputs. However, these inputs are the outputs of other operators, so the amount and locality of change can shift as data moves through the plan.

Here are some common scenarios and suggestions to address them:

| Scenario | Recommendation |
| --- | --- |
| When joining tables on multiple columns, you can’t use CLUSTER BY on all of them simultaneously. | Prioritize clustering larger tables by keys that change frequently. For instance, in a star schema with a big dimension table, focus on clustering the dimension table. Alternatively, consider creating multiple copies of the same dataset, each clustered by different keys, and using each copy in the relevant context. |
| A GROUP BY or OVER sits above many joins. | Ensure the source tables are clustered by the grouping or partitioning keys, and consider splitting the joins and the aggregation into two separate dynamic tables, as shown in the sketch after this table. Note that outer joins interact poorly with aggregations. |
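For instance, the split suggested in the second row might look like the following sketch (illustrative names; `TARGET_LAG = DOWNSTREAM` lets the first table refresh only when the second one needs it):

```sql
-- Stage 1: joins only. A separate dynamic table lets the join's
-- incremental refresh exploit clustering on the join keys.
CREATE OR REPLACE DYNAMIC TABLE order_facts
  TARGET_LAG = DOWNSTREAM
  WAREHOUSE = my_wh
  REFRESH_MODE = INCREMENTAL
  AS
    SELECT o.order_id, o.amount, c.region
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id;

-- Stage 2: aggregation over the joined result.
CREATE OR REPLACE DYNAMIC TABLE region_totals
  TARGET_LAG = '10 minutes'
  WAREHOUSE = my_wh
  REFRESH_MODE = INCREMENTAL
  AS
    SELECT region, SUM(amount) AS total_amount
    FROM order_facts
    GROUP BY region;
```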