How refresh mode affects dynamic table performance¶
A dynamic table’s actual refresh mode is determined at creation time and is immutable afterward. If not specified explicitly, the refresh mode defaults to AUTO, which selects a refresh mode based on factors such as query complexity and the presence of unsupported constructs, operators, or functions.
To verify the refresh mode for your dynamic tables, see View dynamic table refresh mode.
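For a quick check in SQL, you can inspect the refresh_mode column in the output of SHOW DYNAMIC TABLES (a minimal sketch; my_dt is a hypothetical table name):

```sql
-- The output includes a refresh_mode column showing the mode
-- Snowflake actually chose for the table.
SHOW DYNAMIC TABLES LIKE 'my_dt';
```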
Tip
To determine the best mode for your use case, experiment with automatic recommendations and the concrete refresh modes (full and incremental). The best mode for your dynamic tables’ performance depends on data change volume and query complexity. Additionally, testing different refresh modes with a dedicated warehouse helps isolate costs and improve performance tuning based on actual workloads.
For consistent behavior across Snowflake releases, explicitly set the refresh mode on all production tables. The behavior of AUTO might change between Snowflake releases, which can cause unexpected changes in performance if used in production pipelines.
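For example, a minimal sketch of pinning the refresh mode at creation time (the table, warehouse, and source names are hypothetical):

```sql
-- Set REFRESH_MODE explicitly so behavior doesn't shift as AUTO evolves.
-- The mode is immutable after creation.
CREATE OR REPLACE DYNAMIC TABLE sales_summary
  TARGET_LAG = '15 minutes'
  WAREHOUSE = transform_wh
  REFRESH_MODE = INCREMENTAL  -- or FULL
  AS
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region;
```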
For more information, see Best practices for optimizing performance.
Full refresh mode performance¶
A full refresh executes the query and overwrites the dynamic table’s contents with the results. The content of a dynamic table is the same regardless of whether full or incremental refresh mode is chosen.
Full refresh is typically used for complex queries or workloads where incremental refresh is less efficient.
Although full refresh is resource-intensive, it is useful when reprocessing the entire dataset is necessary, such as for unsupported incremental operations or significant data changes.
To optimize full refresh performance, treat it like any other Snowflake query, but remember the cost includes both executing the query and inserting the results, not just the query execution.
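To see what refreshes actually cost in time, you can query the refresh history (a sketch; sales_summary is the hypothetical table from the earlier example):

```sql
-- Review recent refreshes, including when each started and finished.
SELECT name, state, refresh_start_time, refresh_end_time
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(
  NAME => 'sales_summary'));
```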
Incremental refresh mode performance¶
An incremental refresh focuses on applying changes since the last refresh, making it more efficient for large datasets with small updates. The content of a dynamic table is the same regardless of the chosen refresh mode.
However, incremental refresh can be more resource-efficient because it skips reprocessing unchanged data. The decision to use incremental refresh depends on the characteristics of your workload, such as the volume and complexity of changes, and the potential performance gains in terms of speed and resource savings.
The following sections explain what makes a workload suitable for incremental refresh. If your workload doesn’t fit the conditions described in these sections, try using the full refresh mode for potentially better efficiency.
For information about optimizing incremental refresh performance, see Incremental refresh performance.
Note
The recommendations in this documentation might change with improved support for incremental refresh queries and performance over time.
Understanding incremental refresh performance¶
In an incremental refresh, most of the effort usually goes into computing changes in the dynamic table. This depends on the query and can be quite complex. A common misunderstanding is that an incremental refresh only scans changes in the source tables, not the source tables themselves. This can lead to the misconception that an incremental refresh should only do work proportional to the amount of source data that changed, which isn’t true. In reality, incremental refreshes often need to scan the source tables directly.
For example, imagine a query that does an inner join between tables A and B. If a row is inserted into table A, it must be joined with table B to compute changes in the query. This single row in A can join with many rows in B, which can mean there’s a lot of work even if there are only a few changes in the sources.
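A minimal sketch of this situation (tables a and b and their columns are hypothetical):

```sql
-- One row inserted into a can match many rows in b, so computing the
-- change to this table requires probing b, not just reading a's changes.
CREATE OR REPLACE DYNAMIC TABLE ab_joined
  TARGET_LAG = '15 minutes'
  WAREHOUSE = transform_wh
  REFRESH_MODE = INCREMENTAL
  AS
    SELECT a.id, a.val, b.detail
    FROM a
    INNER JOIN b ON a.id = b.id;
```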
This extra work can happen with various operators. Incremental refresh processes new data and skips work that’s already done. Deciding what to skip can take additional work, especially for complex queries, and different operators can skip work in different ways.
Typically, the size of changes and locality impact how much work can be skipped.
How size affects incremental refresh performance¶
The most significant factor that affects incremental refresh performance is the size of the changes in the source data. When determining the size of changes for your dynamic table refreshes, ensure that you include the copied rows in the count: a DML statement that changes some rows in a micro-partition also copies the unchanged rows in that micro-partition into a new micro-partition.
To analyze the number of changed rows in a base table, create a stream on the base table positioned at the time of the last refresh, and then query SYSTEM$STREAM_BACKLOG. For example:

```sql
-- Position the stream at the last refresh (replace the placeholder with
-- the query UUID of that refresh), then inspect the pending changes.
CREATE STREAM mystream ON TABLE mybasetable BEFORE(STATEMENT => 'last refresh UUID');
SELECT * FROM SYSTEM$STREAM_BACKLOG('mystream');
```
As an extreme example, consider the effect of deleting all of the data from a source: a full refresh just sees an empty table, which can be processed very quickly. In contrast, an incremental refresh has to process every deleted row, making it much slower.
Similar slowdowns can occur in less extreme cases too. A good guideline for a favorable workload is to keep changes in the source or target to less than 5% of the rows.
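As a rough check against this guideline, you can compare the stream’s row count to the table’s (a sketch reusing mystream and mybasetable from the earlier example; it approximates the changed fraction and ignores micro-partition copying):

```sql
-- Approximate fraction of the base table changed since the last refresh.
SELECT (SELECT COUNT(*) FROM mystream) /
       NULLIF((SELECT COUNT(*) FROM mybasetable), 0) AS changed_fraction;
```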
How locality affects incremental refresh performance¶
The second most significant factor that affects incremental refresh is locality, which refers to how closely related data or actions are along different dimensions.
For instance, if you have a table with a timestamp column, and you always insert rows with the current time in that column, then your workload has a strong locality between insertion order and the timestamp column.
Locality can show up in various forms, but a few are especially important for incremental refresh. Improving locality in any of the following areas boosts the performance of incremental refreshes, although it’s not always possible to have strong locality in all three categories.
| Locality area | Description |
| --- | --- |
| Locality between clustering and partitioning keys | When doing a partitioning operation in the dynamic table definition, it’s beneficial if the underlying sources are clustered based on these partitioning keys. For example, if you’re joining two tables using an ID, it’s better for incremental refresh performance if the tables are clustered by their respective ID columns. |
| Locality between partitioning or grouping keys and source changes | Ideally, changes to sources should only have a partitioning key in common with a small fraction of rows in the source table(s). For example, if you’re inserting rows with the current timestamp, grouping by hour would work well due to strong locality between the keys and source changes. However, if you’re inserting rows with a column value that appears in many other rows across the table, grouping by that column results in poor incremental refresh performance. |
| Locality between target table changes and clustering | When an incremental refresh applies changes to a dynamic table, updates and deletions are joined against the current state of the dynamic table. This join performs better if changes align with the clustering on the dynamic table. For example, if refreshes only update recently inserted rows, they align well with the table’s clustering. For information about how Snowflake tables are stored, see Understanding Snowflake Table Structures. To manage clustering on a table, use Automatic Clustering. |
Performance expectations for incremental refreshes of individual operators¶
The following table shows the approximate performance expectations for incremental refreshes of individual operators. Performance is measured relative to a full refresh with the assumption that only 5% of rows have changed and the refresh jobs take at least 1 minute.
Note
Due to fixed overheads (for example, query optimization, warehouse scheduling, and job cleanup) that incremental refresh doesn’t reduce, short queries (less than 10 seconds) might see smaller performance gains.
| Operator | Performance increase |
| --- | --- |
| SELECT | 10x |
| WHERE | 10x |
| FROM | 10x |
| UNION ALL | 10x |
| Scalar aggregates | 10x |
For operators that are affected by locality, the following table shows performance expectations with good and poor locality. Note that for some operators, poor locality can lead to worse performance than full refreshes.
| Operator | Locality | Performance increase |
| --- | --- | --- |
| GROUP BY | Good | 5x |
| GROUP BY | Poor | 1/3x |
| DISTINCT | Good | 5x |
| DISTINCT | Poor | 1/4x |
| OVER | Good | 2-5x |
| OVER | Poor | 1/5x |
| INNER JOIN | Good | 10x |
| INNER JOIN | Poor | 2x |
| OUTER JOIN | Good | 3x |
| OUTER JOIN | Poor | 0.1x |
For more information, see How operators incrementally refresh.
Optimizing incremental refresh mode performance for complex dynamic tables¶
Dynamic tables typically contain several operators, making it harder to predict their performance with incremental refreshes. This section explores how to handle this challenge and gives some tips for enhancing the performance of complex dynamic tables.
When there are multiple operators involved, incremental refreshes compute changes by working on each operator separately, turning it into a fragment of a query plan that can compute changes based on its inputs. For each input, this new fragment can request the input before changes, after changes, or just the changes themselves. By applying this process to every operator in the original query, you get a new query plan that computes changes using a mix of change-scans and full table scans. This plan is optimized by Snowflake’s query optimizer and executed like any other query.
When an incremental refresh doesn’t perform well, it’s usually because there are too many changes or the locality is poor. Complex queries make it more challenging to identify these issues. The performance for incremental operators usually depends on the amount of changes and the locality of their inputs. However, these inputs are the outputs of other operators, so the amount and locality of data can change as it moves through operators.
Therefore, understanding the performance of a complex incremental refresh requires considering each operator’s input separately. Here are some common scenarios and suggestions to address them:
| Scenario | Recommendation |
| --- | --- |
| You’re joining tables across multiple columns, so you can’t use CLUSTER BY on all of them simultaneously. | Prioritize clustering larger tables by keys that change frequently. For instance, in a star schema with a big dimension table, focus on clustering the dimension table. Consider creating multiple copies of the same dataset, each clustered by different keys, and using them in relevant contexts. |
| You have a GROUP BY or OVER on top of many joins. | Ensure source tables are clustered by the grouping/partitioning keys, and consider splitting the joins and the aggregation into two separate dynamic tables (see the sketch after this table). Note that outer joins interact poorly with aggregations. |
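For the second scenario, a minimal sketch of the split (all names are hypothetical): the join lives in one incremental dynamic table, and the aggregation reads from it in a second one.

```sql
-- Dynamic table 1: just the join. DOWNSTREAM lets it refresh on demand
-- from the aggregation table that depends on it.
CREATE OR REPLACE DYNAMIC TABLE order_details
  TARGET_LAG = DOWNSTREAM
  WAREHOUSE = transform_wh
  REFRESH_MODE = INCREMENTAL
  AS
    SELECT o.order_id, o.order_ts, c.region
    FROM orders o
    INNER JOIN customers c ON o.customer_id = c.customer_id;

-- Dynamic table 2: just the aggregation, over the join's output.
CREATE OR REPLACE DYNAMIC TABLE orders_by_region
  TARGET_LAG = '30 minutes'
  WAREHOUSE = transform_wh
  REFRESH_MODE = INCREMENTAL
  AS
    SELECT region,
           DATE_TRUNC('hour', order_ts) AS order_hour,
           COUNT(*) AS order_count
    FROM order_details
    GROUP BY region, DATE_TRUNC('hour', order_ts);
```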