Using change data capture with interactive tables

This topic explains how to use change data capture (CDC) with interactive tables. For general information about interactive tables and interactive warehouses, see Snowflake interactive tables and interactive warehouses.

Key benefits over the standard MERGE INTO pattern

  1. Real-time updates

  2. Significantly lower ingestion cost than the standard MERGE INTO pattern for lower-volume updates

  3. A query-level option to control deduplication behavior

When to use

Use CDC when you want to capture changes from a source system (for example, a transactional database such as MySQL or Postgres). Typical data includes user profiles, product catalogs, and latest pricing snapshots.

When NOT to use

When your main workload is transactional processing, such as processing payments or inventory updates. A CDC-enabled interactive table should NOT be your primary transactional store; in particular, multi-table transactions are not supported. Instead, use Hybrid Tables or Postgres tables for your transactional store, and use an interactive table with CDC as a mirror of that store to accelerate your analytical queries.

Prerequisites

  1. Your account must be enabled for CDC; otherwise, you'll see an error that the "VERSION BY" expression is not supported. Contact Snowflake Support to enable CDC for your account.

  2. You must use the Snowpipe Streaming high-performance architecture for ingestion.

  3. You can only stream data into CDC-enabled interactive tables; batch loading and DML statements are not supported.

How CDC works for interactive tables

The interactive table with CDC support works as follows:

  • Ingestion appends change events as new rows. Each new row must contain values for all columns, including columns that are not changing.

  • A deferred primary key defines the logical identity of each record.

  • The VERSION BY expression determines which version wins when multiple rows share the same key. When multiple rows have the same VERSION BY value, Snowflake picks one at random because they represent the same data state.

  • At query time, you can use the DEDUP_OPTION parameter to control whether Snowflake returns the latest deduplicated state, all versions, or the last merged state.

  • A background deferred merge process merges multiple versions of the same logical record into a single row, keeping the record with the highest VERSION BY value. When rows have the same VERSION BY value, the system picks one at random (because they are equivalent).
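The versioning logic above can be sketched in a few lines of Python. The following is an illustrative model only, not Snowflake's actual implementation; the `deferred_merge` helper is hypothetical:

```python
def deferred_merge(rows, key_cols, version_col):
    """Keep the row with the highest VERSION BY value per primary key.
    Ties resolve arbitrarily (here: first seen), mirroring the documented
    behavior that equal-version rows represent the same data state."""
    winners = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in winners or row[version_col] > winners[key][version_col]:
            winners[key] = row
    return list(winners.values())

# Three change events for the same logical record (order_id = 9001).
events = [
    {"order_id": 9001, "status": "PENDING",   "updated_at": "2026-02-27 07:00:00"},
    {"order_id": 9001, "status": "SHIPPED",   "updated_at": "2026-02-27 07:00:20"},
    {"order_id": 9001, "status": "DELIVERED", "updated_at": "2026-02-27 07:00:30"},
]
merged = deferred_merge(events, ["order_id"], "updated_at")
# One row survives per key, carrying the latest status.
```

Note that event arrival order does not matter: a late-arriving older version never overwrites a newer one, because only the VERSION BY value decides the winner.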

Creating a CDC-enabled interactive table

Every CDC-enabled interactive table requires these elements:

  1. A deferred PRIMARY KEY constraint that identifies the logical record (e.g., order_id).

  2. A VERSION BY expression that establishes which change is the latest version (e.g., an updated_at timestamp).

  3. A CLUSTER BY clause that aligns with your query patterns (e.g., customer_id).

  4. A VOLATILE clause for columns that change across versions (e.g., status, amount).

The general syntax is:

CREATE [ OR REPLACE ] INTERACTIVE TABLE <table_name> (
  <col_name> <col_type> [ , <col_name> <col_type> ... ]
  CONSTRAINT <constraint_name>
    PRIMARY KEY ( <col_name> [ , <col_name> ... ] ) INITIALLY DEFERRED
)
VERSION BY <version_expr>
CLUSTER BY ( <expr> [ , <expr> ... ] )
[ VOLATILE ( <col_name> [ , <col_name> ... ] ) ]
;

The following example creates a CDC-enabled interactive table named orders:

CREATE OR REPLACE INTERACTIVE TABLE orders (
  order_id INT,
  customer_id INT,
  status VARCHAR,
  amount NUMBER(10,2),
  updated_at TIMESTAMP_NTZ,
  CONSTRAINT pk_orders PRIMARY KEY (order_id) INITIALLY DEFERRED
)
VERSION BY updated_at
CLUSTER BY (order_id)
VOLATILE (status, amount, updated_at);

The preceding example declares order_id as the logical key, uses updated_at to decide which row is current, and marks status, amount, and updated_at as volatile columns.

Choosing CDC metadata

PRIMARY KEY … INITIALLY DEFERRED

Identifies the columns that define a logical record for deduplication. This is not a traditional uniqueness constraint; Snowflake does not reject duplicate keys at write time. INITIALLY DEFERRED indicates that the constraint is enforced only during the deferred merge process.

VERSION BY

Determines which row wins when multiple rows share the same primary key. Snowflake keeps the row with the highest VERSION BY value. Good choices include a source commit timestamp, a monotonically increasing sequence number, or a log sequence number.

VOLATILE

Marks columns whose values change across versions. You must include the column(s) referenced in the VERSION BY expression in the VOLATILE clause. Use this for columns such as status or account_balance when queries often filter on those columns. For more information, see When to use VOLATILE.

For guidance on choosing clustering expressions, see Clustering Keys & Clustered Tables.

Choosing a scan option at query time

When you query a CDC-enabled interactive table, use DEDUP_OPTION to control whether Snowflake returns the latest deduplicated state, all versions, or the last merged state.

SELECT * FROM orders DEDUP_OPTION = LATEST_DEDUP;
SELECT * FROM orders DEDUP_OPTION = LATEST_ALL;
SELECT * FROM orders DEDUP_OPTION = LAST_DEDUP;

Use these options as follows:

  • LATEST_DEDUP returns the newest row per key by merging the latest ingested data at query time.

  • LATEST_ALL returns all versions without deduplication/merging.

  • LAST_DEDUP returns the latest deduplicated state as of the last merge point. This is typically the fastest option, but it can be slightly stale.
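The three scan options can be modeled as a simple Python sketch. The `scan` function below is hypothetical and only illustrates the documented behaviors over a table whose state is split into already-merged rows and not-yet-merged change events:

```python
def scan(merged_rows, unmerged_rows, option):
    """Illustrative model of the three DEDUP_OPTION behaviors.
    merged_rows: deduplicated state as of the last background merge.
    unmerged_rows: change events ingested since that merge."""
    if option == "LAST_DEDUP":
        return merged_rows                  # fastest, but can be slightly stale
    if option == "LATEST_ALL":
        return merged_rows + unmerged_rows  # every version, no deduplication
    if option == "LATEST_DEDUP":
        # Deduplicate merged + unmerged rows at query time (most expensive).
        latest = {}
        for row in merged_rows + unmerged_rows:
            key = row["order_id"]
            if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
                latest[key] = row
        return list(latest.values())
    raise ValueError(f"unknown option: {option}")

merged = [{"order_id": 9001, "status": "SHIPPED",   "updated_at": "07:00:20"}]
fresh  = [{"order_id": 9001, "status": "DELIVERED", "updated_at": "07:00:30"}]

stale   = scan(merged, fresh, "LAST_DEDUP")    # 1 row, still SHIPPED
raw     = scan(merged, fresh, "LATEST_ALL")    # 2 rows, duplicates visible
current = scan(merged, fresh, "LATEST_DEDUP")  # 1 row, DELIVERED
```

The model makes the trade-off concrete: LAST_DEDUP never touches unmerged data (fast, stale), LATEST_ALL exposes every ingested version, and LATEST_DEDUP pays the merge cost at query time to return the newest state.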

Diagram comparing LAST_DEDUP, LATEST_DEDUP, and LATEST_ALL for interactive table change data capture queries.

Deduplication option      Duplicates allowed    Query performance                  Data freshness
LAST_DEDUP (default)      No                    Fastest                            A few minutes
LATEST_DEDUP              No                    Slowest                            Upon ingest
LATEST_ALL                Yes                   Slightly slower than LAST_DEDUP    Upon ingest

Example: basic deduplication

The following example shows a simple pattern where three change events arrive for the same logical row:

CREATE OR REPLACE INTERACTIVE TABLE orders (
  order_id INT,
  customer_id INT,
  status VARCHAR,
  amount NUMBER(10,2),
  updated_at TIMESTAMP_NTZ,
  CONSTRAINT pk_orders PRIMARY KEY (order_id) INITIALLY DEFERRED
)
VERSION BY updated_at
VOLATILE (status, amount, updated_at)
CLUSTER BY (order_id);

Assume the following three versions were ingested for order_id = 9001:

  • PENDING at 2026-02-27 07:00:00

  • SHIPPED at 2026-02-27 07:00:20

  • DELIVERED at 2026-02-27 07:00:30

Querying the table without specifying DEDUP_OPTION:

SELECT order_id, customer_id, status, amount, updated_at
  FROM orders
  WHERE order_id = 9001;

This returns the latest record that has been merged into the table (the same as querying with DEDUP_OPTION = LAST_DEDUP).

To return the latest record, including those that have not been merged into the table yet, query the table with DEDUP_OPTION = LATEST_DEDUP:

SELECT order_id, customer_id, status, amount, updated_at
  FROM orders DEDUP_OPTION = LATEST_DEDUP
  WHERE order_id = 9001;

Note that LATEST_DEDUP is more expensive because it must merge the latest ingested data at query time.

To return all versions, including those that have not yet been merged into the table, without deduplication or merging, use DEDUP_OPTION = LATEST_ALL:

SELECT order_id, customer_id, status, amount, updated_at
  FROM orders DEDUP_OPTION = LATEST_ALL
  WHERE order_id = 9001;

This can be useful for debugging, and for cases where you can tolerate some duplicate records but want to optimize query performance and reduce ingestion latency.

Setting a default scan option

You can set the default scan behavior at the table, schema, database, or account level. In practice, you should set the default at the table level or at the query level.

ALTER { TABLE | SCHEMA | DATABASE | ACCOUNT } <name>
  SET DEFAULT_INTERACTIVE_TABLE_SCAN_VERSION =
    { LATEST_DEDUP | LATEST_ALL | LAST_DEDUP };

For example, to set the default for one table:

ALTER TABLE orders
  SET DEFAULT_INTERACTIVE_TABLE_SCAN_VERSION = LATEST_DEDUP;

When Snowflake decides which scan behavior to use, it applies this priority order:

  1. A query-level DEDUP_OPTION value.

  2. A default value set through DEFAULT_INTERACTIVE_TABLE_SCAN_VERSION.

  3. The fallback default of LAST_DEDUP.
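The precedence rules above amount to a simple fallback chain. The `effective_scan_option` helper below is hypothetical and only restates the documented priority order:

```python
def effective_scan_option(query_option=None, default_parameter=None):
    """Resolve the scan option: a query-level DEDUP_OPTION wins, then any
    DEFAULT_INTERACTIVE_TABLE_SCAN_VERSION setting, then LAST_DEDUP."""
    return query_option or default_parameter or "LAST_DEDUP"

effective_scan_option("LATEST_ALL", "LATEST_DEDUP")  # query-level wins
effective_scan_option(None, "LATEST_DEDUP")          # parameter default applies
effective_scan_option()                              # falls back to LAST_DEDUP
```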

Example: using a composite primary key

Sometimes you might need more than one column to uniquely identify a logical row. For example, an order line might require both order_id and line_id:

CREATE OR REPLACE INTERACTIVE TABLE order_lines (
  order_id INT,
  line_id INT,
  status VARCHAR,
  updated_at TIMESTAMP_NTZ,
  CONSTRAINT pk_order_lines
    PRIMARY KEY (order_id, line_id) INITIALLY DEFERRED
)
VERSION BY updated_at
CLUSTER BY (order_id, line_id)
VOLATILE (status);

If two events arrive for (101, 1) and one event arrives for (101, 2), the following query returns one latest row for each composite key:

SELECT order_id, line_id, status, updated_at
  FROM order_lines DEDUP_OPTION = LATEST_DEDUP
  ORDER BY order_id, line_id;

In this case, (101, 1) resolves to the latest status for line 1, while (101, 2) remains a separate row. If you defined the primary key only as (order_id), Snowflake would incorrectly collapse both lines into one logical record.
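The composite-key deduplication can be sketched by keying on a tuple of columns. The `latest_per_key` helper below is hypothetical and only models the documented behavior:

```python
def latest_per_key(rows, key_cols, version_col):
    """Deduplicate on a composite key: keep the row with the highest
    VERSION BY value for each (order_id, line_id) tuple."""
    latest = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in latest or row[version_col] > latest[key][version_col]:
            latest[key] = row
    return latest

events = [
    {"order_id": 101, "line_id": 1, "status": "PENDING", "updated_at": "07:00:00"},
    {"order_id": 101, "line_id": 1, "status": "SHIPPED", "updated_at": "07:00:20"},
    {"order_id": 101, "line_id": 2, "status": "PENDING", "updated_at": "07:00:05"},
]
result = latest_per_key(events, ["order_id", "line_id"], "updated_at")
# Two logical rows survive: (101, 1) -> SHIPPED and (101, 2) -> PENDING.
# Keying on order_id alone would collapse both lines into one record.
```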

When to use VOLATILE

This section applies ONLY when you use the LATEST_DEDUP deduplication option.

A column should be VOLATILE if both of the following are true:

  1. It is not part of the primary key.

  2. Its value can change across versions for the same primary key.

Note: WHERE filtering on VOLATILE columns is less efficient than filtering on non-VOLATILE columns, so only mark a column as VOLATILE if it meets the preceding criteria.

For example, if the table is created WITHOUT the VOLATILE clause on status, assume the following two versions were ingested for order_id = 9001:

  • SHIPPED at 2026-02-27 07:00:20

  • PENDING at 2026-02-27 07:00:00

(Note: the second version is a late arrival / correction)

The following query would incorrectly return results:

SELECT order_id, customer_id, status, amount, updated_at
  FROM orders DEDUP_OPTION = LATEST_DEDUP
  WHERE order_id = 9001
  AND status = 'PENDING';

If the table is created WITH the VOLATILE clause on status, the query returns no results, which is correct because the latest version is SHIPPED. Without VOLATILE, the query effectively filters on status = 'PENDING' before deduplication, which keeps the older version. The VOLATILE hint avoids this issue.
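The filter-ordering problem can be demonstrated with a small Python model. This is an illustrative sketch of the filter-pushdown behavior, not Snowflake's actual query plan:

```python
def dedup(rows):
    """Keep the row with the highest updated_at per order_id."""
    latest = {}
    for r in rows:
        k = r["order_id"]
        if k not in latest or r["updated_at"] > latest[k]["updated_at"]:
            latest[k] = r
    return list(latest.values())

events = [
    {"order_id": 9001, "status": "SHIPPED", "updated_at": "07:00:20"},
    {"order_id": 9001, "status": "PENDING", "updated_at": "07:00:00"},  # late arrival
]

# Without VOLATILE, the filter can be pushed below deduplication: the
# stale PENDING row passes the filter and then "wins" within that subset.
filter_then_dedup = dedup([r for r in events if r["status"] == "PENDING"])

# With VOLATILE on status, deduplication runs first, then the filter:
# the surviving row is SHIPPED, so nothing matches status = 'PENDING'.
dedup_then_filter = [r for r in dedup(events) if r["status"] == "PENDING"]
```

Here `filter_then_dedup` incorrectly contains the stale PENDING row, while `dedup_then_filter` is empty, matching the correct behavior described above.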

Design guidelines

Use the following guidelines when you design CDC-enabled interactive tables:

  • Choose a primary key that matches the logical identity of the source record.

  • Choose a VERSION BY expression that increases reliably as each record changes.

  • Avoid VERSION BY expressions that can produce ambiguous ordering for the same key.

  • Mark columns that can be updated across versions as VOLATILE.

  • Use LATEST_ALL when you need to validate event delivery, ordering, or duplicate ingestion.

  • Use LATEST_DEDUP when the query must reflect the newest state.

  • Use LAST_DEDUP for broad reporting queries that can tolerate slight staleness.

Performance tip

Regular clustering key design principles also apply to CDC-enabled interactive tables. In particular, you should cluster on columns that are frequently used in WHERE clauses. For more information, see Clustering Keys & Clustered Tables. Your clustering key does NOT have to match your primary key, and it can include VOLATILE columns.

Troubleshooting

If queries that filter on an updated column return unexpected results, consider whether that column should be listed in the VOLATILE clause.

If the latest record is not selected as expected, verify that the VERSION BY expression correctly orders all changes for each key.
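One way to audit VERSION BY ordering is to check your source change feed for keys whose events share a VERSION BY value, since those ties are resolved arbitrarily. The `find_version_conflicts` diagnostic below is a hypothetical sketch:

```python
from collections import defaultdict

def find_version_conflicts(rows, key_col, version_col):
    """Flag keys whose ingested events share a VERSION BY value,
    making the winning row ambiguous."""
    versions = defaultdict(list)
    for r in rows:
        versions[r[key_col]].append(r[version_col])
    return [k for k, vs in versions.items() if len(vs) != len(set(vs))]

events = [
    {"order_id": 9001, "updated_at": "07:00:00"},
    {"order_id": 9001, "updated_at": "07:00:00"},  # duplicate version: ambiguous
    {"order_id": 9002, "updated_at": "07:00:05"},
]
conflicts = find_version_conflicts(events, "order_id", "updated_at")
# conflicts == [9001]
```

If this check flags keys, consider a higher-resolution VERSION BY expression, such as a log sequence number instead of a coarse timestamp.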

Setting up streaming ingestion for a CDC-enabled interactive table

CDC-enabled interactive tables ingest data exclusively through the Snowpipe Streaming high-performance architecture. Batch loads and DML statements aren't supported. The procedure below shows an end-to-end setup that you can run in an account with the Interactive CDC Preview enabled.

For general streaming patterns (the Kafka connector, Openflow, and default-pipe behavior), see Creating an interactive table using streaming ingestion.

Install prerequisites

Create and activate a Python virtual environment, then install the Snowpipe Streaming SDK and the Snowflake connector. Do not install packages system-wide with --break-system-packages.

python3 -m venv .venv
source .venv/bin/activate
pip install snowflake-ingest snowflake-connector-python pytest

Create the table

Create the CDC-enabled interactive table. No pipe is needed up front; when the first open_channel call targets this table, Snowflake automatically creates and manages a default pipe named ORDERS-STREAMING:

CREATE OR REPLACE INTERACTIVE TABLE orders (
  order_id    INT,
  customer_id INT,
  status      VARCHAR,
  amount      DECIMAL(10,2),
  updated_at  TIMESTAMP_NTZ,
  CONSTRAINT pk_orders PRIMARY KEY (order_id) INITIALLY DEFERRED
)
VERSION BY updated_at
CLUSTER BY (order_id)
VOLATILE (status, amount);

Note

Optional: create a named pipe instead. If you need column-mapping transformations, or want to grant, monitor, or replicate the pipe independently of the table, you can create an explicit pipe and pass its name to StreamingIngestClient as pipe_name:

CREATE OR REPLACE PIPE orders_cdc_pipe
  AS COPY INTO orders
  FROM TABLE(DATA_SOURCE(TYPE => 'STREAMING'))
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

For most CDC setups the default pipe is sufficient.

Produce CDC events

Use the Snowpipe Streaming SDK to open a channel against the orders table and append change events. Pass the table name as ORDERS-STREAMING (without a schema prefix) to use the default pipe. A single Python script can exercise the common CDC patterns (an insert, an update, and a soft-delete) for one logical row:

import uuid
from datetime import datetime, timedelta, timezone

from snowflake.ingest.streaming import StreamingIngestClient

client = StreamingIngestClient(
    client_name=f"cdc_demo_{uuid.uuid4().hex[:8]}",
    db_name="MY_DB",
    schema_name="MY_SCHEMA",
    pipe_name="ORDERS-STREAMING",  # default pipe; auto-created on first open_channel
    profile_json="profile.json",
)
channel, _ = client.open_channel("cdc_demo")

now = datetime.now(timezone.utc)
events = [
    (9001, 1, "PENDING",  10.00, now - timedelta(seconds=30)),  # insert
    (9001, 1, "SHIPPED",  10.00, now - timedelta(seconds=10)),  # update
    (9001, 1, "DELETED",  10.00, now),                          # soft-delete
]

for offset, (oid, cid, status, amount, updated_at) in enumerate(events, start=1):
    channel.append_row(
        {
            "ORDER_ID":    oid,
            "CUSTOMER_ID": cid,
            "STATUS":      status,
            "AMOUNT":      amount,
            "UPDATED_AT":  updated_at.strftime("%Y-%m-%d %H:%M:%S.%f"),
        },
        str(offset),
    )

channel.close()
client.close()

Each event includes a full row (all columns), because ingestion appends new versions rather than mutating existing rows. Snowflake uses the VERSION BY expression (updated_at) to pick the winning version per primary key.

Verify deduplication

After the producer runs, allow a few seconds for the events to land, then query the table with DEDUP_OPTION = LATEST_DEDUP and confirm that a single row remains for order_id = 9001 with STATUS = 'DELETED':

import time

import snowflake.connector

time.sleep(15)  # Wait for the stream flush.

conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    role="STREAMING_ROLE",
    warehouse="<warehouse>",
    authenticator="PROGRAMMATIC_ACCESS_TOKEN",
    token="<pat_token>",
    database="MY_DB",
    schema="MY_SCHEMA",
)
cur = conn.cursor()
cur.execute(
    "SELECT order_id, status, updated_at "
    "FROM orders DEDUP_OPTION = LATEST_DEDUP "
    "WHERE order_id = 9001"
)
rows = cur.fetchall()
assert len(rows) == 1 and rows[0][1] == "DELETED"

The snowflake.connector.connect() call shown here uses programmatic access token (PAT) authentication. For key-pair and other auth options, see Key-pair authentication and key-pair rotation.

See also