Introduction to Data Pipelines¶
Data pipelines automate many of the manual steps involved in transforming and optimizing continuous data loads. Frequently, the “raw” data is first loaded temporarily into a staging table used for interim storage and then transformed using a series of SQL statements before it is inserted into the destination reporting tables. The most efficient workflow for this process involves transforming only data that is new or modified.
Features Included in Continuous Data Pipelines¶
Snowflake provides the following features to enable continuous data pipelines:
- Continuous data loading:
Options for continuous data loading include the following:
- Continuous Data Transformation:
Dynamic tables are declarative automated data pipelines that simplify data engineering and provide a simple way to transform data. Rather than defining data transformation steps as a series of tasks, you can simply define the end state of the transformation.
Refer to Dynamic tables for more information and details.
- Change data tracking:
A stream object records the delta of change data capture (CDC) information for a table (such as a staging table), including inserts and other data manipulation language (DML) changes. A stream allows querying and consuming a set of changes to a table, at the row level, between two transactional points of time.
In a continuous data pipeline, table streams record when staging tables and any downstream tables are populated with data from business applications using continuous data loading and are ready for further processing using SQL statements.
For more information, see Change Tracking Using Table Streams.
- Recurring tasks:
A task object defines a recurring schedule for executing a SQL statement, including statements that call stored procedures. Tasks can be chained together for successive execution to support more complex periodic processing.
Tasks may optionally use table streams to provide a convenient way to continuously process new or changed data. A task can transform new or changed rows that a stream surfaces. Each time a task is scheduled to run, it can verify whether a stream contains change data for a table (using SYSTEM$STREAM_HAS_DATA) and either consume the change data or skip the current run if no change data exists.
Users can define a simple tree-like structure of tasks that executes consecutive SQL statements to process data and move it to various destination tables.
For more information, see Executing SQL statements on a schedule using tasks.