SnowConvert AI: Data validation [Preview]¶
Data Validation uses the `data-validation` skill from the migration plugin. For more details, contact snowconvert-support@snowflake.com.
The Data Validation feature of SnowConvert AI provides a fault-tolerant, scalable way to verify that data migrated into Snowflake matches the data in the original source system. It runs as a Cloud Data Validation workflow on the same infrastructure used by Data Migration, so you can migrate and validate with the same Orchestrator and Workers.
Cloud Data Validation is designed for migration scenarios where you are moving data from a system that you plan to decommission and need confidence that the migrated data is correct before cutting over. Supported source platforms are SQL Server, Amazon Redshift, and Teradata.
Architecture overview¶
Cloud Data Validation uses the same two components as Data Migration: an Orchestrator and one or more Workers.
- The Orchestrator connects to the Snowflake account. It requires privileges to create and operate the `SNOWCONVERT_AI` database, where workflow, task, and validation metadata is stored.
- One or more Workers connect to both the source system and the Snowflake account. Workers run validation queries on both sides, export intermediate results, upload them to Snowflake, and write their outcomes to shared results tables. Workers pick up tasks created by the Orchestrator and process them in parallel.
- Validation results (schema, metrics, and row-level outcomes) are ingested into the shared results tables via Snowpipe by default and then evaluated by the Orchestrator.
Each Worker that executes validation tasks must have the validation runtime available in the environment that the migration skill or `scai` prepares for that worker process. Workers that only run data-migration tasks can skip that runtime; they will not pick up `data_validation` tasks from the queue.
Deployment options¶
The Orchestrator and Workers can be deployed in multiple ways:
- Both on Snowpark Container Services (in the Snowflake account).
- Both in the customer’s environment, including custom hardware, virtual machines, or containers.
- Orchestrator on Snowpark Container Services and Workers in the customer’s environment, or the other way around.
The following requirements apply to the environment where the Orchestrator and Workers run:
- Use the Snowflake Migration Agent with Cortex Code, or the SnowConvert AI CLI (`scai`) from a SnowConvert AI project. The skill and CLI install and manage the underlying runtime when needed.
- Workers typically require an ODBC driver to connect to the source system. For Teradata, the pure-Python `teradatasql` driver is supported and preferred when the skill or environment provides it.
- The Orchestrator must be able to connect to the Snowflake account using a role that has privileges to create the `SNOWCONVERT_AI` database and create schemas and objects within it.
Validation levels¶
Cloud Data Validation performs comparisons between each source table (or view) and its corresponding target in Snowflake at three increasingly detailed levels. Each level can be enabled or disabled independently in the workflow configuration.
Schema validation (L1)¶
Schema validation confirms that the structure of each migrated table is preserved in Snowflake. It compares the following attributes between source and target:
- Table name
- Column names
- Ordinal position of each column
- Data types
- Character maximum length for text columns
- Numeric precision and scale for numeric columns
- Row count
Metrics validation (L2)¶
Metrics validation confirms that aggregate statistics of the migrated data match the original source. Specific metrics vary by column data type, but metrics validation typically compares:
- Minimum value
- Maximum value
- Average
- Null count
- Distinct count
- Standard deviation
- Variance
Row validation (L3)¶
Row validation performs row-level or cell-level comparison between source and target. Configure the mode with `validation_configuration.row_validation_mode` (see Validation configuration):

- `row` (default): MD5-chunked whole-row comparison using `index_column_list` alignment.
- `cell`: cell-level comparison with per-column mismatch reporting.
- `hybrid`: a two-phase flow in which a row-level fingerprint runs first, followed by cell drilldown on partitions that fail the first phase (reduces cost versus pure `cell` on large tables when supported).
Row validation is disabled by default and is typically applied only to the tables where it is needed because it is the most resource-intensive level.
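For example, a minimal sketch of the global `validation_configuration` block that enables hybrid row validation (all fields shown are documented in the configuration reference below; values are illustrative):

```json
{
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": true,
    "row_validation_mode": "hybrid",
    "max_failed_rows_number": 100
  }
}
```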
Prerequisites¶
Before you use Cloud Data Validation, make sure the following are in place:
- SnowConvert AI project and skill: Cortex Code with the Snowflake Migration Agent, or the SnowConvert AI CLI (`scai`) with a migration project. Use the `data-validation` skill in the migration plugin for guided validation, or the `scai data validate …` commands on this page. Workers that run validation tasks receive the validation runtime through the same install path the skill or `scai` uses for that worker process.
- Snowflake access: Connections for the Orchestrator and Workers in your Snowflake `config.toml` or `connections.toml`, using a role that can create the `SNOWCONVERT_AI` database and its objects. The first time the Orchestrator starts, it creates that database and related resources if they don’t exist yet. On later runs, use a role that can administer `SNOWCONVERT_AI` and its objects; sticking with the role you used for the initial creation is the simplest way to avoid permission issues.
- Source connectivity: For SQL Server and Redshift sources, an ODBC driver on the machine where Workers run. For Teradata, prefer the `teradatasql` driver when provided by your environment, or configure `pyodbc` with the Teradata ODBC driver. Programmatic Access Tokens (PATs) are recommended for Snowflake connections; see Connecting to Snowflake with a PAT.
- Target data available: The tables you want to validate must already exist in Snowflake, either because you migrated them using Data Migration or by another means. For accurate validation, don’t alter the migrated data between the migration and the validation run.
- Snowpark Container Services (optional): If you deploy the Orchestrator or Workers on Snowflake compute, your account needs SPCS. See the Snowpark Container Services overview. Running both components outside Snowflake doesn’t require SPCS.
Source-platform specifics¶
Workers reuse the same `[connections.source.*]` TOML as data migration. The tabs below call out validation-specific behavior and common source-side deployment patterns. For full connectivity examples (including IAM Redshift, Oracle, and Teradata `regular`/`write_nos`/`tpt` migration settings), see SnowConvert AI: Data migration.
- Connectivity: ODBC to SQL Server from the Worker host; follow the same driver and encryption guidance as in SnowConvert AI: Data migration (SQL Server tab).
- Partitioning: Use `column_names_to_partition_by` and `target_partition_size_mb`/`target_partition_size_rows` in the validation JSON to keep L2/L3 work within reasonable bounds.
- L3 alignment: Set `index_column_list` (and `target_index_column_list` when names differ) so row and hybrid modes can align rows deterministically.
- Connectivity: Standard or IAM Redshift profiles in Worker TOML match migration; see the Redshift tab under SnowConvert AI: Data migration.
- After UNLOAD migrations: If tables were loaded via `UNLOAD` to S3, validation still runs SQL against live Redshift for source-side metrics and row checks. Ensure the Worker can reach the cluster and that result sets for large partitions stay within your timeout and spool limits.
- Targets in Snowflake: Validation compares whatever is in Snowflake (native or Iceberg). Iceberg targets do not change L2/L3 SQL on the Redshift side; they only affect how results land in Snowflake metadata and storage.
| Topic | Data migration (load) | Cloud data validation |
|---|---|---|
| Purpose | Move data with `extraction.strategy` `regular`, `write_nos`, or `tpt` | Compare live Teradata tables/views to Snowflake with L1/L2/L3 |
| Teradata connection | Required on Workers that run migration tasks | Required on Workers that run validation tasks (same `teradatasql`/ODBC patterns) |
| `write_nos_*` TOML | Required when strategy is `write_nos` | Not required for validation-only workloads (validation does not execute `WRITE_NOS`) |
| `tbuild`/TTU | Required on Workers that can receive `tpt` migration tasks | Not required for validation-only Workers (L2/L3 use SQL, not TPT) |
Connectivity
Match the guidance in SnowConvert AI: Data migration (Teradata tab): prefer `teradatasql` when the environment provides it; otherwise set `odbc_driver` to the exact registered driver name, with optional `dbc_name`, port 1025 by default, and optional authentication (TD2, LDAP, KRB5).
After a migration that used `tpt` or `write_nos`
The migrated data in Snowflake was produced by those extraction paths, but validation still reads the source over SQL (metrics and row/cell plans). You do not need to re-enable `write_nos_*` or install `tbuild` on a host solely to run validation, unless the same Worker process also executes migration `tpt` or `write_nos` tasks. Ensure the Teradata objects you validate are still reachable and representative (same database, schema, and table or view names as in the validation workflow).
Schema validation on views (L1)
For Teradata views, L1 is a reduced comparison (column existence and data type via `HELP COLUMN` metadata). Precision, scale, length, nullability, and ordinal checks are not available for Teradata views; use L2/L3 for deeper assurance.
Partitioning (L2/L3)
Use `column_names_to_partition_by` and `target_partition_size_mb`/`target_partition_size_rows` so wide tables do not time out; large `tpt`/`write_nos` migrations do not change how validation issues source SQL.
Warning
For accurate validation and to avoid false negatives, don’t alter the migrated data during the validation process.
Setup¶
Installation¶
Follow the Snowflake Migration Agent setup in Cortex Code so the migration plugin, `scai`, and worker dependencies (including validation components when you run validation workflows) are installed for you. For manual CLI-only use, install the SnowConvert AI CLI and work from a project directory; `scai data validate start` and related commands resolve missing worker-side components when orchestrating local workers.
Workers that pick up only data-migration tasks do not require the validation runtime; workers that execute data-validation tasks must have validation support available in the same environment the skill or scai prepares for that worker process.
Usage¶
To validate migrated data using this solution, complete the following high-level steps:
- Start the Orchestrator.
- Start the Workers.
- Create a Cloud Data Validation Workflow.
- Monitor the validation workflow until completion.
A Cloud Data Validation Workflow is a job submitted to the system that describes which tables to validate, at which levels, and with which comparison rules. You can submit multiple workflows simultaneously and monitor them. The Orchestrator breaks each workflow into smaller tasks, one or more per table, and dispatches them to available Workers.
The Orchestrator that runs validation is the same process that runs Data Migration. If you have already set up and started an Orchestrator and Workers for migration, you can reuse them for validation: ensure Workers are started with a configuration and environment that includes validation support (the `data-validation` skill, or `scai data validate start`/`create-workflow` with `--start-worker`, handles this when running locally), then create a validation workflow as described in Creating a Cloud Data Validation Workflow.
Ask the migration agent for validation (for example, “Run cloud data validation for my project”), or use the `scai data validate …` commands below. See the Snowflake Migration Agent for guided prompts.
Using SCAI CLI¶
Cloud validation commands mirror migration under `scai data validate …`, sharing the same `scai data worker start` and `scai data orchestrator` branches. Deprecated names still run with a short warning:
| Deprecated (still runs) | Preferred command |
|---|---|
| `scai data generate-cloud-validation-config` | `scai data validate generate-config` |
| `scai data cloud-validate` | `scai data validate create-workflow` |
| `scai data cloud-validate-status` | `scai data validate status` |
| `scai data cloud-list-validations` | `scai data validate list` |
Generate a JSON validation workflow file from the SCAI project (default `data-validation-config.json`):
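For example, from the project directory:

```bash
# Writes data-validation-config.json into the current SnowConvert AI project
scai data validate generate-config
```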
Create a validation workflow (options mirror migration’s `create-workflow`, but the workflow file must be JSON; `.json` only):
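For example (the platform value is illustrative and must match the `source_platform` field in the file):

```bash
scai data validate create-workflow --config data-validation-config.json --source-platform teradata
```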
Local all-in-one project run (auto-generates config when missing unless `--no-auto-config` is set; forces local Worker + Orchestrator + watch, same as migration’s `migrate start`):
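For example:

```bash
# Generates config if missing, then runs a local Worker and Orchestrator and watches
scai data validate start
```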
Status and listing:
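For example (the filter values are illustrative):

```bash
scai data validate status
scai data validate list --status completed --limit 10
```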
Use `scai data orchestrator setup|stop` and `scai data worker start` exactly as described in SnowConvert AI: Data migration.
Orchestrator metadata and environment¶
By default, workflow and task metadata objects are created under `SNOWCONVERT_AI.DATA_MIGRATION`, and data validation metadata is created under `SNOWCONVERT_AI.DATA_VALIDATION`. When you run a local Orchestrator through `scai` (for example, `scai data validate start` with an implied orchestrator), the CLI sets the Snowflake connection from your project. Advanced deployments can override the metadata location with the Orchestrator environment variables `CUSTOM_SNOWFLAKE_DATABASE_FOR_METADATA`, `CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_MIGRATION_METADATA`, and `CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_VALIDATION_METADATA`; if you do, set matching `snowflake_database_for_metadata` and `snowflake_schema_for_data_migration_metadata` on each Worker (see Worker configuration).
The Orchestrator runs until you stop it. Cloud Data Validation Workflows require an active Orchestrator to complete. The Orchestrator can be safely stopped and resumed; ongoing workflows resume at that point.
Creating a Cloud Data Validation Workflow¶
Recommended: From your SnowConvert AI project directory, generate a workflow JSON file with `scai data validate generate-config` and submit it with `scai data validate create-workflow`, as shown in Using SCAI CLI.
Or use a single local run with `scai data validate start`, which generates the config if needed, starts a local Worker and Orchestrator, and watches to completion.
Keep the following in mind:
- The validation configuration specification is in Validation workflow configuration reference.
- `scai data validate create-workflow` accepts a `.json` workflow file (the default from `generate-config` is `data-validation-config.json`). The `source_platform` field in that file must match your source (`sqlserver`, `redshift`, or `teradata`). Cloud data migration additionally supports Oracle; validation does not at this time.
- Workflow names must be alphanumeric and cannot start with a digit.
New workflow rows are inserted into the `WORKFLOW` table in the data migration metadata schema (default `SNOWCONVERT_AI.DATA_MIGRATION`) with `WORKFLOW_TYPE` set to `data-validation`. Validation results and related objects are stored under the data validation metadata schema (default `SNOWCONVERT_AI.DATA_VALIDATION`).
To migrate data before validating, use the cloud migration flow in SnowConvert AI: Data migration (`scai data migrate generate-config`, `create-workflow`, or `start`).
Monitoring a Cloud Data Validation Workflow¶
Each workflow goes through different stages throughout its lifecycle:
- Pending: No tasks have been created for this workflow yet.
- Executing: Tasks have been created for this workflow and there are still tasks that haven’t reached a terminal state (`COMPLETED` or `FAILED`).
- Completed: All tasks have reached a terminal state (`COMPLETED` or `FAILED`).
In the data validation metadata schema (default `SNOWCONVERT_AI.DATA_VALIDATION`), the following views can be queried to understand the status of validation workflows:
| View | Description |
|---|---|
| `TABLE_PROGRESS` | One row per validated table. Summarizes overall validation status. Can be filtered by `WORKFLOW_ID`. |
| `TABLE_PROGRESS_DETAIL` | Per-table breakdown with partition-level L2/L3 status (`VALID`, `INVALID`, `EXECUTION_ERROR`). Can be filtered by `WORKFLOW_ID`. |
| `DATA_VALIDATION_ERROR` | Errors encountered during validation. Can be filtered by `WORKFLOW_ID`. |
| `DATA_VALIDATION_WARNING` | Non-fatal warnings, for example, unsupported column types or metric exclusions. Can be filtered by `WORKFLOW_ID`. |
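For example, to check per-table status for a single workflow (the workflow ID is a placeholder):

```sql
-- One row per validated table for the given workflow
SELECT *
FROM SNOWCONVERT_AI.DATA_VALIDATION.TABLE_PROGRESS
WHERE WORKFLOW_ID = '<workflow_id>';
```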
In the same schema, the `DATA_VALIDATION_DASHBOARD` Streamlit dashboard provides a visual overview of validation progress and results, including a Table Progress tab that aggregates the views above.
You can also inspect validation queries in `QUERY_HISTORY` using the `QUERY_TAG` values set by the Orchestrator and Workers. See Query tagging.
Validation outcomes are classified into three categories:
| Category | Description |
|---|---|
| OK | Values match exactly between the source database and Snowflake. |
| Warning | The Snowflake table has minor differences that don’t affect the data (for example, higher numeric precision). |
| Error | Values don’t match between the original database and the Snowflake database. |
Considerations and recommendations¶
Connecting to Snowflake with a PAT¶
Use Programmatic Access Tokens (PATs) for connections used by the Orchestrator and Workers. This avoids the need to constantly authenticate through the browser or with an authenticator app. You need to establish a Network Policy or temporarily bypass the requirement for a Network Policy (which you can do from Snowsight).
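For example, a `connections.toml` profile that uses a PAT in place of a password might look like the following sketch (profile, account, and object names are illustrative):

```toml
[snowconvert_validation]
account = "myorg-myaccount"
user = "VALIDATION_USER"
password = "<programmatic-access-token>"  # PAT supplied as the password
role = "SNOWCONVERT_ROLE"
warehouse = "VALIDATION_WH"
```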
Running Orchestrator and Workers on SPCS¶
To leverage Snowflake compute for these tasks:
- Prepare Docker images that bundle the Orchestrator and Worker runtimes and configuration expected by your SnowConvert AI release (see Snowflake guidance for SPCS service images).
- Push those Docker images to an Image Repository in Snowflake.
- Execute the Orchestrator and/or Worker images using Snowpark Container Services.
Keep the following in mind:
- Run them as Services, not Jobs.
- You can run only one component (Orchestrator or Workers) in SPCS and the other on a different platform.
- Monitor the SPCS service and suspend it when it isn’t being used.
- Depending on the network configuration of the source system, you might need to configure an External Access Integration so that these services can connect to your source system.
Initial testing¶
For an early test run, use a separate validation configuration whose `tables` array lists only the table or small set of tables you want to validate. On each of those entries:

- Set `where_clause` (and optionally `target_where_clause`) to an SQL-like predicate so only a subset of rows is validated, for example, a bounded primary-key range or a narrow date range in the source dialect.
- Keep `validation_configuration.row_validation` disabled for the first run, and enable it later on a smaller subset of tables.
- Use a small `target_partition_size_mb` or `target_partition_size_rows` to keep partitions tiny during the test.
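A minimal test configuration might look like this sketch (the table name, predicate, and sizes are illustrative; all fields are documented in the configuration reference below):

```json
{
  "source_platform": "sqlserver",
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": false
  },
  "tables": [
    {
      "fully_qualified_name": "SalesDB.dbo.Orders",
      "where_clause": "OrderID BETWEEN 1 AND 100000",
      "target_partition_size_rows": 10000
    }
  ]
}
```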
After you confirm connectivity and results, create your full workflow: remove or relax `where_clause`, adjust partition sizes, and enable row validation only on the tables where it is needed.
Managing Workers¶
The time it takes to complete a validation workflow depends on many variables, but the number of Workers (and threads per Worker) has the greatest impact, as it determines how many validation tasks can be executed in parallel. Consider the following:
- You don’t need to run two Workers on the same machine. If you want more parallelism on a single machine, increase the thread count in the Worker TOML (`max_parallel_tasks` under `[application]`) or use the options your skill or `scai` run exposes for local workers.
- Network bandwidth greatly affects Worker speed and is shared between the threads of a Worker.
- Even with many Workers and threads running in parallel, the source system might not have enough resources to handle the load.
- Keep a low Worker count to avoid overloading your source system.
- Consider stopping some or all Workers when the source system is already under heavy load from other operations.
Query tagging¶
Both the Orchestrator and the Worker automatically set Snowflake’s `QUERY_TAG` session parameter on every query they submit. Tags are compact JSON strings containing identifiers such as the workflow ID, task ID, and component version. You can use these tags to filter and attribute validation queries in `QUERY_HISTORY`:
| Tag key | Present on | Description |
|---|---|---|
| `DMVF_VERSION` | Infrastructure queries | Component package version. |
| `DMVF_WORKFLOW_ID` | Task-processing queries | Workflow that originated the task. |
| `DMVF_TASK_ID` | Task-processing queries | Individual task identifier. |
| `DMVF_ORCHESTRATOR_VERSION` | Orchestrator task-processing queries | Orchestrator package version. |
| `DMVF_WORKER_VERSION` | Worker task-processing queries | Worker package version. |
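For example, to attribute recent validation queries to one workflow (a sketch; because `QUERY_TAG` holds a compact JSON string, it is matched here as a substring, and the workflow ID is a placeholder):

```sql
SELECT QUERY_ID, START_TIME, TOTAL_ELAPSED_TIME, QUERY_TAG
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_TAG LIKE '%DMVF_WORKFLOW_ID%'
  AND QUERY_TAG LIKE '%<workflow_id>%'
ORDER BY START_TIME DESC;
```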
Configuration reference¶
Validation workflow configuration reference¶
The validation workflow configuration file is a JSON object. The following sections describe its structure and properties.
Note
Names that require quoting (or brackets) must be manually quoted as they would normally be in JSON. For example: "tableName": "\"MyCaseSensitiveTable\"".
Top-level object¶
| Property | Type | Required | Description |
|---|---|---|---|
| `source_platform` | String | Yes | Source dialect identifier: `sqlserver`, `redshift`, or `teradata`. Must match the `--source-platform` argument used to create the workflow. |
| `target_platform` | String | No | Defaults to Snowflake. |
| `target_database` | String | No | Default target database name for tables that don’t specify one. |
| `validation_configuration` | Object | No | Global validation levels and options. See Validation configuration. |
| `comparison_configuration` | Object | No | Numeric tolerance and optional type mapping file. See Comparison configuration. |
| `database_mappings` | Object | No | Map of source database names to Snowflake database names. |
| `schema_mappings` | Object | No | Map of source schema names to Snowflake schema names. |
| `tables` | Array | Yes | At least one table to validate. See Per-table and per-view entry. |
| `views` | Array | No | Additional view entries. Uses the same shape as `tables`. |
| `use_snowflake_compute` | Boolean | No | When `true`, enables Snowflake-side computation paths where supported. Default `false`. |
| `target_partition_size_rows` | Integer | No | Desired rows per partition. Mutually exclusive with `target_partition_size_mb`. Must be greater than 0. Default is 200 MB when both are omitted. |
| `target_partition_size_mb` | Integer | No | Desired MB per partition. Mutually exclusive with `target_partition_size_rows`. Must be greater than 0. Default is 200 MB when both are omitted. |
| `use_snowpipe_for_results` | Boolean | No | When `true` (default), L2/L3 validation results are ingested into the shared results tables via Snowpipe. Set to `false` to fall back to per-partition `COPY INTO` tasks. |
Validation configuration¶
The `validation_configuration` object sets global defaults for which validation levels to run. Any field set here can be overridden per table by nesting a `validation_configuration` object on that table entry.
When `validation_configuration` is omitted, the Orchestrator applies these defaults: schema validation and metrics validation are enabled; row validation is disabled; `row_validation_mode` defaults to `row`; `continue_on_failure` defaults to `false`; `max_failed_rows_number` defaults to 100; `exclude_metrics` defaults to `false`; `apply_metric_column_modifier` defaults to `true`.
| Property | Type | Description |
|---|---|---|
| `schema_validation` | Boolean | Level 1: schema and column consistency checks. |
| `metrics_validation` | Boolean | Level 2: statistical metrics comparison. |
| `row_validation` | Boolean | Level 3: row-level or cell-level data comparison. |
| `row_validation_mode` | String | For row validation: `row`, `cell`, or `hybrid` (see Row validation (L3)). |
| `continue_on_failure` | Boolean | Whether to continue to the next validation level after a failure. |
| `max_failed_rows_number` | Integer | Cap on failed rows reported for L3 validation. Must be greater than 0 when set. |
| `exclude_metrics` | Boolean | Whether to exclude unsupported metric columns. |
| `apply_metric_column_modifier` | Boolean | Whether to apply metric column modifiers. |
Comparison configuration¶
The `comparison_configuration` object controls numeric tolerance and optional type-mapping overrides used during comparisons.
| Property | Type | Description |
|---|---|---|
| `tolerance` | Number | Numeric comparison tolerance for metrics. Must be greater than 0 when set. Default 0.001 when omitted. |
| `type_mapping_file_path` | String | Optional path to a custom type mapping file for comparisons. |
Per-table and per-view entry¶
Each object in `tables` (or `views`) describes one database object to validate.
| Property | Type | Required | Description |
|---|---|---|---|
| `fully_qualified_name` | String | Yes | Source object name. The exact format depends on the source platform. |
| `use_column_selection_as_exclude_list` | Boolean | No | Default `false`. |
| `column_selection_list` | String[] | No | Columns to include or exclude, depending on `use_column_selection_as_exclude_list`. |
| `target_name` | String | No | Target object name override. |
| `target_database` | String | No | Per-table target database override. |
| `target_schema` | String | No | Per-table target schema override. |
| `where_clause` | String | No | Filter on the source side. |
| `target_where_clause` | String | No | Filter on the target side. |
| `index_column_list` | String[] | No | Columns used to align rows on the source. |
| `target_index_column_list` | String[] | No | Columns used to align rows on the target. |
| `column_mappings` | Object | No | Map of source column name to target column name. |
| `is_case_sensitive` | Boolean | No | Case sensitivity for identifiers. |
| `chunk_number` | Integer | No | Chunking hint for validation. Must be greater than 0 when set. |
| `max_failed_rows_number` | Integer | No | Overrides the global cap for this object. |
| `exclude_metrics` | Boolean | No | Per-object metrics exclusion override. |
| `apply_metric_column_modifier` | Boolean | No | Per-object modifier override. |
| `object_type` | String | No | Typically `TABLE` or `VIEW`. |
| `column_names_to_partition_by` | String[] | No | Columns used for range-based (`NTILE`) partitioning during validation. Without this, the table is processed as a single partition. |
| `target_partition_size_rows` | Integer | No | Per-table override for desired rows per partition. Mutually exclusive with `target_partition_size_mb`. Must be greater than 0. |
| `target_partition_size_mb` | Integer | No | Per-table override for desired MB per partition. Mutually exclusive with `target_partition_size_rows`. Must be greater than 0. |
| `validation_configuration` | Object | No | Nested object with the same fields as the global `validation_configuration`, overriding defaults for this object only. |
Partitioning¶
When `column_names_to_partition_by` is set, the Orchestrator splits the table into range-based partitions. Cloud Data Validation and Data Migration share the same sizing logic:
1. Compute a target rows-per-partition from whichever user setting is provided. The two settings are mutually exclusive:
   - `target_partition_size_rows` is used as-is.
   - `target_partition_size_mb` is converted to rows using `target_mb / avg_row_mb`.
   - If neither is set, Cloud Data Validation defaults to 200 MB per partition.
2. Apply an internal cap. System-imposed maximums (not user-configurable) limit partition size to safe infrastructure bounds.
3. Derive the partition count as `ceil(row_count / effective_rows_per_partition)`, or 1 when the entire table fits in a single partition.
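For example, with `target_partition_size_mb = 200` and an average row size of 0.5 KB (0.0005 MB), the target is 200 / 0.0005 = 400,000 rows per partition, so a 10-million-row table is split into `ceil(10000000 / 400000)` = 25 partitions, subject to the internal cap.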
Example: SQL Server validation¶
The following configuration runs schema and metrics validation on two SQL Server tables, overrides defaults on the second table, and enables row-level validation there:
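A sketch of such a file (database, schema, table, and column names are illustrative; all fields are documented in the configuration reference above):

```json
{
  "source_platform": "sqlserver",
  "target_database": "ANALYTICS",
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": false
  },
  "tables": [
    {
      "fully_qualified_name": "SalesDB.dbo.Customers"
    },
    {
      "fully_qualified_name": "SalesDB.dbo.Orders",
      "index_column_list": ["OrderID"],
      "column_names_to_partition_by": ["OrderID"],
      "target_partition_size_rows": 500000,
      "validation_configuration": {
        "schema_validation": true,
        "metrics_validation": true,
        "row_validation": true,
        "row_validation_mode": "row"
      }
    }
  ]
}
```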
Example: Redshift validation after UNLOAD migration¶
The following configuration validates Redshift tables that were migrated using the UNLOAD extraction strategy. Schema and metrics validation are enabled globally, and row validation is enabled on a single high-value table:
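A sketch (the Redshift object names and the mapped Snowflake names are illustrative):

```json
{
  "source_platform": "redshift",
  "database_mappings": { "analytics": "ANALYTICS" },
  "schema_mappings": { "public": "PUBLIC" },
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": false
  },
  "tables": [
    { "fully_qualified_name": "analytics.public.page_views" },
    {
      "fully_qualified_name": "analytics.public.orders",
      "index_column_list": ["order_id"],
      "column_names_to_partition_by": ["order_id"],
      "validation_configuration": {
        "row_validation": true,
        "row_validation_mode": "hybrid"
      }
    }
  ]
}
```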
For the corresponding migration workflow, see Redshift UNLOAD in the Data Migration page.
Example: Teradata validation¶
The following configuration validates Teradata tables using schema and metrics validation, with a custom tolerance and per-table partition sizing:
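A sketch (object names are illustrative; `tolerance` and the partition sizes are the custom settings mentioned above):

```json
{
  "source_platform": "teradata",
  "comparison_configuration": { "tolerance": 0.01 },
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": false
  },
  "tables": [
    {
      "fully_qualified_name": "SALES_DB.ORDERS",
      "column_names_to_partition_by": ["ORDER_DATE"],
      "target_partition_size_mb": 100
    },
    {
      "fully_qualified_name": "SALES_DB.CUSTOMERS",
      "target_partition_size_rows": 1000000
    }
  ]
}
```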
Worker configuration¶
The Worker configuration file uses TOML format and is shared between Data Migration and Cloud Data Validation. Workers that execute validation tasks must be started through a path that includes validation support (for example, the `data-validation` skill, or `scai data validate start`/`create-workflow` with `--start-worker`).
| Section | Property | Type | Description |
|---|---|---|---|
| Top level | `selected_task_source` | String | Required for cloud workflows. Use `"snowflake_stored_procedure"`. |
| `[task_source.snowflake_stored_procedure]` | `connection_name` | String | Snowflake profile used for task-queue stored procedures, or `"@SPCS_CONNECTION"` on SPCS. |
| `[application]` | `max_parallel_tasks` | Integer | Maximum parallel tasks (threads). |
| `[application]` | `task_fetch_interval` | Integer | Seconds between idle polls for new tasks. |
| `[application]` | `lease_refresh_interval` | Integer | Optional. Seconds between lease renewals (default 120). |
| `[application]` | `affinity` | String | Optional. Worker affinity for task routing. |
| `[application]` | `snowflake_database_for_metadata` | String | Optional. Task-queue database (default `SNOWCONVERT_AI`). Must match `CUSTOM_SNOWFLAKE_DATABASE_FOR_METADATA` on the Orchestrator if overridden. |
| `[application]` | `snowflake_schema_for_data_migration_metadata` | String | Optional. Task-queue schema (default `DATA_MIGRATION`). Must match `CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_MIGRATION_METADATA` if overridden. |
| `[application]` | `local_results_directory` | String | Optional. Export base directory (default `~/.data_exchange_agent/result_data`; expanded to an absolute path at load). |
| `[connections.source.*]` | (per engine) | Object | Source connection(s). |
| `[connections.target.snowflake_connection_name]` | `connection_name` | String | Snowflake profile for data sessions (validation queries, uploads). |
When `selected_task_source` is `"snowflake_stored_procedure"`, the Worker calls task-queue stored procedures using `application.snowflake_database_for_metadata` and `application.snowflake_schema_for_data_migration_metadata`. These values are independent of the Snowflake session defaults (`SNOWFLAKE_DATABASE`, `SNOWFLAKE_SCHEMA`) in the connection profile. Validation results metadata lives under `SNOWCONVERT_AI.DATA_VALIDATION` by default (`CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_VALIDATION_METADATA` on the Orchestrator); Workers do not set that schema in TOML. They reach it through the same Snowflake connection and orchestrated tasks.
An example configuration file looks like this:
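The sketch below uses only keys documented in the table above; the profile names are illustrative, and the engine-specific source keys are left as a placeholder (see the source connection examples that follow):

```toml
selected_task_source = "snowflake_stored_procedure"

[task_source.snowflake_stored_procedure]
connection_name = "my_snowflake_profile"

[application]
max_parallel_tasks = 4
task_fetch_interval = 10

[connections.source.sqlserver]
# Engine-specific keys; see the source connection examples below.

[connections.target.snowflake_connection_name]
connection_name = "my_snowflake_profile"
```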
Note
Only one source connection is needed.
Source connection configuration examples¶
The following examples show the supported source connection types for Cloud Data Validation.
1. SQL Server (standard authentication)
The Worker automatically detects the best available ODBC driver for SQL Server, preferring newer versions (ODBC Driver 18 > 17 > 13 > 11). To manually select a driver, set `odbc_driver` to the exact name returned by `pyodbc.drivers()`:
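A sketch; only `odbc_driver`, `encrypt`, and `trust_server_certificate` are documented on this page, and the remaining key names are assumptions for illustration:

```toml
[connections.source.sqlserver]
# Key names other than odbc_driver/encrypt/trust_server_certificate are illustrative
server = "sqlserver.example.com"
database = "SalesDB"
user = "validation_user"
password = "********"
odbc_driver = "ODBC Driver 18 for SQL Server"
encrypt = true
trust_server_certificate = false
```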
The `encrypt` and `trust_server_certificate` parameters are optional. When omitted, the ODBC driver uses its default behavior:

- ODBC Driver 17 and below: encryption is disabled by default.
- ODBC Driver 18 and above: encryption is mandatory by default.

For development environments or SQL Servers without encryption support, either omit the encryption parameters or set `encrypt = false`.
2. Amazon Redshift (IAM authentication)
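A sketch only; the IAM profile keys below are assumptions for illustration (see the Redshift tab in SnowConvert AI: Data migration for the authoritative key names):

```toml
[connections.source.redshift]
# Illustrative IAM-based profile; key names are assumptions
host = "my-cluster.abc123.us-east-1.redshift.amazonaws.com"
port = 5439
database = "analytics"
iam = true
cluster_identifier = "my-cluster"
region = "us-east-1"
db_user = "validation_user"
```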
3. Amazon Redshift (standard authentication)
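Likewise a sketch with assumed key names (standard user/password authentication):

```toml
[connections.source.redshift]
# Illustrative standard-authentication profile; key names are assumptions
host = "my-cluster.abc123.us-east-1.redshift.amazonaws.com"
port = 5439
database = "analytics"
user = "validation_user"
password = "********"
```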
4. Teradata
The Worker supports two Teradata drivers and automatically selects the best one available:
- `teradatasql` (preferred). Pure-Python driver; no OS-level ODBC installation required. The migration skill or Teradata-enabled worker environment supplies this driver when configured.
- ODBC fallback. If `teradatasql` isn’t installed, the Worker falls back to `pyodbc` with the Teradata ODBC driver. Set `driver_name` to the exact name returned by `pyodbc.drivers()`.
When `teradatasql` is available, `driver_name` is ignored and no ODBC driver needs to be installed on the host. Use `dbc_name` when your Teradata COP or TDPID alias differs from `host`.
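A sketch; the `driver_name`, `dbc_name`, port, and authentication values follow the guidance above, and the remaining key names are assumptions for illustration:

```toml
[connections.source.teradata]
# Key names other than those called out above are illustrative
host = "teradata.example.com"
port = 1025
user = "validation_user"
password = "********"
authentication = "TD2"
# ODBC fallback only; exact name as returned by pyodbc.drivers()
# driver_name = "Teradata Database ODBC Driver 20.00"
# dbc_name = "tdprod"
```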
Command reference¶
The following table lists `scai` commands used for Cloud Data Validation. For worker TOML, workflow JSON fields, and Snowflake observability, see Using SCAI CLI, Worker configuration, and Validation workflow configuration reference. Orchestrator and worker install paths are covered in the Snowflake Migration Agent.
| Command | Purpose |
|---|---|
| `scai data validate generate-config` | Generate `data-validation-config.json` from the SnowConvert AI project. |
| `scai data validate create-workflow` | Create a cloud validation workflow; `--config` must be a `.json` file. |
| `scai data validate start` | Local all-in-one: generate config if needed, start a local Worker and Orchestrator, and watch. |
| `scai data validate status` | Show or watch validation workflow status. |
| `scai data validate list` | List validation workflows (`--status`, `--limit`, `--connection`). |
| `scai data worker start <config.toml>` | Start the Data Exchange Worker (see SnowConvert AI: Data migration). |
| `scai data worker start --auto-config [PATH]` | Emit a Worker TOML template for you to edit. |
| `scai data orchestrator setup` / `stop` | Deploy or stop the SPCS Orchestrator service (shared with data migration). |