SnowConvert AI: Data validation [Preview]

Data Validation uses the data-validation skill from the migration plugin (https://docs.snowflake.com/en/migrations/snowconvert-docs/general/user-guide/snowconvert/migration-skill/skill). For more details, contact snowconvert-support@snowflake.com.

The Data Validation feature of SnowConvert AI provides a fault-tolerant, scalable way to verify that data migrated into Snowflake matches the data in the original source system. It runs as a Cloud Data Validation workflow on the same infrastructure used by Data Migration, so you can migrate and validate with the same Orchestrator and Workers.

Cloud Data Validation is designed for migration scenarios where you are moving data from a system that you plan to decommission and need confidence that the migrated data is correct before cutting over. Supported source platforms are SQL Server, Amazon Redshift, and Teradata.

Architecture overview

Cloud Data Validation uses the same two components as Data Migration: an Orchestrator and one or more Workers.

  • The Orchestrator connects to the Snowflake account. It requires privileges to create and operate the SNOWCONVERT_AI database, where workflow, task, and validation metadata is stored.

  • One or more Workers connect to both the source system and the Snowflake account. Workers run validation queries on both sides, export intermediate results, upload them to Snowflake, and write their outcomes to shared results tables. Workers pick up tasks created by the Orchestrator and process them in parallel.

  • Validation results (schema, metrics, and row-level outcomes) are ingested into the shared results tables via Snowpipe by default and then evaluated by the Orchestrator.

Each Worker needs the optional snowflake-data-validation package installed in its Python environment to execute validation tasks. Workers without it can still run Data Migration tasks but skip any data_validation tasks queued for them.

Deployment options

The Orchestrator and Workers can be deployed in multiple ways:

  • Both on Snowpark Container Services (in the Snowflake account).

  • Both in the customer’s environment, including custom hardware, virtual machines, or containers.

  • Orchestrator on Snowpark Container Services and Workers in the customer’s environment, or the other way around.

The following requirements apply to the environment where the Orchestrator and Workers run:

  • The Orchestrator and Workers are Python packages, so Python 3.11 or higher must be installed.

  • Workers typically require an ODBC driver to connect to the source system. For Teradata, the pure-Python teradatasql driver is supported and preferred when installed.

  • The Orchestrator must be able to connect to the Snowflake account using a role that has privileges to create the SNOWCONVERT_AI database and create schemas and objects within it.

Validation levels

Cloud Data Validation performs comparisons between each source table (or view) and its corresponding target in Snowflake at three increasingly detailed levels. Each level can be enabled or disabled independently in the workflow configuration.

Schema validation (L1)

Schema validation confirms that the structure of each migrated table is preserved in Snowflake. It compares the following attributes between source and target:

  • Table name

  • Column names

  • Ordinal position of each column

  • Data types

  • Character maximum length for text columns

  • Numeric precision and scale for numeric columns

  • Row count

Metrics validation (L2)

Metrics validation confirms that aggregate statistics of the migrated data match the original source. Specific metrics vary by column data type, but metrics validation typically compares:

  • Minimum value

  • Maximum value

  • Average

  • Null count

  • Distinct count

  • Standard deviation

  • Variance

Row validation (L3)

Row validation performs row-level or cell-level comparison between source and target. It supports two modes:

  • row: compares entire rows identified by the configured index_column_list.

  • cell: compares individual cell values and reports the specific columns that differ.

A hybrid mode (two-phase MD5 plus cell drilldown) is also available and is used automatically when the source and target support it, so that only rows flagged by the first phase are re-examined at the cell level. Row validation is disabled by default and is typically applied only to the tables where it is needed because it is the most resource-intensive level.

Prerequisites

Before you use Cloud Data Validation, make sure the following are in place:

  • Python environment: Python 3.11 or higher, with the snowflake-data-migration-orchestrator, snowflake-data-exchange-agent, and snowflake-data-validation packages installed (see Installation).

  • Snowflake access: Connections for the Orchestrator and Workers in your Snowflake config.toml or connections.toml, using a role that can create the SNOWCONVERT_AI database and its objects. The first time the Orchestrator starts, it creates that database and related resources if they don’t exist yet. On later runs, use a role that can administer SNOWCONVERT_AI and its objects; sticking with the same role you used for the initial creation is the simplest way to avoid permission issues.

  • Source connectivity: For SQL Server and Redshift sources, an ODBC driver on the machine where Workers run. For Teradata, install the teradatasql Python driver (preferred) or configure pyodbc with the Teradata ODBC driver. Programmatic Access Tokens (PATs) are recommended for Snowflake connections; see Connecting to Snowflake with a PAT.

  • Target data available: The tables you want to validate must already exist in Snowflake, whether you migrated them using Data Migration or loaded them by other means. For accurate validation, don’t alter the migrated data between the migration and the validation run.

  • Snowpark Container Services (optional): If you deploy the Orchestrator or Workers on Snowflake compute, your account needs SPCS. See the Snowpark Container Services overview. Running both components outside Snowflake doesn’t require SPCS.

Warning

For accurate validation and to avoid false negatives, don’t alter the migrated data during the validation process.

Setup

Installation

Install the three Python packages (Python 3.11 or higher) on the machines where the Orchestrator and Workers will run:

pip install snowflake-data-migration-orchestrator
pip install snowflake-data-exchange-agent
pip install snowflake-data-validation

Keep the following in mind:

  • The Orchestrator host only needs snowflake-data-migration-orchestrator.

  • Workers that will execute Cloud Data Validation tasks need both snowflake-data-exchange-agent and snowflake-data-validation. Workers without snowflake-data-validation can still process Data Migration tasks but skip data_validation tasks.

  • On a single host that runs both components, install all three packages in the same Python environment.

Usage

To validate migrated data using this solution, complete the following high-level steps:

  1. Start the Orchestrator.

  2. Start the Workers.

  3. Create a Cloud Data Validation Workflow.

  4. Monitor the validation workflow until completion.

A Cloud Data Validation Workflow is a job submitted to the system that describes which tables to validate, at which levels, and with which comparison rules. You can submit multiple workflows simultaneously and monitor them. The Orchestrator breaks each workflow into smaller tasks, one or more per table, and dispatches them to available Workers.

The Orchestrator that runs validation is the same process that runs Data Migration. If you have already set up and started an Orchestrator and Workers for migration, you can reuse them for validation: make sure snowflake-data-validation is installed on the Workers, then create a validation workflow as described in Creating a Cloud Data Validation Workflow.

Starting the Orchestrator

Start the Orchestrator by running:

python -m data_migration_orchestrator start

When invoked without a subcommand, start is assumed for backward compatibility:

python -m data_migration_orchestrator

The start command accepts optional logging flags:

  • --log-destination (default: both): Where to send logs: stdout, file, or both.

  • --log-file (default: logs/data_migration_orchestrator.log): Path to the log file (used when the destination includes file).

  • --log-level (default: INFO): Logging level: DEBUG, INFO, WARNING, ERROR, or CRITICAL.

For example, to run with verbose logging to stdout only:

python -m data_migration_orchestrator start \
  --log-destination stdout \
  --log-level DEBUG

Before running, make sure the SNOWFLAKE_CONNECTION_NAME environment variable is set to a value that matches one of the connection names in your Snowflake config.toml or connections.toml. That is the name of the connection used to connect to the target Snowflake account.
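For example, on Linux or macOS (my-snowflake is a placeholder connection name):

export SNOWFLAKE_CONNECTION_NAME=my-snowflake
python -m data_migration_orchestrator start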

By default, workflow and task metadata objects are created under SNOWCONVERT_AI.DATA_MIGRATION, and data validation metadata is created under SNOWCONVERT_AI.DATA_VALIDATION. To use a different database or schema for that metadata, set the following environment variables before starting the Orchestrator:

  • CUSTOM_SNOWFLAKE_DATABASE_FOR_METADATA (default SNOWCONVERT_AI)

  • CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_MIGRATION_METADATA (default DATA_MIGRATION)

  • CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_VALIDATION_METADATA (default DATA_VALIDATION)

If you override these values, set the same database and migration schema in each Worker’s configuration using snowflake_database_for_metadata and snowflake_schema_for_data_migration_metadata under [application] (see Worker configuration).
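For example, to keep all metadata in a hypothetical MIGRATION_META database, export the overrides before starting the Orchestrator:

export CUSTOM_SNOWFLAKE_DATABASE_FOR_METADATA=MIGRATION_META
export CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_MIGRATION_METADATA=DM_METADATA
export CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_VALIDATION_METADATA=DV_METADATA

Then mirror the database and migration schema in each Worker’s configuration.toml:

[application]
snowflake_database_for_metadata = "MIGRATION_META"
snowflake_schema_for_data_migration_metadata = "DM_METADATA"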

The Orchestrator runs until you stop it. Cloud Data Validation Workflows require an active Orchestrator to complete. However, you can safely stop the Orchestrator at any point and restart it later; ongoing workflows resume where they left off.

Starting the Workers

The Worker (the data-exchange-agent) provides two subcommands: run (default) and test.

Run

Start a Worker that processes tasks:

# Start with a configuration file
data-exchange-agent run -c <configuration-file-path>

# Start with the default configuration.toml in the current directory
data-exchange-agent run

# Omitting the subcommand defaults to 'run' (backward compatible)
data-exchange-agent -c <configuration-file-path>

# Custom parallelism and port
data-exchange-agent run --max-parallel-tasks 8 --port 8080

# Task-handling only, without the HTTP server (for multi-worker setups)
data-exchange-agent run --no-server

# Custom base directory for exported files (overrides config)
data-exchange-agent run --local-results-directory /mnt/dea-exports

# Debug mode on a custom port
data-exchange-agent run --debug --port 5001

The run subcommand accepts the following flags:

  • --config, -c (default: configuration.toml): Path to the TOML configuration file.

  • --max-parallel-tasks, -w (default: from config): Maximum number of parallel tasks.

  • --interval, -i (default: from config): Interval in seconds between task fetch attempts.

  • --host (default: 0.0.0.0): Host to bind the HTTP server to.

  • --port, -p (default: 5001): Port to bind the HTTP server to.

  • --no-server (default: off): Run task handling only, without starting the HTTP server.

  • --local-results-directory (default: from config): Base directory for exported files before upload.

  • --debug, -d (default: off): Enable debug mode.

See Worker configuration for the configuration file specification.

Test

Before starting a Worker, verify that every configured source and target connection works:

data-exchange-agent test -c <configuration-file-path>

This runs SELECT 1 on every configured connection and reports the results.

Workers run until you stop them. Cloud Data Validation Workflows require at least one active Worker with snowflake-data-validation installed. You can safely stop Workers at any point and restart them later; ongoing workflows resume where they left off.

Creating a Cloud Data Validation Workflow

After installing the Orchestrator, create a validation workflow by running the following command:

python -m data_migration_orchestrator create-data-validation-workflow <validation-config-file-path> \
  --source-platform <source-platform> \
  [--name <workflow-name>] \
  [--connection-name <connection-name>]

Keep the following in mind:

  • The validation configuration specification can be found in Validation workflow configuration reference.

  • --source-platform is required. Supported values are sqlserver, redshift, and teradata. The value must match the source_platform field in the JSON configuration file.

  • --name is optional. The workflow name must be alphanumeric and cannot start with a digit. Defaults to MY_VALIDATION_WORKFLOW when omitted.

  • --connection-name is optional. When omitted, the Orchestrator uses the default Snowflake connection from environment variables. When provided, it should match a named connection in your config.toml or connections.toml file.

For example, to create a named SQL Server validation workflow using a specific Snowflake connection:

python -m data_migration_orchestrator create-data-validation-workflow my-validation.json \
  --source-platform sqlserver \
  --name VALIDATE_SALES_Q1 \
  --connection-name my-snowflake

New workflow rows are inserted into the WORKFLOW table in the data migration metadata schema (default SNOWCONVERT_AI.DATA_MIGRATION) with WORKFLOW_TYPE set to data-validation. Validation results and related objects are stored under the data validation metadata schema (default SNOWCONVERT_AI.DATA_VALIDATION).

Monitoring a Cloud Data Validation Workflow

Each workflow goes through different stages throughout its lifecycle:

  • Pending: No tasks have been created for this workflow yet.

  • Executing: Tasks have been created for this workflow and there are still tasks that haven’t reached a terminal state (COMPLETED or FAILED).

  • Completed: All tasks have reached a terminal state (COMPLETED or FAILED).

In the data validation metadata schema (default SNOWCONVERT_AI.DATA_VALIDATION), the following views can be queried to understand the status of validation workflows:

  • TABLE_PROGRESS: One row per validated table. Summarizes overall validation status. Can be filtered by WORKFLOW_ID.

  • TABLE_PROGRESS_DETAIL: Per-table breakdown with partition-level L2/L3 status (VALID, INVALID, EXECUTION_ERROR). Can be filtered by WORKFLOW_ID.

  • DATA_VALIDATION_ERROR: Errors encountered during validation. Can be filtered by WORKFLOW_ID.

  • DATA_VALIDATION_WARNING: Non-fatal warnings, for example, unsupported column types or metric exclusions. Can be filtered by WORKFLOW_ID.
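For example, assuming the default metadata location, the following query returns the overall status of every table in one workflow (replace the WORKFLOW_ID value with your own):

SELECT *
FROM SNOWCONVERT_AI.DATA_VALIDATION.TABLE_PROGRESS
WHERE WORKFLOW_ID = '<workflow-id>';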

In the same schema, the DATA_VALIDATION_DASHBOARD Streamlit dashboard provides a visual overview of validation progress and results, including a Table Progress tab that aggregates the views above.

You can also inspect validation queries in QUERY_HISTORY using the QUERY_TAG values set by the Orchestrator and Workers. See Query tagging.

Validation outcomes are classified into three categories:

  • OK: Values match exactly between the source database and Snowflake.

  • Warning: The Snowflake table has minor differences that don’t affect the data (for example, higher numeric precision).

  • Error: Values don’t match between the original database and the Snowflake database.

Considerations and recommendations

Connecting to Snowflake with a PAT

Use Programmatic Access Tokens (PATs) for connections used by the Orchestrator and Workers. This avoids the need to constantly authenticate through the browser or with an authenticator app. You need to establish a Network Policy or temporarily bypass the requirement for a Network Policy (which you can do from Snowsight).
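As an illustration, a named connection in ~/.snowflake/connections.toml might look like the following sketch; the connection name, account, user, role, and warehouse are placeholders, and the token is supplied in place of a password:

# Placeholder values; replace with your own account details
[my-snowflake]
account = "myorg-myaccount"
user = "validation_user"
password = "<programmatic-access-token>"
role = "MIGRATION_ROLE"
warehouse = "VALIDATION_WH"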

Running Orchestrator and Workers on SPCS

To leverage Snowflake compute for these tasks:

  1. Prepare Docker images that use the Python modules with the appropriate configuration (a minimal Dockerfile sketch appears below).

  2. Push those Docker images to an Image Repository in Snowflake.

  3. Execute the Orchestrator and/or Worker images using Snowpark Container Services.

Keep the following in mind:

  • Run them as Services, not Jobs.

  • You can run only one component (Orchestrator or Workers) in SPCS and the other on a different platform.

  • Monitor the SPCS service and suspend it when it isn’t being used.

  • Depending on the network configuration of the source system, you might need to configure an External Access Integration so that these services can connect to your source system.
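As a sketch of step 1, a minimal Worker image might look like the following Dockerfile. The base image, file locations, and the way configuration is provided to the container are assumptions, not a prescribed layout:

# Illustrative Worker image; adapt paths and configuration handling to your environment
FROM python:3.11-slim

# Worker packages: the exchange agent plus the optional validation dependency
RUN pip install --no-cache-dir snowflake-data-exchange-agent snowflake-data-validation

# Worker configuration and Snowflake connection files are assumed to be provided at build time
COPY configuration.toml /app/configuration.toml
COPY connections.toml /root/.snowflake/connections.toml

WORKDIR /app
CMD ["data-exchange-agent", "run", "-c", "/app/configuration.toml"]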

Initial testing

For an early test run, use a separate validation configuration whose tables array lists only the table or small set of tables you want to validate. On each of those entries:

  • Set where_clause (and optionally target_where_clause) to an SQL-like predicate so only a subset of rows is validated, for example, a bounded primary-key range or a narrow date range in the source dialect.

  • Keep validation_configuration.row_validation disabled for the first run, and enable it later on a smaller subset of tables.

  • Use a small target_partition_size_mb or target_partition_size_rows to keep partitions tiny during the test.

After you confirm connectivity and results, create your full workflow: remove or relax where_clause, adjust partition sizes, and enable row validation only on the tables where it is needed.
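For example, a small test configuration for a single SQL Server table might look like this; the table name, key column, and predicate are illustrative:

{
  "source_platform": "sqlserver",
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": false
  },
  "tables": [
    {
      "fully_qualified_name": "SampleStoreDB.data_migration_cloud_test.store_employee",
      "column_names_to_partition_by": ["ID"],
      "where_clause": "ID BETWEEN 1 AND 10000",
      "target_partition_size_rows": 5000
    }
  ]
}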

Managing Workers

The time it takes to complete a validation workflow depends on many variables, but the number of Workers (and threads per Worker) has the greatest impact, as it determines how many validation tasks can be executed in parallel. Consider the following:

  • You don’t need to run multiple Workers on the same machine. If you want more parallelism on a single machine, increase the thread count instead (--max-parallel-tasks on data-exchange-agent run).

  • Network bandwidth greatly affects Worker speed and is shared between threads of a Worker.

  • Even with many Workers and threads running in parallel, the source system might not have enough resources to handle the load.

  • Keep a low Worker count to avoid overloading your source system.

  • Consider stopping some or all Workers when the source system is already under heavy load from other operations.

Query tagging

Both the Orchestrator and the Worker automatically set Snowflake’s QUERY_TAG session parameter on every query they submit. Tags are compact JSON strings containing identifiers such as the workflow ID, task ID, and component version. You can use these tags to filter and attribute validation queries in QUERY_HISTORY:

SELECT query_text, query_tag, start_time
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE TRY_PARSE_JSON(query_tag):DMVF_WORKFLOW_ID IS NOT NULL
ORDER BY start_time DESC;

  • DMVF_VERSION (infrastructure queries): Component package version.

  • DMVF_WORKFLOW_ID (task-processing queries): Workflow that originated the task.

  • DMVF_TASK_ID (task-processing queries): Individual task identifier.

  • DMVF_ORCHESTRATOR_VERSION (Orchestrator task-processing queries): Orchestrator package version.

  • DMVF_WORKER_VERSION (Worker task-processing queries): Worker package version.

Configuration reference

Validation workflow configuration reference

The validation workflow configuration file is a JSON object. The following sections describe its structure and properties.

Note

Names that require quoting (or brackets) must be manually quoted as they would normally be in JSON. For example: "tableName": "\"MyCaseSensitiveTable\"".

Top-level object

  • source_platform (String, required): Source dialect identifier: sqlserver, redshift, or teradata. Must match the --source-platform argument used to create the workflow.

  • target_platform (String): Defaults to Snowflake.

  • target_database (String): Default target database name for tables that don’t specify one.

  • validation_configuration (Object): Global validation levels and options. See Validation configuration.

  • comparison_configuration (Object): Numeric tolerance and optional type mapping file. See Comparison configuration.

  • database_mappings (Object): Map of source database names to Snowflake database names.

  • schema_mappings (Object): Map of source schema names to Snowflake schema names.

  • tables (Array, required): At least one table to validate. See Per-table and per-view entry.

  • views (Array): Additional view entries. Uses the same shape as tables.

  • use_snowflake_compute (Boolean): When true, enables Snowflake-side computation paths where supported. Default false.

  • target_partition_size_rows (Integer): Desired rows per partition. Mutually exclusive with target_partition_size_mb. Must be greater than 0. The default is 200 MB per partition when both are omitted.

  • target_partition_size_mb (Integer): Desired MB per partition. Mutually exclusive with target_partition_size_rows. Must be greater than 0. The default is 200 MB per partition when both are omitted.

  • use_snowpipe_for_results (Boolean): When true (default), L2/L3 validation results are ingested into the shared results tables via Snowpipe. Set to false to fall back to per-partition COPY INTO tasks.

Validation configuration

The validation_configuration object sets global defaults for which validation levels to run. Any field set here can be overridden per table by nesting a validation_configuration object on that table entry.

When validation_configuration is omitted, the Orchestrator applies these defaults: schema validation and metrics validation are enabled; row validation is disabled; row_validation_mode defaults to row; continue_on_failure defaults to false; max_failed_rows_number defaults to 100; exclude_metrics defaults to false; apply_metric_column_modifier defaults to true.
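Expressed as an explicit configuration, those defaults are equivalent to:

{
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": false,
    "row_validation_mode": "row",
    "continue_on_failure": false,
    "max_failed_rows_number": 100,
    "exclude_metrics": false,
    "apply_metric_column_modifier": true
  }
}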

  • schema_validation (Boolean): Level 1: schema and column consistency checks.

  • metrics_validation (Boolean): Level 2: statistical metrics comparison.

  • row_validation (Boolean): Level 3: row-level or cell-level data comparison.

  • row_validation_mode (String): For row validation: typically row or cell.

  • continue_on_failure (Boolean): Whether to continue to the next validation level after a failure.

  • max_failed_rows_number (Integer): Cap on failed rows reported for L3 validation. Must be greater than 0 when set.

  • exclude_metrics (Boolean): Whether to exclude unsupported metric columns.

  • apply_metric_column_modifier (Boolean): Whether to apply metric column modifiers.

Comparison configuration

The comparison_configuration object controls numeric tolerance and optional type-mapping overrides used during comparisons.

  • tolerance (Number): Numeric comparison tolerance for metrics. Must be greater than 0 when set. Default 0.001 when omitted.

  • type_mapping_file_path (String): Optional path to a custom type mapping file for comparisons.
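For example, a comparison_configuration that tightens the tolerance and points to a custom type mapping file might look like this (the file path is a placeholder):

{
  "comparison_configuration": {
    "tolerance": 0.0001,
    "type_mapping_file_path": "<path-to-type-mapping-file>"
  }
}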

Per-table and per-view entry

Each object in tables (or views) describes one database object to validate.

  • fully_qualified_name (String, required): Source object name. The exact format depends on the source platform.

  • use_column_selection_as_exclude_list (Boolean): Default false.

  • column_selection_list (String[]): Columns to include or exclude, depending on use_column_selection_as_exclude_list.

  • target_name (String): Target object name override.

  • target_database (String): Per-table target database override.

  • target_schema (String): Per-table target schema override.

  • where_clause (String): Filter on the source side.

  • target_where_clause (String): Filter on the target side.

  • index_column_list (String[]): Columns used to align rows on the source.

  • target_index_column_list (String[]): Columns used to align rows on the target.

  • column_mappings (Object): Map of source column name to target column name.

  • is_case_sensitive (Boolean): Case sensitivity for identifiers.

  • chunk_number (Integer): Chunking hint for validation. Must be greater than 0 when set.

  • max_failed_rows_number (Integer): Overrides the global cap for this object.

  • exclude_metrics (Boolean): Per-object metrics exclusion override.

  • apply_metric_column_modifier (Boolean): Per-object modifier override.

  • object_type (String): Typically TABLE or VIEW.

  • column_names_to_partition_by (String[]): Columns used for range-based (NTILE) partitioning during validation. Without this, the table is processed as a single partition.

  • target_partition_size_rows (Integer): Per-table override for desired rows per partition. Mutually exclusive with target_partition_size_mb. Must be greater than 0.

  • target_partition_size_mb (Integer): Per-table override for desired MB per partition. Mutually exclusive with target_partition_size_rows. Must be greater than 0.

  • validation_configuration (Object): Nested object with the same fields as the global validation_configuration to override defaults for this object only.

Partitioning

When column_names_to_partition_by is set, the Orchestrator splits the table into range-based partitions. Cloud Data Validation and Data Migration share the same sizing logic:

  1. Compute a target rows-per-partition from whichever user setting is provided. The two settings are mutually exclusive:

    • target_partition_size_rows is used as-is.

    • target_partition_size_mb is converted to rows using target_mb / avg_row_mb.

    • If neither is set, Cloud Data Validation defaults to 200 MB per partition.

  2. Apply an internal cap. System-imposed maximums (not user-configurable) limit partition size to safe infrastructure bounds.

  3. Derive the partition count as ceil(row_count / effective_rows_per_partition), or 1 when the entire table fits in a single partition.
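The following Python sketch illustrates this sizing logic; the internal cap is system-imposed and not documented here, so it appears only as a hypothetical parameter:

import math

def estimate_partitions(row_count, avg_row_mb,
                        target_partition_size_rows=None,
                        target_partition_size_mb=None,
                        internal_cap_rows=None):
    # 1. Target rows per partition from whichever setting is provided (default 200 MB)
    if target_partition_size_rows is not None:
        rows_per_partition = target_partition_size_rows
    else:
        target_mb = target_partition_size_mb if target_partition_size_mb is not None else 200
        rows_per_partition = target_mb / avg_row_mb
    # 2. Apply the internal cap (hypothetical value, not user-configurable)
    if internal_cap_rows is not None:
        rows_per_partition = min(rows_per_partition, internal_cap_rows)
    # 3. Partition count: ceil(row_count / effective rows per partition), at least 1
    return max(1, math.ceil(row_count / rows_per_partition))

# 10 million rows of roughly 200 bytes (0.0002 MB) each with the default 200 MB target -> 10 partitions
print(estimate_partitions(10_000_000, avg_row_mb=0.0002))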

Example: SQL Server validation

The following configuration runs schema and metrics validation on two SQL Server tables, overrides defaults on the second table, and enables row-level validation there:

{
  "source_platform": "sqlserver",
  "database_mappings": {
    "SampleStoreDB": "samplestoredb"
  },
  "schema_mappings": {
    "data_migration_cloud_test": "data_migration_cloud_test"
  },
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": false
  },
  "comparison_configuration": {
    "tolerance": 0.001
  },
  "tables": [
    {
      "fully_qualified_name": "SampleStoreDB.data_migration_cloud_test.store_employee",
      "target_name": "target_employee",
      "column_names_to_partition_by": ["ID"]
    },
    {
      "fully_qualified_name": "SampleStoreDB.data_migration_cloud_test.Sales_Simple",
      "column_names_to_partition_by": ["ID"],
      "index_column_list": ["ID"],
      "target_partition_size_mb": 256,
      "validation_configuration": {
        "row_validation": true,
        "row_validation_mode": "cell",
        "max_failed_rows_number": 500
      },
      "where_clause": "is_deleted = 0"
    }
  ]
}

Example: Redshift validation after UNLOAD migration

The following configuration validates Redshift tables that were migrated using the UNLOAD extraction strategy. Schema and metrics validation are enabled globally, and row validation is enabled on a single high-value table:

{
  "source_platform": "redshift",
  "target_database": "TARGET_DB",
  "schema_mappings": {
    "ecommerce_raw": "ecommerce_raw"
  },
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": false
  },
  "tables": [
    {
      "fully_qualified_name": "snowconvert_demo.ecommerce_raw.customers",
      "column_names_to_partition_by": ["customer_id"]
    },
    {
      "fully_qualified_name": "snowconvert_demo.ecommerce_raw.orders",
      "column_names_to_partition_by": ["order_id"],
      "index_column_list": ["order_id"],
      "validation_configuration": {
        "row_validation": true,
        "row_validation_mode": "row"
      }
    }
  ]
}

For the corresponding migration workflow, see Redshift UNLOAD in the Data Migration page.

Example: Teradata validation

The following configuration validates Teradata tables using schema and metrics validation, with a custom tolerance and per-table partition sizing:

{
  "source_platform": "teradata",
  "target_database": "TARGET_DB",
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": false
  },
  "comparison_configuration": {
    "tolerance": 0.0001
  },
  "tables": [
    {
      "fully_qualified_name": "tpcds.store_sales",
      "column_names_to_partition_by": ["ss_ticket_number"],
      "target_partition_size_mb": 512
    },
    {
      "fully_qualified_name": "tpcds.customer",
      "target_schema": "PUBLIC",
      "column_names_to_partition_by": ["c_customer_sk"],
      "validation_configuration": {
        "row_validation": true,
        "row_validation_mode": "cell"
      }
    }
  ]
}

Worker configuration

The Worker configuration file uses TOML format and is shared between Data Migration and Cloud Data Validation. Workers that execute validation tasks need the snowflake-data-validation package installed in the same Python environment.

  • selected_task_source (String, top level): Required. Must be "snowflake_stored_procedure".

  • [application] max_parallel_tasks (Integer): Maximum number of tasks the Worker processes in parallel using threads.

  • [application] task_fetch_interval (Integer): Interval in seconds between attempts to fetch new tasks from the Orchestrator.

  • [application] lease_refresh_interval (Integer): Optional. Interval in seconds between task lease renewals. Default 120.

  • [application] affinity (String): Optional. A user-defined affinity for the Worker.

  • [application] snowflake_database_for_metadata (String): Optional. Database where the Orchestrator deployed the task queue. Default SNOWCONVERT_AI. Must match the Orchestrator’s CUSTOM_SNOWFLAKE_DATABASE_FOR_METADATA if you override it.

  • [application] snowflake_schema_for_data_migration_metadata (String): Optional. Schema for PULL_TASKS / COMPLETE_TASK / FAIL_TASK. Default DATA_MIGRATION. Must match the Orchestrator’s CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_MIGRATION_METADATA if overridden.

  • [application] local_results_directory (String): Optional. Base directory where each task’s exported Parquet or CSV files are written before upload. Defaults to ~/.data_exchange_agent/result_data. Each run uses a subfolder task_<id>/<timestamp>. Tilde (~) and relative paths are expanded at load time.

  • [connections.source.*] (Object): Source system connection. Workers typically require an ODBC driver to connect to the source.

  • [connections.target.snowflake_connection_name] connection_name (String): The name of the connection entry in your ~/.snowflake/config.toml file.

When selected_task_source is "snowflake_stored_procedure", the Worker calls task-queue stored procedures using application.snowflake_database_for_metadata and application.snowflake_schema_for_data_migration_metadata. These values are independent of the Snowflake session defaults (SNOWFLAKE_DATABASE, SNOWFLAKE_SCHEMA) in the connection profile.

An example configuration file looks like this:

selected_task_source = "snowflake_stored_procedure"

[application]
max_parallel_tasks = 4
task_fetch_interval = 30
# Optional: only if the Orchestrator uses CUSTOM_SNOWFLAKE_* overrides for metadata location
# snowflake_database_for_metadata = "SNOWCONVERT_AI"
# snowflake_schema_for_data_migration_metadata = "DATA_MIGRATION"

# SQL Server connection (standard authentication)
[connections.source.sqlserver]
username = "username"
password = "password"
database = "database_name"
host = "127.0.0.1"
port = 1433

# Snowflake target connection
[connections.target.snowflake_connection_name]
connection_name = "connection_name"

Note

Only one source connection is needed.

Source connection configuration examples

The following examples show the supported source connection types for Cloud Data Validation.

1. SQL Server (standard authentication)

[connections.source.sqlserver]
username = "username"
password = "password"
database = "database_name"
host = "127.0.0.1"
port = 1433

The Worker automatically detects the best available ODBC driver for SQL Server, preferring newer versions (ODBC Driver 18 > 17 > 13 > 11). To manually select a driver, set odbc_driver to the exact name returned by pyodbc.drivers():

[connections.source.sqlserver]
odbc_driver = "ODBC Driver 17 for SQL Server"

The encrypt and trust_server_certificate parameters are optional. When omitted, the ODBC driver uses its default behavior:

  • ODBC Driver 17 and below: encryption is disabled by default.

  • ODBC Driver 18 and above: encryption is mandatory by default.

[connections.source.sqlserver]
username = "sa"
password = "mypassword"
database = "mydb"
host = "my-server.example.com"
port = 1433
encrypt = true
trust_server_certificate = false

For development environments or SQL Servers without encryption support, either omit the encryption parameters or set encrypt = false.

2. Amazon Redshift (IAM authentication)

[connections.source.redshift]
username = "demo-user"
database = "demo_db"
auth_method = "iam-provisioned-cluster"
cluster_id = "my-aws-cluster"
region = "us-west-2"
access_key_id = "your-access-key-id"
secret_access_key = "your-secret-access-key"

3. Amazon Redshift (standard authentication)

[connections.source.redshift]
username = "myuser"
password = "mypassword"
database = "mydatabase"
host = "my-cluster.abcdef123456.us-west-2.redshift.amazonaws.com"
port = 5439
auth_method = "standard"

4. Teradata

The Worker supports two Teradata drivers and automatically selects the best one available:

  1. teradatasql (preferred). Pure Python driver, no OS-level ODBC installation required. Install with pip install teradatasql.

  2. ODBC fallback. If teradatasql isn’t installed, the Worker falls back to pyodbc with the Teradata ODBC driver. Set driver_name to the exact name returned by pyodbc.drivers().

When teradatasql is available, driver_name is ignored and no ODBC driver needs to be installed on the host. Use dbc_name when your Teradata COP or TDPID alias differs from host.

[connections.source.teradata]
host = "your-teradata-host.example.com"
port = 1025
database = "tpcds"
username = "your_username"
password = "your_password"
# driver_name = "Teradata Database ODBC Driver 17.20"  # only needed for ODBC fallback
# dbc_name = "TDPID_ALIAS"  # optional; defaults to host

Command reference

The following table lists every CLI command used to set up and operate Cloud Data Validation.

  • pip install snowflake-data-migration-orchestrator: Install the Orchestrator package.

  • pip install snowflake-data-exchange-agent: Install the Worker package.

  • pip install snowflake-data-validation: Install the optional validation dependency required on Workers to process validation tasks.

  • python -m data_migration_orchestrator start: Start the Orchestrator.

  • python -m data_migration_orchestrator start --log-destination <dest> --log-file <path> --log-level <level>: Start the Orchestrator with custom logging settings.

  • python -m data_migration_orchestrator: Start the Orchestrator (backward-compatible; start is implied).

  • python -m data_migration_orchestrator create-data-validation-workflow <config> --source-platform <platform>: Create a Cloud Data Validation Workflow.

  • python -m data_migration_orchestrator create-data-migration-workflow <config> --source-platform <platform>: Create a Cloud Data Migration Workflow on the same Orchestrator.

  • data-exchange-agent run -c <config>: Start a Worker with a specific configuration file.

  • data-exchange-agent run: Start a Worker using configuration.toml in the current directory.

  • data-exchange-agent -c <config>: Start a Worker (backward-compatible; run is implied).

  • data-exchange-agent run --max-parallel-tasks <N> --port <port>: Start a Worker with a custom thread count and HTTP server port.

  • data-exchange-agent run --no-server: Start a Worker that processes tasks only (no HTTP server).

  • data-exchange-agent run --local-results-directory <path>: Start a Worker with a custom export directory.

  • data-exchange-agent run --debug --port <port>: Start a Worker in debug mode.

  • data-exchange-agent test -c <config>: Test every configured source and target connection (SELECT 1).