SnowConvert AI CLI - Data validation¶

Tip

This page is a SnowConvert AI CLI user-guide overview. For the full, up-to-date reference (Oracle, PostgreSQL, Teradata complete configuration, thresholds, column regex filtering, Worker generate-config, and dashboard reporting), see Data validation (technical documentation).

Data Validation uses the data-validation skill from the Snowflake AIM Agent for Data Warehouses. For more details, contact aim-support@snowflake.com.

The Data Validation feature provides a fault-tolerant, scalable way to verify that data migrated into Snowflake matches the data in the original source system. It runs as a Cloud Data Validation workflow on the same infrastructure used by Data Migration, so you can migrate and validate with the same Orchestrator and Workers.

Cloud Data Validation is designed for migration scenarios where you are moving data from a system that you plan to decommission and need confidence that the migrated data is correct before cutting over. Supported source platforms are SQL Server, Amazon Redshift, Teradata, Oracle, and PostgreSQL. Azure Synapse support is planned.

Architecture overview¶

Cloud Data Validation uses the same two components as Data Migration: an Orchestrator and one or more Workers.

The Orchestrator connects to the Snowflake account. It requires privileges to create and operate the SNOWCONVERT_AI database, where workflow, task, and validation metadata is stored.
One or more Workers connect to both the source system and the Snowflake account. Workers run validation queries on both sides, export intermediate results, upload them to Snowflake, and write their outcomes to shared results tables. Workers pick up tasks created by the Orchestrator and process them in parallel.
Validation results (schema, metrics, and row-level outcomes) are ingested into the shared results tables through Snowpipe by default and then evaluated by the Orchestrator.

Each Worker must have data validation capability enabled to execute validation tasks. Workers without it can still run Data Migration tasks but skip any data_validation tasks queued for them.

Deployment options¶

The Orchestrator and Workers can be deployed in multiple ways:

Both on Snowpark Container Services (in the Snowflake account).
Both in the customer’s environment, including custom hardware, virtual machines, or containers.
Orchestrator on Snowpark Container Services and Workers in the customer’s environment, or the other way around.

The following requirements apply to the environment where the Orchestrator and Workers run:

Workers typically require an ODBC driver to connect to the source system. For Teradata, the pure-Python teradatasql driver is supported and preferred when installed.
The Orchestrator must be able to connect to the Snowflake account using a role that has privileges to create the SNOWCONVERT_AI database and create schemas and objects within it.

Validation levels¶

Cloud Data Validation performs comparisons between each source table (or view) and its corresponding target in Snowflake at three increasingly detailed levels. Each level can be enabled or disabled independently in the workflow configuration.

Schema validation (L1)¶

Schema validation confirms that the structure of each migrated table is preserved in Snowflake. It compares the following attributes between source and target:

Table name
Column names
Ordinal position of each column
Data types
Character maximum length for text columns
Numeric precision and scale for numeric columns
Nullability
Row count

Metrics validation (L2)¶

Metrics validation confirms that aggregate statistics of the migrated data match the original source. Specific metrics vary by column data type, but metrics validation typically compares:

Row count (table level)
Minimum and maximum value
Sum and average
Null count and distinct count
Standard deviation and variance

Row validation (L3)¶

Row validation performs row-level or cell-level comparison between source and target. It supports:

hybrid (default when L3 is enabled): MD5 fingerprint first, then cell drilldown on failures.
row: whole-row MD5 comparison using index_column_list.
cell: per-column mismatch reporting.

A hybrid mode (two-phase MD5 plus cell drilldown) reduces cost versus pure cell on large tables when supported. Row validation is disabled by default and is typically applied only to the tables where it is needed because it is the most resource-intensive level.

Prerequisites¶

Before you use Cloud Data Validation, make sure the following are in place:

SCAI CLI: The SnowConvert AI (SCAI) CLI must be installed. Use scai data orchestrator setup and scai data worker setup to deploy the Orchestrator and Workers (see Installation).
Snowflake access: Connections for the Orchestrator and Workers in your Snowflake config.toml or connections.toml, using a role that can create the SNOWCONVERT_AI database and its objects. The first time the Orchestrator starts, it creates that database and related resources if they don’t exist yet. On later runs, use a role that can administer SNOWCONVERT_AI and its objects; sticking with the same role you used for the initial creation is the simplest way to avoid permission issues.
Source connectivity: For SQL Server and Redshift sources, an ODBC driver on the machine where Workers run. For Teradata, install the teradatasql Python driver (preferred) or configure pyodbc with the Teradata ODBC driver. Programmatic Access Tokens (PATs) are recommended for Snowflake connections; see Connecting to Snowflake with a PAT.
Target data available: The tables you want to validate must already exist in Snowflake, either because you migrated them using Data Migration or by another means. For accurate validation, don’t alter the migrated data between the migration and the validation run.
Snowpark Container Services (optional): If you deploy the Orchestrator or Workers on Snowflake compute, your account needs SPCS. See the Snowpark Container Services overview. Running both components outside Snowflake doesn’t require SPCS.

Warning

For accurate validation and to avoid false negatives, don’t alter the migrated data during the validation process.

Setup¶

Installation¶

Use the SCAI CLI to deploy the Orchestrator and Workers. Cloud deployment requires Snowpark Container Services (SPCS) to be enabled on the Snowflake account.

# Generate Worker TOML from your project (recommended):
scai data worker generate-config .scai/settings/DataExchangeWorkerConfig.toml

# Set up the Orchestrator on SPCS (first time):
scai data orchestrator setup --compute-pool MY_COMPUTE_POOL

# Set up the Worker SPCS service (first time):
scai data worker setup --compute-pool MY_COMPUTE_POOL

For local deployment (without SPCS), use scai data worker start --local. See Starting the Workers.

Usage¶

To validate migrated data using this solution, complete the following high-level steps:

Start the Orchestrator.
Start the Workers.
Create a Cloud Data Validation Workflow.
Monitor the validation workflow until completion.

A Cloud Data Validation Workflow is a job submitted to the system that describes which tables to validate, at which levels, and with which comparison rules. You can submit multiple workflows simultaneously and monitor them. The Orchestrator breaks each workflow into smaller tasks, one or more per table, and dispatches them to available Workers.

The Orchestrator that runs validation is the same process that runs Data Migration. If you have already set up and started an Orchestrator and Workers for migration, you can reuse them for validation.

Starting the Orchestrator¶

Use these commands to set up and start the Orchestrator.

# Set up the Orchestrator on SPCS (first time):
scai data orchestrator setup --compute-pool MY_COMPUTE_POOL

# Set up with a specific connection:
scai data orchestrator setup --compute-pool MY_COMPUTE_POOL --connection my-snowflake

# Resume a stopped Orchestrator:
scai data orchestrator start

# Stop the Orchestrator:
scai data orchestrator stop

The Orchestrator runs until you stop it. Cloud Data Validation Workflows require an active Orchestrator to complete. However, the Orchestrator can be safely stopped at any point and resumed later; ongoing workflows resume at that point.

Starting the Workers¶

Use these commands to set up and start the Workers.

# Set up Worker SPCS service (first time):
scai data worker setup --compute-pool MY_COMPUTE_POOL

# Set up with a specific connection:
scai data worker setup --compute-pool MY_COMPUTE_POOL --connection my-snowflake

# Resume/start the Worker service:
scai data worker start

# Run a local Worker (optional, for customer-environment deployment):
scai data worker start --local my-worker-config.toml

# Auto-generate local worker config:
scai data worker start --local --auto-config

Workers run until you stop them. Cloud Data Validation Workflows require at least one active Worker. Workers can be safely stopped at any point and resumed later; ongoing workflows resume at that point.

Creating a Cloud Data Validation Workflow¶

Generate a Validation Configuration based on the state of your SCAI project. You can also create one by hand or ask Snowflake CoCo for help. See Validation workflow configuration reference for the full specification.

# Generate config for all tables:
scai data validate generate-config

# Filter tables by schema:
scai data validate generate-config --where "source.schema = 'public'"

# Custom output path:
scai data validate generate-config -o my-config.yaml

Once you have a Validation Configuration, create and start a validation workflow.

# Start validation (returns immediately):
scai data validate start --connection my-snowflake

# Start validation with custom config:
scai data validate start --config my-data-validation-config.yaml --connection my-snowflake

# Create workflow and wait for completion:
scai data validate create-workflow --config my-data-validation-config.yaml --connection my-snowflake --watch

If you want to migrate data before validating it, use scai data migrate start. See Data migration for details.

Monitoring a Cloud Data Validation Workflow¶

Each workflow goes through different stages throughout its lifecycle:

Pending: No tasks have been created for this workflow yet.
Executing: Tasks have been created for this workflow and there are still tasks that haven’t reached a terminal state (COMPLETED or FAILED).
Completed: All tasks have reached a terminal state (COMPLETED or FAILED).

# Check workflow status:
scai data validate status DATA_VALIDATION_WORKFLOW_xx_yy_zz

# Watch workflow progress:
scai data validate status DATA_VALIDATION_WORKFLOW_xx_yy_zz --watch

# List all validation workflows:
scai data validate list

# Filter by status:
scai data validate list --status failed

In the data validation metadata schema (default SNOWCONVERT_AI.DATA_VALIDATION), the following views can be queried to understand the status of validation workflows:

View	Description
`TABLE_PROGRESS`	One row per validated table. Summarizes overall validation status. Can be filtered by `WORKFLOW_ID`.
`TABLE_PROGRESS_DETAIL`	Per-table breakdown with partition-level L2/L3 status (`VALID`, `INVALID`, `EXECUTION_ERROR`). Can be filtered by `WORKFLOW_ID`.
`DATA_VALIDATION_ERROR`	Errors encountered during validation. Can be filtered by `WORKFLOW_ID`.
`DATA_VALIDATION_WARNING`	Non-fatal warnings, for example, unsupported column types or metric exclusions. Can be filtered by `WORKFLOW_ID`.

In the same schema, the DATA_VALIDATION_DASHBOARD Streamlit dashboard provides a visual overview of validation progress and results, including a Table Progress tab that aggregates the views above.

You can also inspect validation queries in QUERY_HISTORY using the QUERY_TAG values set by the Orchestrator and Workers. See Query tagging.

Validation outcomes are classified into three categories:

Category	Description
OK	Values match exactly between the source database and Snowflake.
Warning	The Snowflake table has minor differences that don’t affect the data (for example, higher numeric precision).
Error	Values don’t match between the original database and the Snowflake database.

When L3 row validation is enabled, mismatches can be classified as FAILURE, DUPLICATE_SOURCE, or DUPLICATE_TARGET. For console progress, CSV reports (output_directory_path), dashboard tabs, and max_threads, see Reporting and parallelism on the technical documentation page.

Considerations and recommendations¶

Connecting to Snowflake with a PAT¶

Use Programmatic Access Tokens (PATs) for connections used by the Orchestrator and Workers. This avoids the need to constantly authenticate through the browser or with an authenticator app. You need to establish a Network Policy or temporarily bypass the requirement for a Network Policy (which you can do from Snowsight).

Running Orchestrator and Workers on SPCS¶

Use scai data orchestrator setup and scai data worker setup to deploy on Snowpark Container Services (SPCS). These commands create the required SPCS services on the specified compute pool and wait for them to reach a running state.

Keep the following in mind:

You can run only one component (Orchestrator or Workers) in SPCS and the other locally using scai data orchestrator start --local or scai data worker start --local.
Monitor the SPCS service and suspend it when it isn’t being used (scai data orchestrator stop and scai data worker stop).
Depending on the network configuration of the source system, you might need to configure an External Access Integration so that these services can connect to your source system.

Initial testing¶

For an early test run, use a separate validation configuration whose tables array lists only the table or small set of tables you want to validate. On each of those entries:

Set where_clause (and optionally target_where_clause) to an SQL-like predicate so only a subset of rows is validated, for example, a bounded primary-key range or a narrow date range in the source dialect.
Keep validation_configuration.row_validation disabled for the first run, and enable it later on a smaller subset of tables.
Use a small target_partition_size_mb or target_partition_size_rows to keep partitions tiny during the test.

After you confirm connectivity and results, create your full workflow: remove or relax where_clause, adjust partition sizes, and enable row validation only on the tables where it is needed.

Managing Workers¶

The time it takes to complete a validation workflow depends on many variables, but the number of Workers (and threads per Worker) has the greatest impact, as it determines how many validation tasks can be executed in parallel. Consider the following:

You don’t need to run two Workers on the same machine. If you want more parallelism on a single machine, increase the thread count instead (set max_parallel_tasks in the worker configuration).
Network bandwidth greatly affects Worker speed and is shared between threads of a Worker.
Even with many Workers and threads running in parallel, the source system might not have enough resources to handle the load.
Keep a low Worker count to avoid overloading your source system.
Consider stopping some or all Workers when the source system is already under heavy load from other operations.

Query tagging¶

Both the Orchestrator and the Worker automatically set Snowflake’s QUERY_TAG session parameter on every query they submit. Tags are compact JSON strings containing identifiers such as the workflow ID, task ID, and component version. You can use these tags to filter and attribute validation queries in QUERY_HISTORY:

SELECT query_text, query_tag, start_time
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE TRY_PARSE_JSON(query_tag):DMVF_WORKFLOW_ID IS NOT NULL
ORDER BY start_time DESC;

Tag key	Present on	Description
`DMVF_VERSION`	Infrastructure queries	Component package version.
`DMVF_WORKFLOW_ID`	Task-processing queries	Workflow that originated the task.
`DMVF_TASK_ID`	Task-processing queries	Individual task identifier.
`DMVF_ORCHESTRATOR_VERSION`	Orchestrator task-processing queries	Orchestrator package version.
`DMVF_WORKER_VERSION`	Worker task-processing queries	Worker package version.

Configuration reference¶

Validation workflow file overview¶

Validation is driven by a single YAML or JSON file. Key sections include source_platform, validation_configuration (L1/L2/L3 toggles and thresholds), comparison_configuration, tables, and optional views. Generate a starter file with scai data validate generate-config -o .scai/config/data-validation-config.yaml.

For the full specification (including max_threads, output_directory_path, column regex filtering, and per-table fields), see Validation workflow file overview and Validation workflow configuration reference on the technical documentation page.

Validation workflow configuration reference¶

The validation workflow configuration file is a JSON object. The following sections describe its structure and properties.

Note

Names that require quoting (or brackets) must be manually quoted as they would normally be in JSON. For example: "tableName": "\"MyCaseSensitiveTable\"".

Top-level object¶

Property	Type	Required	Description
`source_platform`	`String`	Yes	Source dialect identifier: `sqlserver`, `redshift`, `teradata`, `oracle`, or `postgresql`. Must match the `--source-platform` argument used to create the workflow.
`target_platform`	`String`		Defaults to `Snowflake`.
`target_database`	`String`		Default target database name for tables that don’t specify one.
`validation_configuration`	`Object`		Global validation levels and options. See Validation configuration.
`comparison_configuration`	`Object`		Numeric tolerance and optional type mapping file. See Comparison configuration.
`database_mappings`	`Object`		Map of source database names to Snowflake database names.
`schema_mappings`	`Object`		Map of source schema names to Snowflake schema names.
`tables`	`Array`	Yes	At least one table to validate. See Per-table and per-view entry.
`views`	`Array`		Additional view entries. Uses the same shape as `tables`.
`use_snowflake_compute`	`Boolean`		When `true`, enables Snowflake-side computation paths where supported. Default `false`.
`target_partition_size_rows`	`Integer`		Desired rows per partition. Mutually exclusive with `target_partition_size_mb`. Must be greater than 0. Default is 200 MB when both are omitted.
`target_partition_size_mb`	`Integer`		Desired MB per partition. Mutually exclusive with `target_partition_size_rows`. Must be greater than 0. Default is 200 MB when both are omitted.
`use_snowpipe_for_results`	`Boolean`		When `true` (default), L2/L3 validation results are ingested into the shared results tables through Snowpipe. Set to `false` to fall back to per-partition `COPY INTO` tasks.

Validation configuration¶

The validation_configuration object sets global defaults for which validation levels to run. Any field set here can be overridden per table by nesting a validation_configuration object on that table entry.

When validation_configuration is omitted, the Orchestrator applies these defaults: schema validation and metrics validation are enabled; row validation is disabled; when L3 is enabled, row_validation_mode defaults to hybrid; continue_on_failure defaults to false; max_failed_rows_number defaults to 1000; exclude_metrics defaults to false; apply_metric_column_modifier defaults to true.

Property	Type	Description
`schema_validation`	`Boolean`	Level 1: schema and column consistency checks.
`metrics_validation`	`Boolean`	Level 2: statistical metrics comparison.
`row_validation`	`Boolean`	Level 3: row-level or cell-level data comparison.
`row_validation_mode`	`String`	For row validation: typically `row` or `cell`.
`continue_on_failure`	`Boolean`	Whether to continue to the next validation level after a failure.
`max_failed_rows_number`	`Integer`	Cap on failed rows reported for L3 validation. Must be greater than 0 when set.
`exclude_metrics`	`Boolean`	Whether to exclude unsupported metric columns.
`apply_metric_column_modifier`	`Boolean`	Whether to apply metric column modifiers.

Comparison configuration¶

The comparison_configuration object controls numeric tolerance and optional type-mapping overrides used during comparisons.

Property	Type	Description
`tolerance`	`Number`	Numeric comparison tolerance for metrics. Must be greater than 0 when set. Default `0.001` when omitted.
`type_mapping_file_path`	`String`	Optional path to a custom type mapping file for comparisons.

Per-table and per-view entry¶

Each object in tables (or views) describes one database object to validate.

Property	Type	Required	Description
`fully_qualified_name`	`String`	Yes	Source object name. The exact format depends on the source platform.
`use_column_selection_as_exclude_list`	`Boolean`		Default `false`.
`column_selection_list`	`String[]`		Columns to include or exclude, depending on `use_column_selection_as_exclude_list`.
`target_name`	`String`		Target object name override.
`target_database`	`String`		Per-table target database override.
`target_schema`	`String`		Per-table target schema override.
`where_clause`	`String`		Filter on the source side.
`target_where_clause`	`String`		Filter on the target side.
`index_column_list`	`String[]`		Columns used to align rows on the source.
`target_index_column_list`	`String[]`		Columns used to align rows on the target.
`column_mappings`	`Object`		Map of source column name to target column name.
`is_case_sensitive`	`Boolean`		Case sensitivity for identifiers.
`chunk_number`	`Integer`		Chunking hint for validation. Must be greater than 0 when set.
`max_failed_rows_number`	`Integer`		Overrides the global cap for this object.
`exclude_metrics`	`Boolean`		Per-object metrics exclusion override.
`apply_metric_column_modifier`	`Boolean`		Per-object modifier override.
`object_type`	`String`		Typically `TABLE` or `VIEW`.
`column_names_to_partition_by`	`String[]`		Columns used for range-based (`NTILE`) partitioning during validation. Without this, the table is processed as a single partition.
`target_partition_size_rows`	`Integer`		Per-table override for desired rows per partition. Mutually exclusive with `target_partition_size_mb`. Must be greater than 0.
`target_partition_size_mb`	`Integer`		Per-table override for desired MB per partition. Mutually exclusive with `target_partition_size_rows`. Must be greater than 0.
`validation_configuration`	`Object`		Nested object with the same fields as global `validation_configuration` to override defaults for this object only.

Partitioning¶

When column_names_to_partition_by is set, the Orchestrator splits the table into range-based partitions. Cloud Data Validation and Data Migration share the same sizing logic:

Compute a target rows-per-partition from whichever user setting is provided. The two settings are mutually exclusive:
- target_partition_size_rows is used as-is.
- target_partition_size_mb is converted to rows using target_mb / avg_row_mb.
- If neither is set, Cloud Data Validation defaults to 200 MB per partition.
Apply an internal cap. System-imposed maximums (not user-configurable) limit partition size to safe infrastructure bounds.
Derive the partition count as ceil(row_count / effective_rows_per_partition), or 1 when the entire table fits in a single partition.

Example: SQL Server validation¶

The following configuration runs schema and metrics validation on two SQL Server tables, overrides defaults on the second table, and enables row-level validation there:

{
  "source_platform": "sqlserver",
  "database_mappings": {
    "SampleStoreDB": "samplestoredb"
  },
  "schema_mappings": {
    "data_migration_cloud_test": "data_migration_cloud_test"
  },
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": false
  },
  "comparison_configuration": {
    "tolerance": 0.001
  },
  "tables": [
    {
      "fully_qualified_name": "SampleStoreDB.data_migration_cloud_test.store_employee",
      "target_name": "target_employee",
      "column_names_to_partition_by": ["ID"]
    },
    {
      "fully_qualified_name": "SampleStoreDB.data_migration_cloud_test.Sales_Simple",
      "column_names_to_partition_by": ["ID"],
      "index_column_list": ["ID"],
      "target_partition_size_mb": 256,
      "validation_configuration": {
        "row_validation": true,
        "row_validation_mode": "cell",
        "max_failed_rows_number": 500
      },
      "where_clause": "is_deleted = 0"
    }
  ]
}

Example: Redshift validation after UNLOAD migration¶

The following configuration validates Redshift tables that were migrated using the UNLOAD extraction strategy. Schema and metrics validation are enabled globally, and row validation is enabled on a single high-value table:

{
  "source_platform": "redshift",
  "target_database": "TARGET_DB",
  "schema_mappings": {
    "ecommerce_raw": "ecommerce_raw"
  },
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": false
  },
  "tables": [
    {
      "fully_qualified_name": "snowconvert_demo.ecommerce_raw.customers",
      "column_names_to_partition_by": ["customer_id"]
    },
    {
      "fully_qualified_name": "snowconvert_demo.ecommerce_raw.orders",
      "column_names_to_partition_by": ["order_id"],
      "index_column_list": ["order_id"],
      "validation_configuration": {
        "row_validation": true,
        "row_validation_mode": "row"
      }
    }
  ]
}

For the corresponding migration workflow, see Redshift UNLOAD in the Data Migration page.

Example: Teradata validation¶

The following configuration validates Teradata tables using schema and metrics validation, with a custom tolerance and per-table partition sizing:

{
  "source_platform": "teradata",
  "target_database": "TARGET_DB",
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": false
  },
  "comparison_configuration": {
    "tolerance": 0.0001
  },
  "tables": [
    {
      "fully_qualified_name": "tpcds.store_sales",
      "column_names_to_partition_by": ["ss_ticket_number"],
      "target_partition_size_mb": 512
    },
    {
      "fully_qualified_name": "tpcds.customer",
      "target_schema": "PUBLIC",
      "column_names_to_partition_by": ["c_customer_sk"],
      "validation_configuration": {
        "row_validation": true,
        "row_validation_mode": "cell"
      }
    }
  ]
}

Worker configuration¶

The Worker configuration file uses TOML format and is shared between Data Migration and Cloud Data Validation. This configuration is used when running a local Worker with scai data worker start --local.

Section	Property	Type	Description
Top level	`selected_task_source`	`String`	Required. Must be `"snowflake_stored_procedure"`.
`[application]`	`max_parallel_tasks`	`Integer`	Maximum number of tasks the Worker processes in parallel using threads.
`[application]`	`task_fetch_interval`	`Integer`	Interval in seconds between attempts to fetch new tasks from the Orchestrator.
`[application]`	`lease_refresh_interval`	`Integer`	Optional. Interval in seconds between task lease renewals. Default `120`.
`[application]`	`affinity`	`String`	Optional. A user-defined affinity for the Worker.
`[application]`	`snowflake_database_for_metadata`	`String`	Optional. Database where the Orchestrator deployed the task queue. Default `SNOWCONVERT_AI`. Must match the Orchestrator’s `CUSTOM_SNOWFLAKE_DATABASE_FOR_METADATA` if you override it.
`[application]`	`snowflake_schema_for_data_migration_metadata`	`String`	Optional. Schema for `PULL_TASKS` / `COMPLETE_TASK` / `FAIL_TASK`. Default `DATA_MIGRATION`. Must match the Orchestrator’s `CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_MIGRATION_METADATA` if overridden.
`[application]`	`local_results_directory`	`String`	Optional. Base directory where each task’s exported Parquet or CSV files are written before upload. Defaults to `~/.data_exchange_agent/result_data`. Each run uses a subfolder `task_<id>/<timestamp>`. Tilde (`~`) and relative paths are expanded at load time.
`[connections.source.*]`	N/A	`Object`	Source system connection. Workers typically require an ODBC driver to connect to the source.
`[connections.target.snowflake_connection_name]`	`connection_name`	`String`	The name of the connection entry in your `~/.snowflake/config.toml` file.

When selected_task_source is "snowflake_stored_procedure", the Worker calls task-queue stored procedures using application.snowflake_database_for_metadata and application.snowflake_schema_for_data_migration_metadata. These values are independent of the Snowflake session defaults (SNOWFLAKE_DATABASE, SNOWFLAKE_SCHEMA) in the connection profile.

An example configuration file looks like this:

selected_task_source = "snowflake_stored_procedure"

[task_source.snowflake_stored_procedure]
connection_name = "connection_name"

[application]
max_parallel_tasks = 4
task_fetch_interval = 30
# Optional: only if the Orchestrator uses CUSTOM_SNOWFLAKE_* overrides for metadata location
# snowflake_database_for_metadata = "SNOWCONVERT_AI"
# snowflake_schema_for_data_migration_metadata = "DATA_MIGRATION"

# SQL Server connection (standard authentication)
[connections.source.sqlserver]
username = "username"
password = "password"
database = "database_name"
host = "127.0.0.1"
port = 1433

# Snowflake target connection
[connections.target.snowflake_connection_name]
connection_name = "connection_name"

Note

Only one source connection is needed.

Source connection configuration examples¶

The following examples show the supported source connection types for Cloud Data Validation.

1. SQL Server (standard authentication)

[connections.source.sqlserver]
username = "username"
password = "password"
database = "database_name"
host = "127.0.0.1"
port = 1433

The Worker automatically detects the best available ODBC driver for SQL Server, preferring newer versions (ODBC Driver 18 > 17 > 13 > 11). To manually select a driver, set odbc_driver to the exact name returned by pyodbc.drivers():

[connections.source.sqlserver]
odbc_driver = "ODBC Driver 17 for SQL Server"

The encrypt and trust_server_certificate parameters are optional. When omitted, the ODBC driver uses its default behavior:

ODBC Driver 17 and below: encryption is disabled by default.
ODBC Driver 18 and above: encryption is mandatory by default.

[connections.source.sqlserver]
username = "sa"
password = "mypassword"
database = "mydb"
host = "my-server.example.com"
port = 1433
encrypt = true
trust_server_certificate = false

For development environments or SQL Servers without encryption support, either omit the encryption parameters or set encrypt = false.

2. Amazon Redshift (IAM authentication)

[connections.source.redshift]
username = "demo-user"
database = "demo_db"
auth_method = "iam-provisioned-cluster"
cluster_id = "my-aws-cluster"
region = "us-west-2"
access_key_id = "your-access-key-id"
secret_access_key = "your-secret-access-key"

3. Amazon Redshift (standard authentication)

[connections.source.redshift]
username = "myuser"
password = "mypassword"
database = "mydatabase"
host = "my-cluster.abcdef123456.us-west-2.redshift.amazonaws.com"
port = 5439
auth_method = "standard"

4. Teradata

The Worker prefers teradatasql (pure Python, no ODBC install). If unavailable, it falls back to pyodbc with a registered Teradata ODBC driver. Pin the driver with odbc_driver (exact name from your OS driver manager). Use dbc_name when your COP or TDPID alias differs from host. Optional authentication (LDAP, KRB5, or omit for TD2).

For migration, Teradata supports regular, tpt, and write_nos extraction strategies. Validation reads live Teradata over SQL and does not require tbuild or write_nos_* TOML unless the same Worker also runs migration tasks using those strategies.

For the complete Teradata reference (Worker TOML, TPT, WRITE_NOS, validation YAML, view L1 limits, type mappings), see Teradata: Complete configuration reference.

[connections.source.teradata]
host = "your-teradata-host.example.com"
port = 1025
database = "tpcds"
username = "your_username"
password = "your_password"
# odbc_driver = "Teradata Database ODBC Driver 17.20"  # when not using teradatasql
# dbc_name = "TDPID_ALIAS"
# authentication = "LDAP"

Command reference¶

The following table lists every CLI command used to set up and operate Cloud Data Validation.

Command	Purpose
`scai data orchestrator setup --compute-pool <POOL>`	Set up the Orchestrator SPCS service (first time).
`scai data orchestrator setup --compute-pool <POOL> --connection <name>`	Set up the Orchestrator with a specific Snowflake connection.
`scai data orchestrator start`	Resume the Orchestrator SPCS service.
`scai data orchestrator start --local`	Run the Orchestrator as a local process.
`scai data orchestrator stop`	Suspend the Orchestrator SPCS service.
`scai data worker setup --compute-pool <POOL>`	Set up the Worker SPCS service (first time).
`scai data worker start`	Resume the Worker SPCS service.
`scai data worker start --local <config>`	Run a local Worker with a configuration file.
`scai data worker start --local --auto-config`	Run a local Worker with auto-generated configuration.
`scai data worker stop`	Suspend the Worker SPCS service.
`scai data validate generate-config`	Generate a validation configuration file.
`scai data validate generate-config --where "source.schema = 'x'"`	Generate a config filtered by schema.
`scai data validate generate-config -o <path>`	Generate a config to a custom output path.
`scai data validate start --connection <name>`	Start a Cloud Data Validation workflow.
`scai data validate start --config <path> --connection <name>`	Start validation with a custom configuration file.
`scai data validate create-workflow --config <path> --connection <name>`	Create a validation workflow explicitly (returns immediately).
`scai data validate create-workflow --config <path> --connection <name> --watch`	Create a workflow and wait for completion.
`scai data validate status <WORKFLOW_NAME>`	Check workflow status.
`scai data validate status <WORKFLOW_NAME> --watch`	Watch workflow progress until complete.
`scai data validate list`	List all validation workflows.
`scai data validate list --status <STATUS>`	Filter workflows by status.
`scai data validate pause <WORKFLOW_NAME>`	Pause a running validation workflow.
`scai data validate resume <WORKFLOW_NAME>`	Resume a paused validation workflow.
`scai data validate cancel <WORKFLOW_NAME>`	Cancel a validation workflow.