Snowflake Data Validation CLI - Complete Usage Guide¶
Overview¶
The Snowflake Data Validation CLI (snowflake-data-validation or sdv) is a comprehensive command-line tool for validating data migrations between source databases (SQL Server, Teradata, Amazon Redshift) and Snowflake. It provides multi-level validation strategies to ensure data consistency and quality.
Key Features¶
Multi-Level Validation: Schema, statistical metrics, and row-by-row data validation
Multiple Source Platforms: SQL Server, Teradata, and Amazon Redshift
Flexible Execution Modes: Synchronous, asynchronous, and script generation
Comprehensive Configuration: YAML-based configuration with extensive customization options
Detailed Reporting: Validation reports that include detailed mismatch information
Prerequisites¶
Before installing the Snowflake Data Validation CLI, ensure you have the following prerequisites installed:
System Requirements¶
Python: Version 3.8 or higher
pip: Latest version recommended
Operating System: Linux, macOS, or Windows
ODBC Drivers¶
The CLI requires appropriate ODBC drivers to be installed on your system for connecting to source databases. Install the ODBC driver that matches your source database dialect:
SQL Server ODBC Driver¶
For SQL Server as a source database, you need the Microsoft ODBC Driver for SQL Server.
Recommended Version: ODBC Driver 17 or 18 for SQL Server
Installation Instructions:
Linux:
# Ubuntu/Debian
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/$(lsb_release -rs)/prod.list > /etc/apt/sources.list.d/mssql-release.list
apt-get update
ACCEPT_EULA=Y apt-get install -y msodbcsql18
macOS:
# Using Homebrew
brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release
brew update
HOMEBREW_ACCEPT_EULA=Y brew install msodbcsql18
Windows: Download and install from Microsoft’s official download page
Verification:
# List available drivers (should show ODBC Driver 17 or 18 for SQL Server)
odbcinst -q -d
Documentation: Microsoft ODBC Driver for SQL Server
Teradata ODBC Driver¶
For Teradata as a source database, you need the Teradata ODBC Driver.
Recommended Version: Teradata ODBC Driver 17.20 or higher
Installation Instructions:
Download the Teradata Tools and Utilities (TTU) package from Teradata Downloads
Select your operating system and download the appropriate installer
Run the installer and select “ODBC Driver” during installation
Configure the driver according to Teradata’s documentation
Note: You may need to create a Teradata account to access the download page.
Configuration: After installation, you may need to configure the ODBC driver:
Linux/macOS: Edit /etc/odbc.ini and /etc/odbcinst.ini
Windows: Use the ODBC Data Source Administrator
Documentation: Teradata ODBC Driver Documentation
Amazon Redshift ODBC Driver¶
For Amazon Redshift as a source database, you need the Amazon Redshift ODBC Driver or a PostgreSQL ODBC Driver (since Redshift is PostgreSQL-compatible).
Option 1: Amazon Redshift ODBC Driver (Recommended)
Recommended Version: Amazon Redshift ODBC Driver 2.x
Installation Instructions:
Download from Amazon Redshift ODBC Driver Download
Choose your operating system and architecture
Follow the installation wizard
Option 2: PostgreSQL ODBC Driver (Alternative)
Linux:
# Ubuntu/Debian
sudo apt-get install odbc-postgresql
# RHEL/CentOS/Fedora
sudo yum install postgresql-odbc
macOS:
# Using Homebrew
brew install psqlodbc
Windows: Download from PostgreSQL ODBC Driver
Verification:
# List available drivers (should show Amazon Redshift or PostgreSQL drivers)
odbcinst -q -d
Documentation: Amazon Redshift ODBC Driver Documentation
Additional Tools (Optional)¶
unixODBC (Linux/macOS): Required for ODBC driver management
# Ubuntu/Debian
sudo apt-get install unixodbc unixodbc-dev
# macOS
brew install unixodbc
# RHEL/CentOS/Fedora
sudo yum install unixODBC unixODBC-devel
Network Access¶
Ensure your environment has network access to:
Source database (SQL Server, Teradata, or Redshift)
Snowflake account
Package repositories (for pip installation)
Installation¶
Base Installation¶
pip install snowflake-data-validation
Source-Specific Installation¶
Install with the appropriate database driver for your source system:
# For SQL Server as source
pip install "snowflake-data-validation[sqlserver]"
# For Teradata as source
pip install "snowflake-data-validation[teradata]"
# For Amazon Redshift as source
pip install "snowflake-data-validation[redshift]"
# For development with all drivers
pip install "snowflake-data-validation[all]"
Post-Installation Verification¶
After installation, verify the CLI is correctly installed:
# Check version
snowflake-data-validation --version
# Or using the alias
sdv --version
# Verify ODBC drivers are accessible
odbcinst -q -d
Quick Start¶
1. Generate a Configuration Template¶
# Get SQL Server configuration templates
snowflake-data-validation sqlserver get-configuration-files
# Get Teradata configuration templates
snowflake-data-validation teradata get-configuration-files
# Get Redshift configuration templates
snowflake-data-validation redshift get-configuration-files
2. Auto-Generate Configuration from Connection¶
# Interactive configuration generation for SQL Server
snowflake-data-validation sqlserver auto-generated-configuration-file
# Interactive configuration generation for Teradata
snowflake-data-validation teradata auto-generated-configuration-file
# Interactive configuration generation for Redshift
snowflake-data-validation redshift auto-generated-configuration-file
3. Run Validation¶
# Run synchronous validation
snowflake-data-validation sqlserver run-validation \
--data-validation-config-file ./config/validation_config.yaml
Best Practices & Guidance¶
This section provides strategic guidance on how to approach data validation effectively, minimize resource consumption, and identify issues early.
Incremental Validation Approach¶
Note
Always start small and scale up incrementally. Running full validation on large datasets immediately can:
Consume significant compute resources on both source and target systems
Take hours or days to complete
Make troubleshooting difficult if issues are found
Impact production systems if run during business hours
Recommended Validation Strategy¶
Follow this proven approach to ensure efficient and effective validation:
Phase 1: Start with a Sample Dataset¶
Goal: Verify configuration and establish baseline
Approach:
Test with 1-2 small tables first (< 100,000 rows)
Choose tables with diverse data types to validate type mapping
Verify connectivity and authentication work correctly
Confirm output format meets your needs
Example Configuration:
tables:
- source_table_name: "small_reference_table"
target_table_name: "small_reference_table"
source_schema_name: "dbo"
target_schema_name: "PUBLIC"
target_database_name: "MIGRATION_DB"
where_clause: "reference_id <= 1000" # Limit to first 1000 rows on source
target_where_clause: "reference_id <= 1000" # Limit to first 1000 rows on target
validations:
- validation_type: "row_count"
- validation_type: "column_level"
What to Verify:
✅ Connection to both source and target successful
✅ Schema validation passes
✅ Row count matches
✅ Data types mapped correctly
✅ Validation report generated successfully
Understanding where_clause vs target_where_clause:
The tool provides two filtering options:
where_clause: Applied only to the source table
where_clause: "ProductID <= 45"  # Filters only the source table
target_where_clause: Applied only to the target table (Snowflake)
target_where_clause: "ProductID <= 45"  # Filters only the target table
Common Usage Patterns:
# Pattern 1: Apply same filter to both source and target
tables:
- source_table_name: "products"
where_clause: "ProductID <= 100" # Filter source
target_where_clause: "ProductID <= 100" # Filter target with same condition
validations:
- validation_type: "row_count"
- validation_type: "column_level"
# Pattern 2: Different filters for source and target
# Useful when target has additional test data to exclude
tables:
- source_table_name: "products"
where_clause: "ProductID <= 100" # Filter source
target_where_clause: "ProductID <= 100 AND ProductID != 5" # Exclude test data in target
validations:
- validation_type: "row_count"
- validation_type: "column_level"
# Pattern 3: Filter only source (validate all target data)
tables:
- source_table_name: "products"
where_clause: "ProductID <= 100" # Only validate subset from source
# No target_where_clause - validates all matching rows in target
validations:
- validation_type: "row_count"
Phase 2: Verify Small Subsets of Production Tables¶
Goal: Test against actual production data patterns with limited scope
Approach:
Select a subset of rows from production tables (10,000 - 100,000 rows)
Use where_clause and target_where_clause to restrict data
Focus on recent data or specific partitions
Validate critical business columns first
Example Configuration:
tables:
- source_table_name: "large_transactions_table"
target_table_name: "large_transactions_table"
source_schema_name: "sales"
target_schema_name: "SALES"
target_database_name: "MIGRATION_DB"
where_clause: "transaction_date >= '2024-01-01' AND transaction_date < '2024-02-01'"
target_where_clause: "transaction_date >= '2024-01-01' AND transaction_date < '2024-02-01'"
validations:
- validation_type: "row_count"
- validation_type: "column_level"
columns_to_validate:
- "transaction_id"
- "amount"
- "customer_id"
- "transaction_date"
Key Points:
🎯 Use date ranges or ID ranges to limit scope
🎯 Test different data patterns (recent vs. historical, high-volume dates)
🎯 Validate critical business columns before validating all columns
Phase 3: Partition-Based Validation for Large Tables¶
Goal: Validate large tables efficiently using partitioning strategy
⚠️ IMPORTANT: For tables with millions or billions of rows, validating the entire table at once is:
Resource-intensive (high compute costs)
Time-consuming (can take hours/days)
Risky (harder to identify specific issue patterns)
Recommended Approach for Large Tables:
Strategy 1: Date-Based Partitioning
Validate data in chunks based on date ranges:
# Validation 1: January 2024
tables:
- source_table_name: "orders"
target_table_name: "orders"
source_schema_name: "dbo"
target_schema_name: "PUBLIC"
target_database_name: "MIGRATION_DB"
where_clause: "order_date >= '2024-01-01' AND order_date < '2024-02-01'"
target_where_clause: "order_date >= '2024-01-01' AND order_date < '2024-02-01'"
validations:
- validation_type: "row_count"
- validation_type: "column_level"
# Validation 2: February 2024
# Create separate config with:
# where_clause: "order_date >= '2024-02-01' AND order_date < '2024-03-01'"
# target_where_clause: "order_date >= '2024-02-01' AND order_date < '2024-03-01'"
Strategy 2: Modulo-Based Sampling
Use modulo arithmetic to sample evenly distributed rows:
tables:
- source_table_name: "customers"
target_table_name: "customers"
source_schema_name: "dbo"
target_schema_name: "PUBLIC"
target_database_name: "MIGRATION_DB"
where_clause: "customer_id % 100 = 0" # 1% sample - evenly distributed
target_where_clause: "customer_id % 100 = 0"
validations:
- validation_type: "row_count"
- validation_type: "column_level"
Strategy 3: Statistical Sampling
For very large tables (> 100M rows), validate representative samples:
tables:
- source_table_name: "clickstream_events"
target_table_name: "clickstream_events"
source_schema_name: "dbo"
target_schema_name: "PUBLIC"
target_database_name: "MIGRATION_DB"
where_clause: "event_id % 10000 = 0" # ~0.01% sample - statistically distributed
target_where_clause: "event_id % 10000 = 0"
validations:
# First: Validate aggregates and row count (fast)
- validation_type: "row_count"
- validation_type: "aggregate_metrics"
# Second: Validate a statistical sample of rows
- validation_type: "column_level"
Strategy 4: Progressive Partition Validation
Validate multiple partitions progressively:
# config_q1.yaml - Validate Q1 2024
tables:
- source_table_name: "orders"
target_table_name: "orders"
where_clause: "order_date >= '2024-01-01' AND order_date < '2024-04-01'"
target_where_clause: "order_date >= '2024-01-01' AND order_date < '2024-04-01'"
validations:
- validation_type: "row_count"
- validation_type: "column_level"
# config_q2.yaml - Validate Q2 2024 (run after Q1 passes)
tables:
- source_table_name: "orders"
target_table_name: "orders"
where_clause: "order_date >= '2024-04-01' AND order_date < '2024-07-01'"
target_where_clause: "order_date >= '2024-04-01' AND order_date < '2024-07-01'"
validations:
- validation_type: "row_count"
- validation_type: "column_level"
# Validate Q1 2024
sdv sqlserver run-validation --data-validation-config-file config_q1.yaml
# If Q1 passes, validate Q2 2024
sdv sqlserver run-validation --data-validation-config-file config_q2.yaml
# Continue for remaining quarters
Phase 4: Full Validation¶
Goal: Complete comprehensive validation after successful subset testing
When to Run Full Validation:
✅ Sample validations pass successfully
✅ Subset validations show no data quality issues
✅ You have allocated sufficient time and compute resources
✅ Preferably during off-peak hours
Considerations:
Use asynchronous validation for large datasets to avoid timeouts
Consider using script generation mode to run validations in parallel
Monitor resource consumption on both source and target systems
Plan for validation to run during maintenance windows
Example:
# Generate scripts for parallel execution
sdv sqlserver generate-validation-scripts \
--data-validation-config-file full_validation_config.yaml \
--output-directory ./validation_scripts
# Review generated scripts and execute in parallel or scheduled
Performance Optimization Tips¶
1. Use Appropriate Validation Types¶
Not all tables need all validation types:
# For reference/lookup tables (small, static)
validations:
- validation_type: "schema"
- validation_type: "row_count"
- validation_type: "column_level" # Full validation OK
# For large transaction tables
validations:
- validation_type: "schema"
- validation_type: "row_count"
- validation_type: "aggregate_metrics" # Use aggregates instead of row-by-row
2. Prioritize Critical Columns¶
For large tables, validate critical business columns first:
validations:
- validation_type: "column_level"
columns_to_validate:
# Start with business-critical columns
- "customer_id"
- "transaction_amount"
- "transaction_date"
# Add more columns after initial validation succeeds
3. Leverage Partitioning Metadata¶
If your tables are partitioned, validate partition by partition:
# Config 1: Validate partition 1
tables:
- source_table_name: "orders"
target_table_name: "orders"
where_clause: "partition_key = '2024-01'"
target_where_clause: "partition_key = '2024-01'"
validations:
- validation_type: "row_count"
- validation_type: "column_level"
# Config 2: Validate partition 2
tables:
- source_table_name: "orders"
target_table_name: "orders"
where_clause: "partition_key = '2024-02'"
target_where_clause: "partition_key = '2024-02'"
validations:
- validation_type: "row_count"
- validation_type: "column_level"
4. Use Asynchronous Validation for Production¶
For production environments, use asynchronous validation:
# Async validation returns immediately, processes in Snowflake
sdv sqlserver run-async-validation \
--data-validation-config-file config.yaml \
--poll-interval 30 \
--max-wait-time 3600
Cost and Resource Management¶
Estimate Query Costs¶
Before running validation on large tables:
Estimate row counts:
-- Source
SELECT COUNT(*) FROM source_table WHERE <partition_filter>;
-- Target
SELECT COUNT(*) FROM target_table WHERE <partition_filter>;
Estimate data volume:
-- Check table size
SELECT table_name, row_count, bytes,
       bytes / (1024*1024*1024) as size_gb
FROM information_schema.tables
WHERE table_name = 'your_table';
Start with small partitions: If estimated size > 10 GB, break into smaller chunks
Compute Warehouse Sizing (Snowflake)¶
Small tables (< 1M rows): XS or S warehouse
Medium tables (1M - 10M rows): S or M warehouse
Large tables (> 10M rows): M or L warehouse, use partitioning
Very large tables (> 100M rows): L or XL warehouse, mandatory partitioning
Common Pitfalls to Avoid¶
| ❌ Don’t Do This | ✅ Do This Instead |
|---|---|
| Validate entire 1B row table at once | Validate in partitions of 10M-100M rows |
| Run validation during business hours on production | Schedule during off-peak hours or use read replicas |
| Skip sample testing and go straight to full validation | Always validate samples first |
| Use same configuration for all table sizes | Tailor validation strategy to table size |
| Validate all columns for all tables | Prioritize critical columns, especially for large tables |
| Ignore resource consumption | Monitor and set appropriate compute resources |
Validation Checklist¶
Use this checklist to ensure you’re following best practices:
Before Starting:
[ ] ODBC drivers installed and tested
[ ] Connectivity to source and target verified
[ ] Configuration template generated
[ ] Test credentials have appropriate read permissions
Initial Testing (Phase 1):
[ ] Selected 1-2 small tables for initial test
[ ] Configuration file created and reviewed
[ ] Sample validation executed successfully
[ ] Validation report reviewed and understood
Subset Validation (Phase 2):
[ ] Identified subset of production data to validate
[ ] Used where_clause and target_where_clause to restrict rows
[ ] Validated 10,000 - 100,000 rows successfully
[ ] Reviewed results for any data quality issues
Large Table Strategy (Phase 3):
[ ] Identified tables > 10M rows
[ ] Chosen partitioning strategy (date, ID range, modulo)
[ ] Estimated compute costs for validation
[ ] Tested validation on 1-2 partitions first
[ ] Documented partition validation schedule
Production Validation (Phase 4):
[ ] All subset validations passed
[ ] Resource allocation planned (compute, time)
[ ] Validation scheduled during maintenance window
[ ] Using asynchronous or script generation mode
[ ] Monitoring plan in place
Example: Complete Validation Strategy¶
Here’s a complete example of validating a large e-commerce database:
Day 1: Initial Setup and Small Tables
# Test with small reference tables
sdv sqlserver run-validation --data-validation-config-file config_small_tables.yaml
# Tables: product_categories (1K rows), payment_types (50 rows)
Day 2: Subset of Medium Tables
# Test with recent data from medium tables
sdv sqlserver run-validation --data-validation-config-file config_subset_medium.yaml
# Tables: customers WHERE created_date >= '2024-01-01' (50K rows)
# products WHERE product_id <= 10000 (10K rows)
Day 3: Partition Strategy for Large Tables
# Validate one month of orders
sdv sqlserver run-validation --data-validation-config-file config_orders_jan2024.yaml
# Table: orders WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31' (500K rows)
Day 4-5: Progressive Partition Validation
# Generate scripts for all partitions
sdv sqlserver generate-validation-scripts \
--data-validation-config-file config_all_partitions.yaml \
--output-directory ./scripts
# Review and execute scripts for each partition
Day 6: Full Validation (Off-Peak)
# Run complete validation during weekend maintenance window
sdv sqlserver run-async-validation \
--data-validation-config-file config_full_validation.yaml \
--poll-interval 60 \
--max-wait-time 28800 # 8 hours
CLI Commands¶
Command Structure¶
All commands follow this consistent structure:
snowflake-data-validation <source_dialect> <command> [options]
# Or use the shorter alias
sdv <source_dialect> <command> [options]
Where:
<source_dialect> is one of: sqlserver, teradata, redshift
<command> is one of:
run-validation - Run synchronous validation
run-async-validation - Run asynchronous validation
generate-validation-scripts - Generate validation scripts
get-configuration-files - Get configuration templates
auto-generated-configuration-file - Interactive config generation
Global Options¶
These options can be used with the CLI without specifying a source dialect or command:
Check Version¶
Display the current installed version of the Snowflake Data Validation CLI:
# Using full command name
snowflake-data-validation --version
# Using the alias
sdv --version
Output Example:
snowflake-data-validation 1.2.3
Use Cases:
Verify successful installation
Check which version is currently installed
Confirm version before reporting issues
Ensure compatibility with documentation
Help¶
Display general help information:
# General help
snowflake-data-validation --help
sdv --help
# Command-specific help
sdv sqlserver --help
sdv teradata run-validation --help
sdv redshift generate-validation-scripts --help
Dialect-Specific Command References¶
For detailed command documentation specific to your source database, see the following pages:
SQL Server Commands Reference - Complete command reference for SQL Server migrations
Teradata Commands Reference - Complete command reference for Teradata migrations
Amazon Redshift Commands Reference - Complete command reference for Redshift migrations
Each page provides:
Detailed syntax for all commands
Complete option descriptions with examples
Connection configuration specifics
Dialect-specific examples
Troubleshooting tips for that platform
Best practices for that database type
SQL Server Commands¶
For complete SQL Server command documentation, see SQL Server Commands Reference.
Common Commands¶
Run Validation:
sdv sqlserver run-validation \
--data-validation-config-file ./configs/sqlserver_validation.yaml
Generate Scripts:
sdv sqlserver generate-validation-scripts \
--data-validation-config-file /path/to/config.yaml
Get Templates:
sdv sqlserver get-configuration-files \
--templates-directory ./templates
For complete documentation, see SQL Server Commands Reference.
Teradata Commands¶
For complete Teradata command documentation, see Teradata Commands Reference.
Common Commands¶
Run Validation:
sdv teradata run-validation \
--data-validation-config-file ./configs/teradata_validation.yaml
Generate Scripts:
sdv teradata generate-validation-scripts \
./config.yaml \
--output-directory ./scripts
Get Templates:
sdv teradata get-configuration-files \
--templates-directory ./templates
For complete documentation, see Teradata Commands Reference.
Amazon Redshift Commands¶
For complete Amazon Redshift command documentation, see Redshift Commands Reference.
Common Commands¶
Run Validation:
sdv redshift run-validation \
--data-validation-config-file ./configs/redshift_validation.yaml
Generate Scripts:
sdv redshift generate-validation-scripts \
--data-validation-config-file /path/to/config.yaml
Get Templates:
sdv redshift get-configuration-files \
--templates-directory ./templates
For complete documentation, see Redshift Commands Reference.
Configuration File Reference¶
Global Configuration¶
The global configuration section defines the overall behavior of the validation process.
# Platform configuration
source_platform: SqlServer # Options: SqlServer, Teradata, Redshift, Snowflake
target_platform: Snowflake # Currently only Snowflake is supported
# Output configuration
output_directory_path: /path/to/output/directory
# Threading configuration
max_threads: auto # Options: "auto" or positive integer (1-32)
# Teradata-specific configuration (required only for Teradata)
target_database: TARGET_DB_NAME
# Directory path for source validation file
source_validation_files_path: /path/to/source_validation_file/directory
# Directory path for target validation file
target_validation_files_path: /path/to/target_validation_file/directory
Platform Configuration Options¶
source_platform (required)
Type: String
Valid Values: SqlServer, Teradata, Redshift, Snowflake
Description: The source database platform for validation
Example:
source_platform: SqlServer
target_platform (required)
Type: String
Valid Values: Snowflake
Description: The target database platform (currently only Snowflake is supported)
Example:
target_platform: Snowflake
output_directory_path (required)
Type: String (path)
Description: Directory where validation results, logs, and reports will be saved
Example:
output_directory_path: /home/user/validation_output
max_threads (optional)
Type: String or Integer
Valid Values: "auto" or positive integer (1-32)
Default: "auto"
Description: Controls parallelization for validation operations
"auto": Automatically detects the optimal thread count based on CPU cores
Integer value: Specifies the exact number of threads to use
Examples:
max_threads: auto  # Auto-detect optimal threads
max_threads: 4     # Use exactly 4 threads
max_threads: 16    # Use 16 threads
target_database (required for Teradata only)
Type: String
Description: Target database name in Snowflake for Teradata validations
Example:
target_database: PROD_DB
source_validation_files_path (optional)
Type: String (path)
Description: Path to the directory containing the source validation files.
Example:
source_validation_files_path: /path/to/source_validation_file/directory
target_validation_files_path (optional)
Type: String (path)
Description: Path to the directory containing the target validation files.
Example:
target_validation_files_path: /path/to/target_validation_file/directory
Connection Configuration¶
Define how to connect to source and target databases.
Source Connection Configuration¶
SQL Server Source Connection¶
source_connection:
mode: credentials
host: "sqlserver.company.com"
port: 1433
username: "sqlserver_user"
password: "secure_password"
database: "source_database"
trust_server_certificate: "no" # Optional: yes/no
encrypt: "yes" # Optional: yes/no/optional
Connection Fields:
mode (required)
Type: String
Valid Values: credentials
Description: Connection mode for SQL Server
host (required)
Type: String
Description: SQL Server hostname or IP address
Example: "sqlserver.company.com" or "192.168.1.100"
port (required)
Type: Integer
Default: 1433
Description: SQL Server port number
username (required)
Type: String
Description: SQL Server authentication username
password (required)
Type: String
Description: SQL Server authentication password
Security Note: Consider using environment variables or secret management
database (required)
Type: String
Description: SQL Server database name
trust_server_certificate (optional)
Type: String
Valid Values: "yes", "no"
Default: "no"
Description: Whether to trust the server certificate for SSL/TLS connections
encrypt (optional)
Type: String
Valid Values: "yes", "no", "optional"
Default: "yes"
Description: Connection encryption setting
Teradata Source Connection¶
source_connection:
mode: credentials
host: "teradata.company.com"
username: "teradata_user"
password: "secure_password"
database: "source_database"
Connection Fields:
mode (required)
Type: String
Valid Values: credentials
host (required)
Type: String
Description: Teradata hostname or IP address
username (required)
Type: String
Description: Teradata authentication username
password (required)
Type: String
Description: Teradata authentication password
database (required)
Type: String
Description: Teradata database name
Amazon Redshift Source Connection¶
source_connection:
mode: credentials
host: "redshift-cluster.region.redshift.amazonaws.com"
port: 5439
username: "redshift_user"
password: "secure_password"
database: "source_database"
Connection Fields:
mode (required)
Type: String
Valid Values: credentials
host (required)
Type: String
Description: Redshift cluster endpoint
port (required)
Type: Integer
Default: 5439
Description: Redshift port number
username (required)
Type: String
Description: Redshift authentication username
password (required)
Type: String
Description: Redshift authentication password
database (required)
Type: String
Description: Redshift database name
Target Connection (Snowflake)¶
Snowflake connections support three modes: name, default, and credentials (IPC only and SnowConvert exclusive).
Option 1: Named Connection¶
Use a pre-configured Snowflake connection saved in your Snowflake connections file.
target_connection:
mode: name
name: "my_snowflake_connection"
Fields:
mode (required): Must be "name"
name (required): Name of the saved Snowflake connection
Option 2: Default Connection¶
Use the default Snowflake connection from your environment.
target_connection:
mode: default
Fields:
mode (required): Must be "default"
Option 3: Credentials Mode (IPC Only)¶
Note: The credentials mode is only available when using IPC (Inter-Process Communication) commands directly via CLI parameters, not in YAML configuration files. This mode is exclusive to the SnowConvert UI.
Validation Configuration¶
Controls which validation levels are executed.
validation_configuration:
schema_validation: true # Level 1: Schema validation
metrics_validation: true # Level 2: Statistical metrics validation
row_validation: false # Level 3: Row-level data validation
max_failed_rows_number: 100 # Maximum failed rows to report (applies only for row validation)
exclude_metrics: false # Exclude statistical metrics from validation
apply_metric_column_modifier: false # Apply column modifiers for metrics
custom_templates_path: /path/to/templates # Optional: Custom query templates
Validation Options:
schema_validation (optional)
Type: Boolean
Default: true
Description: Validates table and column schema consistency
Checks:
Column names match between source and target
Data types are compatible
Column nullability settings
Primary key definitions
metrics_validation (optional)
Type: Boolean
Default: true
Description: Validates statistical metrics for each column
Checks:
Row counts
Distinct value counts
Null value counts
Min/max values
Average, sum, standard deviation (for numeric columns)
row_validation (optional)
Type: Boolean
Default: false
Description: Validates data at the row level using hash-based comparison
Note: Requires index columns for row identification. If not specified in the configuration, the tool attempts to auto-detect them from primary keys.
Warning: This is the most resource-intensive validation level
Checks:
MD5 hash comparison of row chunks
Identifies specific rows with differences
max_failed_rows_number (optional)
Type: Integer
Default: 100
Minimum: 1
Description: Maximum number of failed rows to report per table
Example:
max_failed_rows_number: 250
exclude_metrics (optional)
Type: Boolean
Default: false
Description: When true, excludes certain statistical metrics (avg, sum, stddev, variance) from validation
Use Case: Useful for large tables where statistical calculations might cause an overflow
apply_metric_column_modifier (optional)
Type: Boolean
Default: true
Description: Applies column modifiers defined in metric templates
Use Case: Advanced users with custom metric calculations
custom_templates_path (optional)
Type: String (path)
Description: Path to directory containing custom Jinja2 query templates
Example:
custom_templates_path: /opt/validation/custom_templates
Table Configuration¶
Defines which tables to validate and how to validate them.
tables:
# Table 1: Include specific columns
- fully_qualified_name: database.schema.table1
use_column_selection_as_exclude_list: false
column_selection_list:
- customer_id
- customer_name
- email
index_column_list:
- customer_id
where_clause: "status = 'ACTIVE'"
target_where_clause: "status = 'ACTIVE'"
is_case_sensitive: false
chunk_number: 10
max_failed_rows_number: 50
column_mappings:
customer_id: cust_id
customer_name: name
# Table 2: Exclude specific columns
- fully_qualified_name: database.schema.table2
use_column_selection_as_exclude_list: true
column_selection_list:
- audit_timestamp
- created_by
- modified_by
index_column_list: []
# Table 3: Simple configuration
- fully_qualified_name: database.schema.table3
use_column_selection_as_exclude_list: false
column_selection_list: []
Table Configuration Fields:
fully_qualified_name (required)
Type: String
Format: database.schema.table or schema.table
Description: Full table identifier in the source database
Examples:
fully_qualified_name: my_database.dbo.customers
fully_qualified_name: public.orders  # For Redshift
target_database (optional)
Type: String
Default: Source database name from fully_qualified_name field
Description: Target database name if different from source database name
Example:
target_database: target_database_name
target_schema (optional)
Type: String
Default: Source schema name from fully_qualified_name field
Description: Target schema name if different from source schema name
Example:
target_schema: target_schema_name
target_name (optional)
Type: String
Default: Source table name from fully_qualified_name field
Description: Target table name if different from source table name (a combined example using the target override fields appears at the end of this section)
Example:
target_name: customers_new
use_column_selection_as_exclude_list (required)
Type: Boolean
Description: Determines how column_selection_list is interpreted
false: Include only the specified columns
true: Exclude the specified columns (include all others)
Examples:
# Include only these columns
use_column_selection_as_exclude_list: false
column_selection_list:
  - id
  - name
  - email

# Exclude these columns (include all others)
use_column_selection_as_exclude_list: true
column_selection_list:
  - internal_notes
  - audit_timestamp
column_selection_list (required)
Type: List of strings
Description: List of column names to include or exclude
Note: Use an empty list [] to include all columns
index_column_list (optional)
Type: List of strings
Default: Auto-detected from primary keys
Description: Columns to use as unique identifiers for row validation
Use Case: Specify when the table doesn’t have a primary key or you want to use different columns
Example:
index_column_list:
  - customer_id
  - order_date
target_index_column_list (optional)
Type: List of strings
Description: Index columns in the target table (if different from source)
Note: Automatically derived from column_mappings if not specified
where_clause (optional)
Type: String
Default: "" (empty, no filter)
Description: SQL WHERE clause to filter source data (without the "WHERE" keyword)
Examples:
where_clause: "created_date >= '2024-01-01'"
where_clause: "status IN ('ACTIVE', 'PENDING') AND region = 'US'"
where_clause: "amount > 1000 AND customer_type = 'PREMIUM'"
target_where_clause (optional)
Type: String
Default: "" (empty, no filter)
Description: SQL WHERE clause to filter target data
Best Practice: Should match where_clause to ensure consistent comparison
Example:
target_where_clause: "created_date >= '2024-01-01'"
is_case_sensitive (optional)
Type: Boolean
Default: false
Description: Whether column name matching should be case-sensitive
chunk_number (optional)
Type: Integer
Default: 0 (no chunking)
Minimum: 0
Description: Number of chunks to split row validation into
Use Case: Large tables benefit from chunking for better performance
Example:
chunk_number: 20 # Split into 20 chunks for parallel processing
max_failed_rows_number (optional)
Type: Integer
Minimum: 1
Description: Maximum failed rows to report for this specific table
Note: Overrides the global max_failed_rows_number setting
column_mappings (optional)
Type: Dictionary (key-value pairs)
Description: Maps source column names to target column names when they differ
Format: source_column_name: target_column_name
Example:
column_mappings:
  cust_id: customer_id
  cust_name: customer_name
  addr: address
exclude_metrics (optional)
Type: Boolean
Description: Exclude metrics validation for this specific table
Note: Overrides the global exclude_metrics setting
apply_metric_column_modifier (optional)
Type: Boolean
Description: Apply column modifiers for this specific table
Note: Overrides the global apply_metric_column_modifier setting
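Putting the table-level fields together, here is an illustrative sketch (database, schema, table, and column names are hypothetical) that redirects validation to a differently named Snowflake table and overrides the global metric settings for just this table:
tables:
  - fully_qualified_name: legacy_db.dbo.cust_master
    target_database: MIGRATION_DB          # Target database differs from source
    target_schema: PUBLIC                  # Target schema differs from source
    target_name: customers                 # Target table name differs from source
    use_column_selection_as_exclude_list: false
    column_selection_list: []
    index_column_list:
      - cust_id
    column_mappings:
      cust_id: customer_id                 # target index columns are derived from this mapping
    exclude_metrics: true                  # Overrides the global exclude_metrics setting
    apply_metric_column_modifier: false    # Overrides the global setting for this table only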
Comparison Configuration¶
Controls comparison behavior and tolerance levels.
comparison_configuration:
tolerance: 0.01 # 1% tolerance for statistical comparisons
Comparison Options:
tolerance (optional)
Type: Float
Default: 0.001 (0.1%)
Description: Acceptable tolerance for statistical metric differences
Use Case: Allows for small differences due to rounding or data type conversions
Examples:
tolerance: 0.001  # 0.1% tolerance (very strict)
tolerance: 0.01   # 1% tolerance (recommended)
tolerance: 0.05   # 5% tolerance (lenient)
Logging Configuration¶
Controls logging behavior for validation operations.
logging_configuration:
level: INFO # Overall logging level
console_level: WARNING # Console-specific level
file_level: DEBUG # File-specific level
Logging Options:
level (optional)
Type: String
Valid Values: DEBUG, INFO, WARNING, ERROR, CRITICAL
Default: INFO
Description: Default logging level for all loggers
Level Descriptions:
DEBUG: Detailed diagnostic information
INFO: General informational messages
WARNING: Warning messages for potentially problematic situations
ERROR: Error messages for serious issues
CRITICAL: Critical errors that may cause application failure
console_level (optional)
Type: String
Valid Values: DEBUG, INFO, WARNING, ERROR, CRITICAL
Default: Same as level
Description: Logging level for console output
Use Case: Set to WARNING or ERROR to reduce console noise
file_level (optional)
Type: String
Valid Values: DEBUG, INFO, WARNING, ERROR, CRITICAL
Default: Same as level
Description: Logging level for file output
Use Case: Set to DEBUG for detailed file logs while keeping console clean
Example Configurations:
# Verbose logging for troubleshooting
logging_configuration:
level: DEBUG
console_level: DEBUG
file_level: DEBUG
# Production-friendly logging
logging_configuration:
level: INFO
console_level: WARNING
file_level: INFO
# Minimal console output with detailed file logs
logging_configuration:
level: INFO
console_level: ERROR
file_level: DEBUG
Note: CLI --log-level parameter overrides configuration file settings.
Database and Schema Mappings¶
Map source database/schema names to target names when they differ.
database_mappings:
source_db1: target_db1
source_db2: target_db2
legacy_database: modern_database
schema_mappings:
dbo: public
source_schema: target_schema
old_schema: new_schema
Mapping Options:
database_mappings (optional)
Type: Dictionary
Description: Maps source database names to target database names
Use Case: When database names differ between source and Snowflake
Example:
database_mappings:
  PROD_SQL: PROD_SNOWFLAKE
  DEV_SQL: DEV_SNOWFLAKE
schema_mappings (optional)
Type: Dictionary
Description: Maps source schema names to target schema names
Use Case: When schema names differ between source and Snowflake
Example:
schema_mappings:
  dbo: PUBLIC
  sales: SALES_DATA
  hr: HUMAN_RESOURCES
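As a quick illustration of how these mappings combine with a table entry (names are hypothetical, and this assumes the mappings are applied to the database and schema parts of each fully_qualified_name):
# Source table PROD_SQL.dbo.customers is compared against
# PROD_SNOWFLAKE.PUBLIC.customers in Snowflake
database_mappings:
  PROD_SQL: PROD_SNOWFLAKE
schema_mappings:
  dbo: PUBLIC
tables:
  - fully_qualified_name: PROD_SQL.dbo.customers
    use_column_selection_as_exclude_list: false
    column_selection_list: []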
Complete Configuration Examples¶
Example 1: SQL Server to Snowflake - Basic Validation¶
# Global configuration
source_platform: SqlServer
target_platform: Snowflake
output_directory_path: ./validation_results
max_threads: auto
# Source connection
source_connection:
mode: credentials
host: sqlserver.company.com
port: 1433
username: sql_user
password: sql_password
database: production_db
trust_server_certificate: "no"
encrypt: "yes"
# Target connection
target_connection:
mode: name
name: snowflake_prod
# Validation configuration
validation_configuration:
schema_validation: true
metrics_validation: true
row_validation: false
# Comparison configuration
comparison_configuration:
tolerance: 0.01
# Tables to validate
tables:
- fully_qualified_name: production_db.dbo.customers
use_column_selection_as_exclude_list: false
column_selection_list: []
index_column_list:
- customer_id
- fully_qualified_name: production_db.dbo.orders
use_column_selection_as_exclude_list: true
column_selection_list:
- internal_notes
- audit_log
where_clause: "order_date >= '2024-01-01'"
target_where_clause: "order_date >= '2024-01-01'"
Example 2: Teradata to Snowflake - Comprehensive Validation¶
# Global configuration
source_platform: Teradata
target_platform: Snowflake
output_directory_path: /opt/validation/results
max_threads: 8
target_database: PROD_SNOWFLAKE
# Source connection
source_connection:
mode: credentials
host: teradata.company.com
username: teradata_user
password: teradata_password
database: prod_db
# Target connection
target_connection:
mode: default
# Validation configuration
validation_configuration:
schema_validation: true
metrics_validation: true
row_validation: true
max_failed_rows_number: 100
exclude_metrics: false
apply_metric_column_modifier: false
# Comparison configuration
comparison_configuration:
tolerance: 0.005
# Logging configuration
logging_configuration:
level: INFO
console_level: WARNING
file_level: DEBUG
# Schema mappings
schema_mappings:
prod_db: PUBLIC
# Tables configuration
tables:
- fully_qualified_name: prod_db.sales_data
use_column_selection_as_exclude_list: false
column_selection_list:
- transaction_id
- customer_id
- amount
- transaction_date
index_column_list:
- transaction_id
chunk_number: 10
max_failed_rows_number: 50
- fully_qualified_name: prod_db.customer_master
use_column_selection_as_exclude_list: true
column_selection_list:
- ssn
- credit_card
where_clause: "status = 'ACTIVE'"
target_where_clause: "status = 'ACTIVE'"
column_mappings:
cust_id: customer_id
cust_name: customer_name
Example 3: Redshift to Snowflake - Advanced Configuration¶
# Global configuration
source_platform: Redshift
target_platform: Snowflake
output_directory_path: /data/validation/redshift_migration
max_threads: 16
# Source connection
source_connection:
mode: credentials
host: redshift-cluster.us-east-1.redshift.amazonaws.com
port: 5439
username: redshift_admin
password: redshift_secure_password
database: analytics_db
# Target connection
target_connection:
mode: name
name: snowflake_analytics
# Validation configuration
validation_configuration:
schema_validation: true
metrics_validation: true
row_validation: true
max_failed_rows_number: 200
exclude_metrics: false
# Comparison configuration
comparison_configuration:
tolerance: 0.02
# Logging configuration
logging_configuration:
level: INFO
console_level: ERROR
file_level: DEBUG
# Database mappings
database_mappings:
analytics_db: ANALYTICS_PROD
# Schema mappings
schema_mappings:
public: PUBLIC
staging: STAGING
# Tables configuration
tables:
# Large fact table with chunking
- fully_qualified_name: analytics_db.public.fact_sales
use_column_selection_as_exclude_list: false
column_selection_list: []
index_column_list:
- sale_id
chunk_number: 50
max_failed_rows_number: 500
# Dimension table with column mappings
- fully_qualified_name: analytics_db.public.dim_customer
use_column_selection_as_exclude_list: false
column_selection_list:
- customer_key
- customer_name
- email
- phone
- address
column_mappings:
customer_key: cust_key
customer_name: name
is_case_sensitive: false
# Filtered validation
- fully_qualified_name: analytics_db.staging.incremental_load
use_column_selection_as_exclude_list: true
column_selection_list:
- load_timestamp
- etl_batch_id
where_clause: "load_date >= CURRENT_DATE - 7"
target_where_clause: "load_date >= CURRENT_DATE - 7"
chunk_number: 10
Example 4: Minimal Configuration¶
source_platform: SqlServer
target_platform: Snowflake
output_directory_path: ./output
source_connection:
mode: credentials
host: localhost
port: 1433
username: sa
password: password
database: test_db
target_connection:
mode: default
validation_configuration:
schema_validation: true
metrics_validation: true
row_validation: false
tables:
- fully_qualified_name: test_db.dbo.test_table
use_column_selection_as_exclude_list: false
column_selection_list: []
Advanced Usage¶
Working with Large Tables¶
For large tables, consider these optimization strategies:
1. Enable Chunking¶
tables:
- fully_qualified_name: database.schema.large_table
use_column_selection_as_exclude_list: false
column_selection_list: []
chunk_number: 100 # Split into 100 chunks
index_column_list:
- primary_key_column
2. Increase Thread Count¶
max_threads: 32 # Use maximum threads for parallel processing
3. Filter Data¶
tables:
- fully_qualified_name: database.schema.large_table
use_column_selection_as_exclude_list: false
column_selection_list: []
where_clause: "created_date >= '2024-01-01' AND region = 'US'"
target_where_clause: "created_date >= '2024-01-01' AND region = 'US'"
4. Selective Column Validation¶
tables:
- fully_qualified_name: database.schema.large_table
use_column_selection_as_exclude_list: true
column_selection_list:
- large_text_column
- large_binary_column
- xml_column
Using Custom Query Templates¶
For advanced users, you can provide custom Jinja2 templates:
validation_configuration:
schema_validation: true
metrics_validation: true
row_validation: true
custom_templates_path: /opt/custom_templates
Custom template directory structure:
/opt/custom_templates/
├── sqlserver_columns_metrics_query.sql.j2
├── sqlserver_row_count_query.sql.j2
├── sqlserver_compute_md5_sql.j2
└── snowflake_columns_metrics_query.sql.j2
Asynchronous Validation Workflow¶
For environments with restricted database access or long-running validations:
Step 1: Generate Scripts¶
sdv sqlserver generate-validation-scripts \
--data-validation-config-file config.yaml
This generates SQL scripts in the output directory.
Step 2: Execute Scripts Manually¶
Execute the generated scripts on source and target databases, saving results to CSV files.
Step 3: Run Async Validation¶
sdv sqlserver run-async-validation \
--data-validation-config-file config.yaml
This compares the pre-generated metadata files.
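A minimal configuration sketch for this step, assuming the pre-generated result files from Step 2 are placed in the directories referenced by the optional source_validation_files_path and target_validation_files_path settings described in the Global Configuration section (paths and the table name are illustrative):
source_platform: SqlServer
target_platform: Snowflake
output_directory_path: ./async_validation_output
# Assumed locations of the files produced in Step 2
source_validation_files_path: ./validation_scripts/source_results
target_validation_files_path: ./validation_scripts/target_results
target_connection:
  mode: default
validation_configuration:
  schema_validation: true
  metrics_validation: true
tables:
  - fully_qualified_name: production_db.dbo.customers
    use_column_selection_as_exclude_list: false
    column_selection_list: []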
CI/CD Integration¶
Integrate validation into your deployment pipeline:
#!/bin/bash
# validate_migration.sh
CONFIG_FILE="./configs/validation_config.yaml"
LOG_LEVEL="INFO"
# Run validation
sdv sqlserver run-validation \
--data-validation-config-file "$CONFIG_FILE" \
--log-level "$LOG_LEVEL"
# Check exit code
if [ $? -eq 0 ]; then
echo "âś“ Validation passed successfully"
exit 0
else
echo "âś— Validation failed"
exit 1
fi
Handling Multiple Environments¶
Create separate configuration files for each environment:
configs/
├── dev_validation.yaml
├── staging_validation.yaml
└── prod_validation.yaml
Run validation for specific environment:
# Development
sdv sqlserver run-validation --data-validation-config-file configs/dev_validation.yaml
# Staging
sdv sqlserver run-validation --data-validation-config-file configs/staging_validation.yaml
# Production
sdv sqlserver run-validation --data-validation-config-file configs/prod_validation.yaml
Validation Reports¶
Overview¶
The Snowflake Data Validation tool generates comprehensive CSV reports that document the results of data migration validations between source and target databases. These reports help identify discrepancies in schema, metrics, and row-level data.
Report Types¶
The validation tool generates different types of reports based on the validation levels configured:
1. Main Validation Report (validation_report.csv)¶
This consolidated report contains results from schema and metrics validations for all tables.
2. Row Validation Reports (per table)¶
Separate reports generated for each table when row validation is enabled, containing detailed row-level comparison results.
Main Validation Report Structure¶
The main validation report contains the following columns:
| Column Name | Description |
|---|---|
| VALIDATION_TYPE | Type of validation performed (e.g., SCHEMA VALIDATION, METRICS VALIDATION) |
| TABLE | Fully qualified name of the table being validated (e.g., mydb.myschema.customers) |
| COLUMN_VALIDATED | Name of the column being validated (or table-level attribute for schema validation) |
| EVALUATION_CRITERIA | The specific property being compared (e.g., DATA_TYPE, row_count, max) |
| SOURCE_VALUE | The value from the source database |
| SNOWFLAKE_VALUE | The value from the target (Snowflake) database |
| STATUS | Validation result status (see Status Values section below) |
| COMMENTS | Additional context or explanation for the validation result |
Validation Types Explained¶
Schema Validation¶
Compares structural metadata between source and target tables:
Column existence: Ensures columns present in source exist in target
Data types: Validates column data types match (with configurable mappings)
Nullability: Checks if NULL constraints match between source and target
Primary keys: Verifies primary key definitions
Column precision/scale: Validates numeric precision and scale values
Character length: Compares VARCHAR/CHAR column lengths
Example Schema Validation Row:
SCHEMA VALIDATION,mydb.myschema.customers,customer_id,DATA_TYPE,INT,NUMBER,FAILURE,"Data type mismatch detected"
Metrics Validation¶
Compares statistical metrics calculated on column data:
Row count: Total number of rows in the table
min: Minimum value in numeric/date columns
max: Maximum value in numeric/date columns
count: Count of non-null values
count_distinct: Number of distinct values
avg, sum, stddev, variance: Statistical measures (can be excluded via configuration)
Example Metrics Validation Row:
METRICS VALIDATION,mydb.myschema.orders,order_date,max,2024-12-31,2024-12-31,SUCCESS,""
Row Validation¶
Generates separate per-table reports comparing actual row data using MD5 checksums to detect differences.
Row Validation Report Structure¶
Row validation reports have a different structure focused on identifying specific rows with differences:
| Column Name | Description |
|---|---|
| ROW_NUMBER | Sequential row number in the report |
| TABLE_NAME | Fully qualified table name |
| RESULT | Outcome of the row comparison (see Result Values below) |
| [INDEX_COLUMNS]_SOURCE | Primary key/index column values from source |
| [INDEX_COLUMNS]_TARGET | Primary key/index column values from target |
| SOURCE_QUERY | SQL query to retrieve the row from source database |
| TARGET_QUERY | SQL query to retrieve the row from target database |
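For illustration only, a report row for a table whose index column is order_id might look something like this (the values and the query text are hypothetical; the actual queries are generated by the tool):
1,mydb.myschema.orders,FAILURE,1001,1001,"SELECT * FROM mydb.myschema.orders WHERE order_id = 1001","SELECT * FROM mydb.myschema.orders WHERE order_id = 1001"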
Status Values¶
The STATUS column in the main validation report can have the following values:
| Status | Meaning |
|---|---|
| SUCCESS | Validation passed - values match between source and target |
| FAILURE | Validation failed - values differ between source and target |
| WARNING | Potential issue detected that may require attention |
| NOT_FOUND_SOURCE | Element exists in target but not in source |
| NOT_FOUND_TARGET | Element exists in source but not in target |
Result Values (Row Validation)¶
The RESULT column in row validation reports can have the following values:
| Result | Meaning |
|---|---|
| SUCCESS | Row data matches between source and target |
| FAILURE | Row data differs between source and target (MD5 checksum mismatch) |
| NOT_FOUND_SOURCE | Row exists in target but not in source |
| NOT_FOUND_TARGET | Row exists in source but not in target |
Understanding Validation Results¶
Interpreting Schema Validation Results¶
Success Scenario:
All columns exist in both source and target
Data types match (considering configured type mappings)
Nullability constraints are consistent
All structural attributes align
Common Failure Scenarios:
Data Type Mismatch
Source: VARCHAR(50), Target: VARCHAR(100)
Status: May be SUCCESS if within tolerance, or FAILURE if strict matching is required
Missing Column
Source has column phone_number, target does not
Status: NOT_FOUND_TARGET
Nullability Difference
Source: NOT NULL, Target: NULL
Status: FAILURE
Interpreting Metrics Validation Results¶
Success Scenario:
Row counts match exactly
Statistical metrics are within configured tolerance (default: 0.1%)
All calculated metrics align between source and target
Common Failure Scenarios:
Row Count Mismatch
Source: 10,000 rows, Target: 9,998 rows
Status: FAILURE
Action: Investigate missing rows
Min/Max Value Differences
Source max date: 2024-12-31, Target max date: 2024-12-30
Status: FAILURE
Action: Check for incomplete data migration
Statistical Variance
Source count_distinct: 1,000, Target count_distinct: 995
Status: FAILURE (if beyond tolerance)
Action: Investigate potential duplicates or missing values
Interpreting Row Validation Results¶
Success Scenario:
All rows in source have matching rows in target (by MD5 checksum)
No orphaned rows in either database
Primary key values align correctly
Common Failure Scenarios:
Row Content Mismatch
Same primary key, different column values
Result: FAILURE
Action: Use provided SQL queries to investigate specific differences
Missing Rows
Row exists in source but not in target
Result: NOT_FOUND_TARGET
Action: Check migration completeness
Extra Rows
Row exists in target but not in source
Result: NOT_FOUND_SOURCE
Action: Investigate unexpected data in target
Using the Reports¶
Quick Assessment¶
Filter by STATUS column: Focus on FAILURE, WARNING, NOT_FOUND_SOURCE, and NOT_FOUND_TARGET rows
Group by VALIDATION_TYPE: Assess schema issues separately from metrics issues
Group by TABLE: Identify which tables have the most issues
Investigating Failures¶
For Schema Validation:
Review the EVALUATION_CRITERIA to understand what attribute failed
Compare SOURCE_VALUE vs SNOWFLAKE_VALUE
Check if differences are acceptable (e.g., VARCHAR size increase)
Update type mappings or schema definitions if needed
For Metrics Validation:
Review the metric that failed (e.g., row_count, max, min)
Calculate the difference magnitude
Determine if within acceptable business tolerance
Use the detailed queries to investigate source of discrepancy
For Row Validation:
Open the table-specific row validation report
Identify rows with FAILURE status
Use the provided SOURCE_QUERY and TARGET_QUERY to retrieve actual row data
Compare column-by-column to identify specific field differences
Investigate why values differ (data type conversion, truncation, transformation)
Configuration Options Affecting Reports¶
Tolerance Settings¶
The comparison_configuration.tolerance setting affects metrics validation:
comparison_configuration:
tolerance: 0.01 # 1% tolerance for numeric comparisons
Values within tolerance are marked as SUCCESS
Values beyond tolerance are marked as FAILURE
Validation Levels¶
Control which validations run and therefore which reports are generated:
validation_configuration:
schema_validation: true # Validates table/column structure
metrics_validation: true # Validates statistical metrics
row_validation: false # Validates individual row data (resource intensive)
Excluded Metrics¶
Exclude specific metrics from validation:
validation_configuration:
exclude_metrics: true # Excludes avg, sum, stddev, variance
Maximum Failed Rows¶
Limit the number of failed rows reported in row validation:
validation_configuration:
max_failed_rows_number: 100 # Report up to 100 failed rows per table
Report File Locations¶
Reports are generated in the configured output directory:
<output_directory_path>/
├── <timestamp>_validation_report.csv # Main consolidated report
├── <timestamp>_database.schema.table1_1.csv # Row validation for table1
├── <timestamp>_database.schema.table2_2.csv # Row validation for table2
└── data_validation_<timestamp>.log # Detailed execution log
File naming convention:
Timestamp format: YYYY-MM-DD_HH-MM-SS
Row validation reports include table name and unique ID to prevent collisions
Best Practices¶
Start with Schema Validation: Ensure structural alignment before validating data
Use Appropriate Tolerance: Set realistic tolerance thresholds for metrics validation
Selective Row Validation: Enable row validation only for critical tables (resource intensive)
Iterative Approach: Fix schema issues first, then metrics, then row-level differences
Document Acceptable Differences: Some type conversions or value transformations may be expected
Automate Report Analysis: Use scripts to parse CSV reports and flag critical issues
Preserve Reports: Archive validation reports for audit trails and compliance
Troubleshooting Report Issues¶
All Validations Showing FAILURE¶
Possible Causes:
Incorrect database/schema mappings in configuration
Type mapping file not loaded correctly
Connection to wrong target database
Solution: Verify database_mappings, schema_mappings, and connection settings
Row Validation Shows All NOT_FOUND_TARGET¶
Possible Causes:
Target table empty or not migrated yet
Incorrect target table name
Primary key/index columns mismatch
Solution: Verify target table exists and contains data, check column mappings
Metrics Validation Shows Large Differences¶
Possible Causes:
Incomplete data migration
Data type conversion issues causing value changes
Filter/WHERE clause differences between source and target queries
Solution: Review migration logs, verify row counts first, check data transformations
Report File Not Generated¶
Possible Causes:
Output directory doesn’t exist or lacks write permissions
Validation configuration has all levels set to false
Application crashed before report generation
Solution: Check output path permissions, review logs for errors, enable at least one validation level
Troubleshooting¶
Common Issues and Solutions¶
Issue: “Configuration file not found”¶
Symptom:
Configuration file not found: ./config.yaml
Solution:
Verify the file path is correct
Use absolute paths: /home/user/configs/validation.yaml
Check file permissions
Issue: “Connection error”¶
Symptom:
Connection error: Unable to connect to database
Solutions:
Verify connection parameters:
source_connection:
  host: correct-hostname.com   # Verify hostname
  port: 1433                   # Verify port
  username: valid_user         # Verify username
  password: correct_password   # Verify password
Check network connectivity:
# Test connection to SQL Server
telnet sqlserver.company.com 1433
# Test connection to Teradata
telnet teradata.company.com 1025
# Test connection to Redshift
telnet redshift-cluster.amazonaws.com 5439
Verify firewall rules allow connections from your machine
For SQL Server SSL issues:
source_connection:
  trust_server_certificate: "yes"  # Try this if SSL errors occur
  encrypt: "optional"
Issue: “Invalid parameter”¶
Symptom:
Invalid parameter: max_threads must be 'auto' or a positive integer
Solution:
# Correct
max_threads: auto
max_threads: 4
# Incorrect
max_threads: "4" # Remove quotes
max_threads: 0 # Must be positive
max_threads: -1 # Must be positive
Issue: “Table not found”¶
Symptom:
Table not found: database.schema.table_name
Solutions:
Verify fully qualified name:
# For SQL Server and Teradata
fully_qualified_name: database_name.schema_name.table_name
# For Redshift (no database in FQN)
fully_qualified_name: schema_name.table_name
Check case sensitivity:
tables:
  - fully_qualified_name: MyDatabase.MySchema.MyTable
    is_case_sensitive: true  # Match exact case
Verify table exists in source database
Issue: “YAML formatting error”¶
Symptom:
Error in the format of config.yaml
Solutions:
Check indentation (use spaces, not tabs)
# Correct
tables:
  - fully_qualified_name: db.schema.table
    column_selection_list:
      - column1
      - column2
Incorrect example (mixed indentation with tabs):
tables:
 - fully_qualified_name: db.schema.table
     column_selection_list:
   - column1
Quote special characters:
password: "p@ssw0rd!" # Quote passwords with special chars where_clause: "name = 'O''Brien'" # Escape quotes
Validate YAML syntax using online validators or:
python -c "import yaml; yaml.safe_load(open('config.yaml'))"
Issue: “Validation fails with tolerance errors”¶
Symptom:
Metrics validation failed: Difference exceeds tolerance
Solution: Adjust tolerance in configuration:
comparison_configuration:
tolerance: 0.05 # Increase to 5% tolerance
Issue: “Out of memory errors with large tables”¶
Solutions:
Enable chunking:
tables:
  - fully_qualified_name: large_table
    chunk_number: 100  # Process in smaller chunks
Reduce thread count:
max_threads: 4 # Use fewer threads
Filter data:
tables:
  - fully_qualified_name: large_table
    where_clause: "created_date >= '2024-01-01'"
Exclude large columns:
tables:
  - fully_qualified_name: large_table
    use_column_selection_as_exclude_list: true
    column_selection_list:
      - large_blob_column
      - large_text_column
Getting Help¶
Check logs: Review log files in the output directory
Enable debug logging:
sdv sqlserver run-validation \
  --data-validation-config-file config.yaml \
  --log-level DEBUG
Review validation reports in the output directory
Consult documentation: Full Documentation
Report issues: Email us at snowconvert-support@snowflake.com