Snowflake Data Validation CLI - Complete Usage Guide¶
Overview¶
The Snowflake Data Validation CLI (snowflake-data-validation or sdv) is a comprehensive command-line tool for validating data migrations between source databases (SQL Server, Teradata, Amazon Redshift) and Snowflake. It provides multi-level validation strategies to ensure data consistency and quality.
Key Features¶
Multi-Level Validation: Schema, statistical metrics, and row-by-row data validation
Multiple Source Platforms: SQL Server, Teradata, and Amazon Redshift
Tables and Views Validation: Validate both tables and database views
Flexible Execution Modes: Synchronous, asynchronous, and script generation
Comprehensive Configuration: YAML-based configuration with extensive customization options
Detailed Reporting: Comprehensive validation reports with mismatch information
Prerequisites¶
Before installing the Snowflake Data Validation CLI, ensure you have the following prerequisites installed:
System Requirements¶
Python: Version 3.8 or higher
pip: Latest version recommended
Operating System: Linux, macOS, or Windows
ODBC Drivers¶
The CLI requires appropriate ODBC drivers to be installed on your system for connecting to source databases. Install the ODBC driver that matches your source database dialect:
SQL Server ODBC Driver¶
For SQL Server as a source database, you need the Microsoft ODBC Driver for SQL Server.
Recommended Version: ODBC Driver 17 or 18 for SQL Server
Installation Instructions:
Linux:
macOS:
Windows: Download and install from Microsoft’s official download page
Verification:
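A quick way to confirm the driver is registered on Linux/macOS is to query unixODBC; the driver name shown in the comment is illustrative and depends on the version you installed:

```bash
# List the ODBC drivers registered with unixODBC; look for an entry such as
# "ODBC Driver 18 for SQL Server"
odbcinst -q -d
```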
Documentation: Microsoft ODBC Driver for SQL Server
Teradata ODBC Driver¶
For Teradata as a source database, you need the Teradata ODBC Driver.
Recommended Version: Teradata ODBC Driver 17.20 or higher
Installation Instructions:
Download the Teradata Tools and Utilities (TTU) package from Teradata Downloads
Select your operating system and download the appropriate installer
Run the installer and select “ODBC Driver” during installation
Configure the driver according to Teradata’s documentation
Note: You may need to create a Teradata account to access the download page.
Configuration: After installation, you may need to configure the ODBC driver:
Linux/macOS: Edit /etc/odbc.ini and /etc/odbcinst.ini
Windows: Use the ODBC Data Source Administrator
Documentation: Teradata ODBC Driver Documentation
Amazon Redshift ODBC Driver¶
For Amazon Redshift as a source database, you need the Amazon Redshift ODBC Driver or a PostgreSQL ODBC Driver (since Redshift is PostgreSQL-compatible).
Option 1: Amazon Redshift ODBC Driver (Recommended)
Recommended Version: Amazon Redshift ODBC Driver 2.x
Installation Instructions:
Download from Amazon Redshift ODBC Driver Download
Choose your operating system and architecture
Follow the installation wizard
Option 2: PostgreSQL ODBC Driver (Alternative)
Linux:
macOS:
Windows: Download from PostgreSQL ODBC Driver
Verification:
Documentation: Amazon Redshift ODBC Driver Documentation
Additional Tools (Optional)¶
unixODBC (Linux/macOS): Required for ODBC driver management
Network Access¶
Ensure your environment has network access to:
Source database (SQL Server, Teradata, or Redshift)
Snowflake account
Package repositories (for pip installation)
Installation¶
Base Installation¶
Source-Specific Installation¶
Install with the appropriate database driver for your source system:
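A minimal sketch, assuming the package publishes optional extras named after each dialect (verify the exact extra names in the package documentation):

```bash
# Base package
pip install snowflake-data-validation

# With a source-specific driver extra (extra names are illustrative)
pip install "snowflake-data-validation[sqlserver]"
pip install "snowflake-data-validation[teradata]"
pip install "snowflake-data-validation[redshift]"
```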
Post-Installation Verification¶
After installation, verify the CLI is correctly installed:
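For example, confirm the CLI is on your PATH and responds; the exact version flag spelling may vary by release:

```bash
# Report the installed version (flag name illustrative)
snowflake-data-validation --version

# The short alias should respond as well
sdv --help
```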
Quick Start¶
1. Generate a Configuration Template¶
2. Auto-Generate Configuration from Connection¶
3. Run Validation¶
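A minimal end-to-end sketch using the command names documented in the CLI Commands section below; the --data-validation-config-file option name is illustrative and may differ in your installed version:

```bash
# 1. Generate a configuration template
snowflake-data-validation sqlserver get-configuration-files

# 2. Or build a configuration interactively from a live connection
snowflake-data-validation sqlserver auto-generated-configuration-file

# 3. Run validation with the edited configuration
snowflake-data-validation sqlserver run-validation \
  --data-validation-config-file ./validation_config.yaml
```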
Best Practices & Guidance¶
This section provides strategic guidance on how to approach data validation effectively, minimize resource consumption, and identify issues early.
Incremental Validation Approach¶
Note
Always start small and scale up incrementally. Running full validation on large datasets immediately can:
Consume significant compute resources on both source and target systems
Take hours or days to complete
Make troubleshooting difficult if issues are found
Impact production systems if run during business hours
Recommended Validation Strategy¶
Follow this proven approach to ensure efficient and effective validation:
Phase 1: Start with a Sample Dataset¶
Goal: Verify configuration and establish baseline
Approach:
Test with 1-2 small tables first (< 100,000 rows)
Choose tables with diverse data types to validate type mapping
Verify connectivity and authentication work correctly
Confirm output format meets your needs
Example Configuration:
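A minimal Phase 1 sketch using the fields documented in the Configuration File Reference; top-level key names and table names are illustrative, so confirm them against a generated template:

```yaml
source_platform: SqlServer
target_platform: Snowflake
output_directory_path: /home/user/validation_output

validation_configuration:
  schema_validation: true
  metrics_validation: true
  row_validation: false

tables:
  - fully_qualified_name: SALES_DB.dbo.CURRENCY   # small lookup table, < 100K rows
    use_column_selection_as_exclude_list: true
    column_selection_list: []                     # empty list = validate all columns
```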
What to Verify:
✅ Connection to both source and target successful
✅ Schema validation passes
✅ Row count matches
✅ Data types mapped correctly
✅ Validation report generated successfully
Understanding where_clause vs target_where_clause:
The tool provides two filtering options:
where_clause: Applied only to the source table
target_where_clause: Applied only to the target table (Snowflake)
Common Usage Patterns:
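The most common pattern is to keep both clauses aligned so the same slice of data is compared on each side (column name hypothetical):

```yaml
where_clause: "CREATED_AT >= '2024-01-01'"
target_where_clause: "CREATED_AT >= '2024-01-01'"
```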
Phase 2: Verify Small Subsets of Production Tables¶
Goal: Test against actual production data patterns with limited scope
Approach:
Select a subset of rows from production tables (10,000 - 100,000 rows)
Use where_clause and target_where_clause to restrict data
Focus on recent data or specific partitions
Validate critical business columns first
Example Configuration:
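A Phase 2 sketch restricting the comparison to an ID range and to the critical business columns first; table and column names are hypothetical:

```yaml
tables:
  - fully_qualified_name: SALES_DB.dbo.ORDERS
    use_column_selection_as_exclude_list: false   # treat the list as an include list
    column_selection_list:
      - ORDER_ID
      - ORDER_DATE
      - TOTAL_AMOUNT
    where_clause: "ORDER_ID BETWEEN 1 AND 100000"
    target_where_clause: "ORDER_ID BETWEEN 1 AND 100000"
```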
Key Points:
🎯 Use date ranges or ID ranges to limit scope
🎯 Test different data patterns (recent vs. historical, high-volume dates)
🎯 Validate critical business columns before validating all columns
Phase 3: Partition-Based Validation for Large Tables¶
Goal: Validate large tables efficiently using partitioning strategy
⚠️ IMPORTANT: For tables with millions or billions of rows, validating the entire table at once is:
Resource-intensive (high compute costs)
Time-consuming (can take hours/days)
Risky (harder to identify specific issue patterns)
Recommended Approach for Large Tables:
Strategy 1: Date-Based Partitioning
Validate data in chunks based on date ranges:
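A sketch of one date-range chunk; validate one month (or another window) at a time and advance the range between runs (table and column names hypothetical):

```yaml
tables:
  - fully_qualified_name: SALES_DB.dbo.ORDERS
    use_column_selection_as_exclude_list: true
    column_selection_list: []
    where_clause: "ORDER_DATE >= '2024-01-01' AND ORDER_DATE < '2024-02-01'"
    target_where_clause: "ORDER_DATE >= '2024-01-01' AND ORDER_DATE < '2024-02-01'"
```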
Strategy 2: Modulo-Based Sampling
Use modulo arithmetic to sample evenly distributed rows:
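A sketch sampling roughly 10% of rows via the ID column; the modulo operator shown is SQL Server syntax (Teradata uses MOD), and names are hypothetical:

```yaml
tables:
  - fully_qualified_name: SALES_DB.dbo.ORDERS
    use_column_selection_as_exclude_list: true
    column_selection_list: []
    where_clause: "ORDER_ID % 10 = 0"          # ~10% of rows, evenly distributed
    target_where_clause: "ORDER_ID % 10 = 0"
```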
Strategy 3: Statistical Sampling
For very large tables (> 100M rows), validate representative samples:
Strategy 4: Progressive Partition Validation
Validate multiple partitions progressively:
Phase 4: Full Validation¶
Goal: Complete comprehensive validation after successful subset testing
When to Run Full Validation:
✅ Sample validations pass successfully
✅ Subset validations show no data quality issues
✅ You have allocated sufficient time and compute resources
✅ Preferably during off-peak hours
Considerations:
Use asynchronous validation for large datasets to avoid timeouts
Consider using script generation mode to run validations in parallel
Monitor resource consumption on both source and target systems
Plan for validation to run during maintenance windows
Example:
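For example, a full run against the complete configuration during a maintenance window; the option name is illustrative:

```bash
snowflake-data-validation sqlserver run-validation \
  --data-validation-config-file ./config/full_validation.yaml
```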
Performance Optimization Tips¶
1. Use Appropriate Validation Types¶
Not all tables need all validation types:
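For example, a global validation_configuration sketch that keeps the inexpensive levels on and reserves row validation for critical tables only (section key name per the Configuration File Reference below):

```yaml
validation_configuration:
  schema_validation: true    # inexpensive, run everywhere
  metrics_validation: true   # moderate cost
  row_validation: false      # enable selectively for critical tables
```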
2. Prioritize Critical Columns¶
For large tables, validate critical business columns first:
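A sketch that limits validation to a handful of critical columns via an include list (column names hypothetical):

```yaml
tables:
  - fully_qualified_name: SALES_DB.dbo.ORDERS
    use_column_selection_as_exclude_list: false   # interpret the list as "include only"
    column_selection_list:
      - ORDER_ID
      - CUSTOMER_ID
      - TOTAL_AMOUNT
```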
3. Leverage Partitioning Metadata¶
If your tables are partitioned, validate partition by partition:
4. Use Asynchronous Validation for Production¶
For production environments, use asynchronous validation:
Cost and Resource Management¶
Estimate Query Costs¶
Before running validation on large tables:
Estimate row counts:
Estimate data volume:
Start with small partitions: If estimated size > 10 GB, break into smaller chunks
Compute Warehouse Sizing (Snowflake)¶
Small tables (< 1M rows): XS or S warehouse
Medium tables (1M - 10M rows): S or M warehouse
Large tables (> 10M rows): M or L warehouse, use partitioning
Very large tables (> 100M rows): L or XL warehouse, mandatory partitioning
Common Pitfalls to Avoid¶
| ❌ Don't Do This | ✅ Do This Instead |
|---|---|
| Validate entire 1B row table at once | Validate in partitions of 10M-100M rows |
| Run validation during business hours on production | Schedule during off-peak hours or use read replicas |
| Skip sample testing and go straight to full validation | Always validate samples first |
| Use same configuration for all table sizes | Tailor validation strategy to table size |
| Validate all columns for all tables | Prioritize critical columns, especially for large tables |
| Ignore resource consumption | Monitor and set appropriate compute resources |
Validation Checklist¶
Use this checklist to ensure you’re following best practices:
Before Starting:
[ ] ODBC drivers installed and tested
[ ] Connectivity to source and target verified
[ ] Configuration template generated
[ ] Test credentials have appropriate read permissions
Initial Testing (Phase 1):
[ ] Selected 1-2 small tables for initial test
[ ] Configuration file created and reviewed
[ ] Sample validation executed successfully
[ ] Validation report reviewed and understood
Subset Validation (Phase 2):
[ ] Identified subset of production data to validate
[ ] Used where_clause and target_where_clause to restrict rows
[ ] Validated 10,000 - 100,000 rows successfully
[ ] Reviewed results for any data quality issues
Large Table Strategy (Phase 3):
[ ] Identified tables > 10M rows
[ ] Chosen partitioning strategy (date, ID range, modulo)
[ ] Estimated compute costs for validation
[ ] Tested validation on 1-2 partitions first
[ ] Documented partition validation schedule
Production Validation (Phase 4):
[ ] All subset validations passed
[ ] Resource allocation planned (compute, time)
[ ] Validation scheduled during maintenance window
[ ] Using asynchronous or script generation mode
[ ] Monitoring plan in place
Example: Complete Validation Strategy¶
Here’s a complete example of validating a large e-commerce database:
Day 1: Initial Setup and Small Tables
Day 2: Subset of Medium Tables
Day 3: Partition Strategy for Large Tables
Day 4-5: Progressive Partition Validation
Day 6: Full Validation (Off-Peak)
CLI Commands¶
Command Structure¶
All commands follow this consistent structure:
Where:
<source_dialect> is one of: sqlserver, teradata, redshift, snowflake
<command> is one of:
run-validation - Run synchronous validation
run-async-validation - Run asynchronous validation
generate-validation-scripts - Generate validation scripts
get-configuration-files - Get configuration templates
auto-generated-configuration-file - Interactive config generation
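For example (the --help flag is a common CLI convention and is assumed here):

```bash
snowflake-data-validation <source_dialect> <command> [OPTIONS]

# Show the options available for a SQL Server validation run
snowflake-data-validation sqlserver run-validation --help
```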
Global Options¶
These options can be used with the CLI without specifying a source dialect or command:
Check Version¶
Display the current installed version of the Snowflake Data Validation CLI:
Output Example:
Use Cases:
Verify successful installation
Check which version is currently installed
Confirm version before reporting issues
Ensure compatibility with documentation
Help¶
Display general help information:
Dialect-Specific Command References¶
For detailed command documentation specific to your source database, see the following pages:
SQL Server Commands Reference - Complete command reference for SQL Server migrations
Teradata Commands Reference - Complete command reference for Teradata migrations
Amazon Redshift Commands Reference - Complete command reference for Redshift migrations
Snowflake Commands Reference - Complete command reference for Snowflake-to-Snowflake migrations
Each page provides:
Detailed syntax for all commands
Complete option descriptions with examples
Connection configuration specifics
Dialect-specific examples
Troubleshooting tips for that platform
Best practices for that database type
SQL Server Commands¶
For complete SQL Server command documentation, see SQL Server Commands Reference.
Quick Links:
Common Commands¶
Run Validation:
Generate Scripts:
Get Templates:
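Sketches for the three commands above; the configuration-file option name is illustrative and may differ in your installed version:

```bash
# Run validation
snowflake-data-validation sqlserver run-validation \
  --data-validation-config-file ./validation_config.yaml

# Generate validation scripts for manual execution
snowflake-data-validation sqlserver generate-validation-scripts \
  --data-validation-config-file ./validation_config.yaml

# Download configuration templates
snowflake-data-validation sqlserver get-configuration-files
```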
For complete documentation, see SQL Server Commands Reference.
Teradata Commands¶
For complete Teradata command documentation, see Teradata Commands Reference.
Quick Links:
Common Commands¶
Run Validation:
Generate Scripts:
Get Templates:
For complete documentation, see Teradata Commands Reference.
Amazon Redshift Commands¶
For complete Amazon Redshift command documentation, see Redshift Commands Reference.
Quick Links:
Common Commands¶
Run Validation:
Generate Scripts:
Get Templates:
For complete documentation, see Redshift Commands Reference.
Snowflake Commands¶
For complete Snowflake-to-Snowflake command documentation, see Snowflake Commands Reference.
Quick Links:
Common Commands¶
Run Validation:
Generate Scripts:
Get Templates:
For complete documentation, see Snowflake Commands Reference.
Configuration File Reference¶
Global Configuration¶
The global configuration section defines the overall behavior of the validation process.
Platform Configuration Options¶
source_platform (required)
Type: String
Valid Values: SqlServer, Teradata, Redshift, Snowflake
Description: The source database platform for validation
Example:
source_platform: SqlServer
target_platform (required)
Type: String
Valid Values: Snowflake
Description: The target database platform (currently only Snowflake is supported)
Example:
target_platform: Snowflake
output_directory_path (required)
Type: String (path)
Description: Directory where validation results, logs, and reports will be saved
Example:
output_directory_path: /home/user/validation_output
max_threads (optional)
Type: String or Integer
Valid Values: "auto" or positive integer (1-32)
Default: "auto"
Description: Controls parallelization for validation operations
"auto": Automatically detects optimal thread count based on CPU cores
Integer value: Specifies exact number of threads to use
Examples:
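```yaml
max_threads: "auto"   # detect thread count from CPU cores
# or
max_threads: 8        # fixed number of threads
```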
target_database (required for Teradata only)
Type: String
Description: Target database name in Snowflake for Teradata validations
Example:
target_database: PROD_DB
source_validation_files_path (optional)
Type: String (path)
Description: Path to the directory containing the source validation files.
Example:
source_validation_files_path: /path/to/source_validation_file/directory
target_validation_files_path (optional)
Type: String (path)
Description: Path to the directory containing the target validation files.
Example:
target_validation_files_path: /path/to/target_validation_file/directory
Connection Configuration¶
Define how to connect to source and target databases.
Source Connection Configuration¶
SQL Server Source Connection¶
Connection Fields:
mode (required)
Type: String
Valid Values: credentials
Description: Connection mode for SQL Server
host (required)
Type: String
Description: SQL Server hostname or IP address
Example: "sqlserver.company.com" or "192.168.1.100"
port (required)
Type: Integer
Default: 1433
Description: SQL Server port number
username (required)
Type: String
Description: SQL Server authentication username
password (required)
Type: String
Description: SQL Server authentication password
Security Note: Consider using environment variables or secret management
database (required)
Type: String
Description: SQL Server database name
trust_server_certificate (optional)
Type: String
Valid Values: "yes", "no"
Default: "no"
Description: Whether to trust the server certificate for SSL/TLS connections
encrypt (optional)
Type: String
Valid Values: "yes", "no", "optional"
Default: "yes"
Description: Connection encryption setting
Teradata Source Connection¶
Connection Fields:
mode (required)
Type: String
Valid Values: credentials
host (required)
Type: String
Description: Teradata hostname or IP address
username (required)
Type: String
Description: Teradata authentication username
password (required)
Type: String
Description: Teradata authentication password
database (required)
Type: String
Description: Teradata database name
Amazon Redshift Source Connection¶
Connection Fields:
mode (required)
Type: String
Valid Values: credentials
host (required)
Type: String
Description: Redshift cluster endpoint
port (required)
Type: Integer
Default: 5439
Description: Redshift port number
username (required)
Type: String
Description: Redshift authentication username
password (required)
Type: String
Description: Redshift authentication password
database (required)
Type: String
Description: Redshift database name
Target Connection (Snowflake)¶
Snowflake connections support three modes: name, default, and credentials (IPC only and SnowConvert exclusive).
Option 1: Named Connection¶
Use a pre-configured Snowflake connection saved in your Snowflake connections file.
Fields:
mode (required): Must be "name"
name (required): Name of the saved Snowflake connection
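A sketch of a named target connection; the top-level section key (target_connection) is an assumption to confirm against your generated template:

```yaml
target_connection:
  mode: name
  name: my_snowflake_connection   # entry in your Snowflake connections file
```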
Option 2: Default Connection¶
Use the default Snowflake connection from your environment.
Fields:
mode (required): Must be "default"
Option 3: Credentials Mode (IPC Only)¶
Note: The credentials mode is only available when using IPC (Inter-Process Communication) commands directly via CLI parameters, not in YAML configuration files. This mode is exclusive to the SnowConvert UI.
Validation Configuration¶
Controls which validation levels are executed.
Validation Options:
schema_validation (optional)
Type: Boolean
Default: true
Description: Validates table and column schema consistency
Checks:
Column names match between source and target
Data types are compatible
Column nullability settings
Primary key definitions
metrics_validation (optional)
Type: Boolean
Default: true
Description: Validates statistical metrics for each column
Checks:
Row counts
Distinct value counts
Null value counts
Min/max values
Average, sum, standard deviation (for numeric columns)
row_validation (optional)
Type: Boolean
Default: false
Description: Validates data at the row level using hash-based comparison
Note: Requires index columns for row identification. If not specified in the configuration, the tool attempts to auto-detect them from primary keys.
Warning: This is the most resource-intensive validation level
Checks:
MD5 hash comparison of row chunks
Identifies specific rows with differences
max_failed_rows_number (optional)
Type: Integer
Default: 100
Minimum: 1
Description: Maximum number of failed rows to report per table
Example:
max_failed_rows_number: 250
exclude_metrics (optional)
Type: Boolean
Default: false
Description: When true, excludes certain statistical metrics (avg, sum, stddev, variance) from validation
Use Case: Useful for large tables where statistical calculations might cause an overflow.
apply_metric_column_modifier (optional)
Type: Boolean
Default: true
Description: Applies column modifiers defined in metric templates
Use Case: Advanced users with custom metric calculations
custom_templates_path (optional)
Type: String (path)
Description: Path to directory containing custom Jinja2 query templates
Example:
custom_templates_path: /opt/validation/custom_templates
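A sketch combining the options above; the section key name (validation_configuration) should be confirmed against your generated template:

```yaml
validation_configuration:
  schema_validation: true
  metrics_validation: true
  row_validation: false
  max_failed_rows_number: 100
  exclude_metrics: false
  apply_metric_column_modifier: true
```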
Table Configuration¶
Defines which tables to validate and how to validate them.
Table Configuration Fields:
fully_qualified_name (required)
Type: String
Format: database.schema.table or schema.table
Description: Full table identifier in the source database
Examples:
target_database (optional)
Type: String
Default: Source database name from fully_qualified_name field
Description: Target database name if different from source database name
Example:
target_schema (optional)
Type: String
Default: Source schema name from fully_qualified_name field
Description: Target schema name if different from source schema name
Example:
target_name (optional)
Type: String
Default: Source table name from fully_qualified_name field
Description: Target table name if different from source table name
Example:
use_column_selection_as_exclude_list (required)
Type: Boolean
Description: Determines how column_selection_list is interpreted
false: Include only the specified columns
true: Exclude the specified columns (include all others)
Examples:
column_selection_list (required)
Type: List of strings
Description: List of column names to include or exclude
Note: Use an empty list [] to include all columns
index_column_list (optional)
Type: List of strings
Default: Auto-detected from primary keys
Description: Columns to use as unique identifiers for row validation
Use Case: Specify when the table doesn't have a primary key or you want to use different columns
Example:
target_index_column_list (optional)
Type: List of strings
Description: Index columns in the target table (if different from source)
Note: Automatically derived from column_mappings if not specified
where_clause (optional)
Type: String
Default: "" (empty, no filter)
Description: SQL WHERE clause to filter source data (without "WHERE" keyword)
Examples:
target_where_clause (optional)
Type: String
Default: "" (empty, no filter)
Description: SQL WHERE clause to filter target data
Best Practice: Should match where_clause to ensure consistent comparison
Example:
is_case_sensitive (optional)
Type: Boolean
Default: false
Description: Whether column name matching should be case-sensitive
chunk_number (optional)
Type: Integer
Default: 0 (no chunking)
Minimum: 0
Description: Number of chunks to split row validation into
Use Case: Large tables benefit from chunking for better performance
Example:
max_failed_rows_number (optional)
Type: Integer
Minimum: 1
Description: Maximum failed rows to report for this specific table
Note: Overrides the global max_failed_rows_number setting
column_mappings (optional)
Type: Dictionary (key-value pairs)
Description: Maps source column names to target column names when they differ
Format: source_column_name: target_column_name
Example:
exclude_metrics (optional)
Type: Boolean
Description: Exclude metrics validation for this specific table
Note: Overrides the global exclude_metrics setting
apply_metric_column_modifier (optional)
Type: Boolean
Description: Apply column modifiers for this specific table
Note: Overrides the global apply_metric_column_modifier setting
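A sketch of a single table entry combining the fields above; the top-level tables key, table names, and column names are illustrative:

```yaml
tables:
  - fully_qualified_name: SALES_DB.dbo.ORDERS
    target_database: PROD_DB
    target_schema: SALES
    target_name: ORDERS
    use_column_selection_as_exclude_list: true
    column_selection_list: []                 # validate all columns
    index_column_list:
      - ORDER_ID
    where_clause: "ORDER_DATE >= '2024-01-01'"
    target_where_clause: "ORDER_DATE >= '2024-01-01'"
    chunk_number: 4
    max_failed_rows_number: 250
    column_mappings:
      OLD_STATUS: STATUS                      # source column -> target column
```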
View Configuration¶
Views are validated similarly to tables but are configured in a separate views: section. View validation creates temporary tables internally to materialize view schema for comparison between source and target systems.
View Configuration Fields:
View configuration uses the same fields as table configuration, with the following key points:
fully_qualified_name (required)
Type: String
Format: database.schema.view or schema.view
Description: Full view identifier in the source database
Examples:
target_database (optional)
Type: String
Default: Source database name from fully_qualified_name field
Description: Target database name if different from source database name
target_schema (optional)
Type: String
Default: Source schema name from fully_qualified_name field
Description: Target schema name if different from source schema name
target_name (optional)
Type: String
Default: Source view name from fully_qualified_name field
Description: Target view name if different from source view name
use_column_selection_as_exclude_list (required)
Type: Boolean
Description: Determines how column_selection_list is interpreted
false: Include only the specified columns
true: Exclude the specified columns (include all others)
column_selection_list (required)
Type: List of strings
Description: List of column names to include or exclude
Note: Use an empty list [] to include all columns
index_column_list (optional)
Type: List of strings
Description: Columns to use as unique identifiers for row validation
Use Case: Required when row validation is enabled
where_clause (optional)
Type: String
Default: "" (empty, no filter)
Description: SQL WHERE clause to filter source data (without "WHERE" keyword)
target_where_clause (optional)
Type: String
Default: "" (empty, no filter)
Description: SQL WHERE clause to filter target data
column_mappings (optional)
Type: Dictionary (key-value pairs)
Description: Maps source column names to target column names when they differ
chunk_number (optional)
Type: Integer
Default: 0 (no chunking)
Description: Number of chunks to split row validation into
max_failed_rows_number (optional)
Type: Integer
Description: Maximum failed rows to report for this specific view
is_case_sensitive (optional)
Type: Boolean
Default: false
Description: Whether column name matching should be case-sensitive
How View Validation Works:
The CLI creates temporary tables from view definitions in both source and target systems
Data is extracted from the views into these temporary tables
Validation is performed on the temporary tables
Temporary tables are cleaned up after validation completes
Best Practices for View Validation:
Use filtering for large views: Apply where_clause and target_where_clause to limit data volume
Test with small subsets first: Start with filtered validation before full view validation
Consider view complexity: Complex views with many joins may take longer to validate
Monitor resource usage: Views that materialize large datasets consume significant memory
Comparison Configuration¶
Controls comparison behavior and tolerance levels.
Comparison Options:
tolerance (optional)
Type: Float
Default: 0.001 (0.1%)
Description: Acceptable tolerance for statistical metric differences
Use Case: Allows for small differences due to rounding or data type conversions
Examples:
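```yaml
comparison_configuration:
  tolerance: 0.001   # the documented default (0.1%); use e.g. 0.01 for a looser 1% threshold
```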
Logging Configuration¶
Controls logging behavior for validation operations.
Logging Options:
level (optional)
Type: String
Valid Values: DEBUG, INFO, WARNING, ERROR, CRITICAL
Default: INFO
Description: Default logging level for all loggers
Level Descriptions:
DEBUG: Detailed diagnostic information
INFO: General informational messages
WARNING: Warning messages for potentially problematic situations
ERROR: Error messages for serious issues
CRITICAL: Critical errors that may cause application failure
console_level (optional)
Type: String
Valid Values: DEBUG, INFO, WARNING, ERROR, CRITICAL
Default: Same as level
Description: Logging level for console output
Use Case: Set to WARNING or ERROR to reduce console noise
file_level (optional)
Type: String
Valid Values: DEBUG, INFO, WARNING, ERROR, CRITICAL
Default: Same as level
Description: Logging level for file output
Use Case: Set to DEBUG for detailed file logs while keeping console clean
Example Configurations:
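A sketch using the fields above; the section key name (logging) is an assumption to verify against your generated template:

```yaml
logging:
  level: INFO
  console_level: WARNING   # quieter console output
  file_level: DEBUG        # detailed logs written to file
```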
Note: CLI --log-level parameter overrides configuration file settings.
Database and Schema Mappings¶
Map source database/schema names to target names when they differ.
Mapping Options:
database_mappings (optional)
Type: Dictionary
Description: Maps source database names to target database names
Use Case: When database names differ between source and Snowflake
Example:
schema_mappings (optional)
Type: Dictionary
Description: Maps source schema names to target schema names
Use Case: When schema names differ between source and Snowflake
Example:
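```yaml
database_mappings:
  SALES_DB: PROD_DB    # source database -> Snowflake database
schema_mappings:
  dbo: SALES           # source schema -> Snowflake schema
```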
Complete Configuration Examples¶
Example 1: SQL Server to Snowflake - Basic Validation¶
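A complete sketch assembled from the field reference above; section key names, host, and table names are illustrative, so regenerate a template to confirm the exact structure:

```yaml
source_platform: SqlServer
target_platform: Snowflake
output_directory_path: /home/user/validation_output

source_connection:
  mode: credentials
  host: sqlserver.company.com
  port: 1433
  username: validation_user
  password: ${SQLSERVER_PASSWORD}
  database: SALES_DB

target_connection:
  mode: name
  name: my_snowflake_connection

validation_configuration:
  schema_validation: true
  metrics_validation: true
  row_validation: false

tables:
  - fully_qualified_name: SALES_DB.dbo.CUSTOMERS
    use_column_selection_as_exclude_list: true
    column_selection_list: []
```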
Example 2: Teradata to Snowflake - Comprehensive Validation¶
Example 3: Redshift to Snowflake - Advanced Configuration¶
Example 4: Minimal Configuration¶
Example 5: View Validation Configuration¶
Example 6: Combined Tables and Views Validation¶
Advanced Usage¶
Working with Large Tables¶
For large tables, consider these optimization strategies:
Enable Chunking¶
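For example, splitting row validation into chunks for a large table (names hypothetical):

```yaml
tables:
  - fully_qualified_name: SALES_DB.dbo.ORDERS
    use_column_selection_as_exclude_list: true
    column_selection_list: []
    chunk_number: 10     # split row validation into 10 chunks
```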
Increase Thread Count¶
Filter Data¶
Selective Column Validation¶
Using Custom Query Templates¶
For advanced users, you can provide custom Jinja2 templates:
Custom template directory structure:
Asynchronous Validation Workflow¶
For environments with restricted database access or long-running validations:
Step 1: Generate Scripts¶
This generates SQL scripts in the output directory.
Step 2: Execute Scripts Manually¶
Execute the generated scripts on source and target databases, saving results to CSV files.
Step 3: Run Async Validation¶
This compares the pre-generated metadata files.
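Putting the three steps together; the configuration-file option name is illustrative:

```bash
# Step 1: generate the validation SQL scripts
snowflake-data-validation teradata generate-validation-scripts \
  --data-validation-config-file ./validation_config.yaml

# Step 2: run the generated scripts manually on source and target,
#         saving the results to the configured validation files paths

# Step 3: compare the pre-generated files
snowflake-data-validation teradata run-async-validation \
  --data-validation-config-file ./validation_config.yaml
```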
CI/CD Integration¶
Integrate validation into your deployment pipeline:
Handling Multiple Environments¶
Create separate configuration files for each environment:
Run validation for specific environment:
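For example, keeping one configuration file per environment and selecting it at run time (option name illustrative):

```bash
# Development
snowflake-data-validation sqlserver run-validation \
  --data-validation-config-file ./config/validation_dev.yaml

# Production
snowflake-data-validation sqlserver run-validation \
  --data-validation-config-file ./config/validation_prod.yaml
```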
Validation Reports¶
Overview¶
The Snowflake Data Validation tool generates comprehensive CSV reports that document the results of data migration validations between source and target databases. These reports help identify discrepancies in schema, metrics, and row-level data.
Report Types¶
The validation tool generates different types of reports based on the validation levels configured:
1. Main Validation Report (validation_report.csv)¶
This consolidated report contains results from schema and metrics validations for all tables.
2. Row Validation Reports (per table)¶
Separate reports generated for each table when row validation is enabled, containing detailed row-level comparison results.
Main Validation Report Structure¶
The main validation report contains the following columns:
| Column Name | Description |
|---|---|
| VALIDATION_TYPE | Type of validation performed (schema or metrics validation) |
| TABLE | Fully qualified name of the table being validated (e.g., database.schema.table) |
| COLUMN_VALIDATED | Name of the column being validated (or table-level attribute for schema validation) |
| EVALUATION_CRITERIA | The specific property being compared (for example, a column attribute or metric name) |
| SOURCE_VALUE | The value from the source database |
| SNOWFLAKE_VALUE | The value from the target (Snowflake) database |
| STATUS | Validation result status (see Status Values section below) |
| COMMENTS | Additional context or explanation for the validation result |
Validation Types Explained¶
Schema Validation¶
Compares structural metadata between source and target tables:
Column existence: Ensures columns present in source exist in target
Data types: Validates column data types match (with configurable mappings)
Nullability: Checks if NULL constraints match between source and target
Primary keys: Verifies primary key definitions
Column precision/scale: Validates numeric precision and scale values
Character length: Compares VARCHAR/CHAR column lengths
Example Schema Validation Row:
Metrics Validation¶
Compares statistical metrics calculated on column data:
Row count: Total number of rows in the table
min: Minimum value in numeric/date columns
max: Maximum value in numeric/date columns
count: Count of non-null values
count_distinct: Number of distinct values
avg, sum, stddev, variance: Statistical measures (can be excluded via configuration)
Example Metrics Validation Row:
Row Validation¶
Generates separate per-table reports comparing actual row data using MD5 checksums to detect differences.
Row Validation Report Structure¶
Row validation reports have a different structure focused on identifying specific rows with differences:
| Column Name | Description |
|---|---|
| ROW_NUMBER | Sequential row number in the report |
| TABLE_NAME | Fully qualified table name |
| RESULT | Outcome of the row comparison (see Result Values below) |
| [INDEX_COLUMNS]_SOURCE | Primary key/index column values from source |
| [INDEX_COLUMNS]_TARGET | Primary key/index column values from target |
| SOURCE_QUERY | SQL query to retrieve the row from source database |
| TARGET_QUERY | SQL query to retrieve the row from target database |
Status Values¶
The STATUS column in the main validation report can have the following values:
| Status | Meaning |
|---|---|
| SUCCESS | Validation passed - values match between source and target |
| FAILURE | Validation failed - values differ between source and target |
| WARNING | Potential issue detected that may require attention |
| NOT_FOUND_SOURCE | Element exists in target but not in source |
| NOT_FOUND_TARGET | Element exists in source but not in target |
Result Values (Row Validation)¶
The RESULT column in row validation reports can have the following values:
| Result | Meaning |
|---|---|
| SUCCESS | Row data matches between source and target |
| FAILURE | Row data differs between source and target (MD5 checksum mismatch) |
| NOT_FOUND_SOURCE | Row exists in target but not in source |
| NOT_FOUND_TARGET | Row exists in source but not in target |
Understanding Validation Results¶
Interpreting Schema Validation Results¶
Success Scenario:
All columns exist in both source and target
Data types match (considering configured type mappings)
Nullability constraints are consistent
All structural attributes align
Common Failure Scenarios:
Data Type Mismatch
Source: VARCHAR(50), Target: VARCHAR(100)
Status: May be SUCCESS if within tolerance, or FAILURE if strict matching is required
Missing Column
Source has column phone_number, target does not
Status: NOT_FOUND_TARGET
Nullability Difference
Source: NOT NULL, Target: NULL
Status: FAILURE
Interpreting Metrics Validation Results¶
Success Scenario:
Row counts match exactly
Statistical metrics are within configured tolerance (default: 0.1%)
All calculated metrics align between source and target
Common Failure Scenarios:
Row Count Mismatch
Source: 10,000 rows, Target: 9,998 rows
Status: FAILURE
Action: Investigate missing rows
Min/Max Value Differences
Source max date: 2024-12-31, Target max date: 2024-12-30
Status: FAILURE
Action: Check for incomplete data migration
Statistical Variance
Source count_distinct: 1,000, Target count_distinct: 995
Status: FAILURE (if beyond tolerance)
Action: Investigate potential duplicates or missing values
Interpreting Row Validation Results¶
Success Scenario:
All rows in source have matching rows in target (by MD5 checksum)
No orphaned rows in either database
Primary key values align correctly
Common Failure Scenarios:
Row Content Mismatch
Same primary key, different column values
Result: FAILURE
Action: Use provided SQL queries to investigate specific differences
Missing Rows
Row exists in source but not in target
Result: NOT_FOUND_TARGET
Action: Check migration completeness
Extra Rows
Row exists in target but not in source
Result: NOT_FOUND_SOURCE
Action: Investigate unexpected data in target
Using the Reports¶
Quick Assessment¶
Filter by STATUS column: Focus on FAILURE, WARNING, NOT_FOUND_SOURCE, and NOT_FOUND_TARGET rows
Group by VALIDATION_TYPE: Assess schema issues separately from metrics issues
Group by TABLE: Identify which tables have the most issues
Investigating Failures¶
For Schema Validation:
Review the EVALUATION_CRITERIA to understand what attribute failed
Compare SOURCE_VALUE vs SNOWFLAKE_VALUE
Check if differences are acceptable (e.g., VARCHAR size increase)
Update type mappings or schema definitions if needed
For Metrics Validation:
Review the metric that failed (e.g., row_count, max, min)
Calculate the difference magnitude
Determine if within acceptable business tolerance
Use the detailed queries to investigate source of discrepancy
For Row Validation:
Open the table-specific row validation report
Identify rows with FAILURE status
Use the provided SOURCE_QUERY and TARGET_QUERY to retrieve actual row data
Compare column-by-column to identify specific field differences
Investigate why values differ (data type conversion, truncation, transformation)
Configuration Options Affecting Reports¶
Tolerance Settings¶
The comparison_configuration.tolerance setting affects metrics validation:
Values within tolerance are marked as SUCCESS
Values beyond tolerance are marked as FAILURE
Validation Levels¶
Control which validations run and therefore which reports are generated:
Excluded Metrics¶
Exclude specific metrics from validation:
Maximum Failed Rows¶
Limit the number of failed rows reported in row validation:
Report File Locations¶
Reports are generated in the configured output directory:
File naming convention:
Timestamp format: YYYY-MM-DD_HH-MM-SS
Row validation reports include table name and unique ID to prevent collisions
Best Practices¶
Start with Schema Validation: Ensure structural alignment before validating data
Use Appropriate Tolerance: Set realistic tolerance thresholds for metrics validation
Selective Row Validation: Enable row validation only for critical tables (resource intensive)
Iterative Approach: Fix schema issues first, then metrics, then row-level differences
Document Acceptable Differences: Some type conversions or value transformations may be expected
Automate Report Analysis: Use scripts to parse CSV reports and flag critical issues
Preserve Reports: Archive validation reports for audit trails and compliance
Troubleshooting Report Issues¶
All Validations Showing FAILURE¶
Possible Causes:
Incorrect database/schema mappings in configuration
Type mapping file not loaded correctly
Connection to wrong target database
Solution: Verify database_mappings, schema_mappings, and connection settings
Row Validation Shows All NOT_FOUND_TARGET¶
Possible Causes:
Target table empty or not migrated yet
Incorrect target table name
Primary key/index columns mismatch
Solution: Verify target table exists and contains data, check column mappings
Metrics Validation Shows Large Differences¶
Possible Causes:
Incomplete data migration
Data type conversion issues causing value changes
Filter/WHERE clause differences between source and target queries
Solution: Review migration logs, verify row counts first, check data transformations
Report File Not Generated¶
Possible Causes:
Output directory doesn’t exist or lacks write permissions
Validation configuration has all levels set to false
Application crashed before report generation
Solution: Check output path permissions, review logs for errors, enable at least one validation level
Troubleshooting¶
Common Issues and Solutions¶
Issue: “Configuration file not found”¶
Symptom:
Solution:
Verify the file path is correct
Use absolute paths: /home/user/configs/validation.yaml
Check file permissions
Issue: “Connection error”¶
Symptom:
Solutions:
Verify connection parameters:
Check network connectivity:
Verify firewall rules allow connections from your machine
For SQL Server SSL issues:
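A hedged adjustment using the SQL Server connection fields documented earlier; loosen these only for testing and tighten them again once the certificate issue is resolved:

```yaml
source_connection:
  trust_server_certificate: "yes"   # accept the server certificate
  encrypt: "optional"               # relax encryption negotiation if needed
```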
Issue: “Invalid parameter”¶
Symptom:
Solution:
Issue: “Table not found”¶
Symptom:
Solutions:
Verify fully qualified name:
Check case sensitivity:
Verify table exists in source database
Issue: “YAML formatting error”¶
Symptom:
Solutions:
Check indentation (use spaces, not tabs)
Incorrect example (mixed indentation with tabs):
Quote special characters:
Validate YAML syntax using online validators or:
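For example, a quick local syntax check (requires PyYAML in your Python environment; the file name is illustrative):

```bash
python -c "import yaml, sys; yaml.safe_load(open(sys.argv[1])); print('YAML OK')" validation_config.yaml
```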
Issue: “Validation fails with tolerance errors”¶
Symptom:
Solution: Adjust tolerance in configuration:
Issue: “Out of memory errors with large tables”¶
Solutions:
Enable chunking:
Reduce thread count:
Filter data:
Exclude large columns:
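A combined sketch of the four adjustments above (table and column names hypothetical):

```yaml
max_threads: 2                                   # reduce parallel workers
tables:
  - fully_qualified_name: SALES_DB.dbo.BIG_TABLE
    chunk_number: 20                             # validate rows in smaller chunks
    where_clause: "CREATED_AT >= '2024-01-01'"   # limit the data volume
    target_where_clause: "CREATED_AT >= '2024-01-01'"
    use_column_selection_as_exclude_list: true   # exclude the listed columns
    column_selection_list:
      - LARGE_TEXT_COLUMN
```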
Getting Help¶
Check logs: Review log files in the output directory
Enable debug logging:
Review validation reports in the output directory
Consult documentation: Full Documentation
Report issues: Email us at snowconvert-support@snowflake.com