Snowflake Data Validation CLI - Complete Usage Guide¶
Overview¶
The Snowflake Data Validation CLI (snowflake-data-validation or sdv) is a comprehensive command-line tool for validating data migrations between source databases (SQL Server, Teradata, Amazon Redshift) and Snowflake. It provides multi-level validation strategies to ensure data consistency and quality.
Key Features¶
- Multi-Level Validation: Schema, statistical metrics, and row-by-row data validation
- Multiple Source Platforms: SQL Server, Teradata, and Amazon Redshift
- Tables and Views Validation: Validate both tables and database views
- Flexible Execution Modes: Synchronous, asynchronous, and script generation
- Comprehensive Configuration: YAML-based configuration with extensive customization options
- Detailed Reporting: Comprehensive validation reports with mismatch information
Prerequisites¶
Before installing the Snowflake Data Validation CLI, ensure you have the following prerequisites installed:
System Requirements¶
- Python: Version 3.8 or higher
- pip: Latest version recommended
- Operating System: Linux, macOS, or Windows
ODBC Drivers¶
The CLI requires appropriate ODBC drivers to be installed on your system for connecting to source databases. Install the ODBC driver that matches your source database dialect:
SQL Server ODBC Driver¶
For SQL Server as a source database, you need the Microsoft ODBC Driver for SQL Server.
Recommended Version: ODBC Driver 17 or 18 for SQL Server
Installation Instructions:
-
Linux:
-
macOS:
-
Windows: Download and install from Microsoft’s official download page
Verification:
Documentation: Microsoft ODBC Driver for SQL Server
Teradata ODBC Driver¶
For Teradata as a source database, you need the Teradata ODBC Driver.
Recommended Version: Teradata ODBC Driver 17.20 or higher
Installation Instructions:
- Download the Teradata Tools and Utilities (TTU) package from Teradata Downloads
- Select your operating system and download the appropriate installer
- Run the installer and select “ODBC Driver” during installation
- Configure the driver according to Teradata’s documentation
Note: You may need to create a Teradata account to access the download page.
Configuration: After installation, you may need to configure the ODBC driver:
- Linux/macOS: Edit
/etc/odbc.iniand/etc/odbcinst.ini - Windows: Use the ODBC Data Source Administrator
Documentation: Teradata ODBC Driver Documentation
Amazon Redshift ODBC Driver¶
For Amazon Redshift as a source database, you need the Amazon Redshift ODBC Driver or a PostgreSQL ODBC Driver (since Redshift is PostgreSQL-compatible).
Option 1: Amazon Redshift ODBC Driver (Recommended)
Recommended Version: Amazon Redshift ODBC Driver 2.x
Installation Instructions:
- Download from Amazon Redshift ODBC Driver Download
- Choose your operating system and architecture
- Follow the installation wizard
Option 2: PostgreSQL ODBC Driver (Alternative)
-
Linux:
-
macOS:
-
Windows: Download from PostgreSQL ODBC Driver
Verification:
Documentation: Amazon Redshift ODBC Driver Documentation
Additional Tools (Optional)¶
-
unixODBC (Linux/macOS): Required for ODBC driver management
Network Access¶
Ensure your environment has network access to:
- Source database (SQL Server, Teradata, or Redshift)
- Snowflake account
- Package repositories (for pip installation)
Installation¶
Base Installation¶
Source-Specific Installation¶
Install with the appropriate database driver for your source system:
Post-Installation Verification¶
After installation, verify the CLI is correctly installed:
Quick Start¶
1. Generate a Configuration Template¶
2. Auto-Generate Configuration from Connection¶
3. Run Validation¶
Best Practices & Guidance¶
This section provides strategic guidance on how to approach data validation effectively, minimize resource consumption, and identify issues early.
Incremental Validation Approach¶
Note
Always start small and scale up incrementally. Running full validation on large datasets immediately can:
- Consume significant compute resources on both source and target systems
- Take hours or days to complete
- Make troubleshooting difficult if issues are found
- Impact production systems if run during business hours
Recommended Validation Strategy¶
Follow this proven approach to ensure efficient and effective validation:
Phase 1: Start with a Sample Dataset¶
Goal: Verify configuration and establish baseline
Approach:
- Test with 1-2 small tables first (< 100,000 rows)
- Choose tables with diverse data types to validate type mapping
- Verify connectivity and authentication work correctly
- Confirm output format meets your needs
Example Configuration:
What to Verify:
- ✅ Connection to both source and target successful
- ✅ Schema validation passes
- ✅ Row count matches
- ✅ Data types mapped correctly
- ✅ Validation report generated successfully
Understanding where_clause vs target_where_clause:
The tool provides two filtering options:
-
where_clause: Applied only to the source table -
target_where_clause: Applied only to the target table (Snowflake)
Common Usage Patterns:
Phase 2: Verify Small Subsets of Production Tables¶
Goal: Test against actual production data patterns with limited scope
Approach:
- Select a subset of rows from production tables (10,000 - 100,000 rows)
- Use
where_clauseandtarget_where_clauseto restrict data - Focus on recent data or specific partitions
- Validate critical business columns first
Example Configuration:
Key Points:
- 🎯 Use date ranges or ID ranges to limit scope
- 🎯 Test different data patterns (recent vs. historical, high-volume dates)
- 🎯 Validate critical business columns before validating all columns
Phase 3: Partition-Based Validation for Large Tables¶
Goal: Validate large tables efficiently using partitioning strategy
⚠️ IMPORTANT: For tables with millions or billions of rows, validating the entire table at once is:
- Resource-intensive (high compute costs)
- Time-consuming (can take hours/days)
- Risky (harder to identify specific issue patterns)
Recommended Approach for Large Tables:
Strategy 1: Date-Based Partitioning
Validate data in chunks based on date ranges:
Strategy 2: Modulo-Based Sampling
Use modulo arithmetic to sample evenly distributed rows:
Strategy 3: Statistical Sampling
For very large tables (> 100M rows), validate representative samples:
Strategy 4: Progressive Partition Validation
Validate multiple partitions progressively:
Phase 4: Full Validation¶
Goal: Complete comprehensive validation after successful subset testing
When to Run Full Validation:
- ✅ Sample validations pass successfully
- ✅ Subset validations show no data quality issues
- ✅ You have allocated sufficient time and compute resources
- ✅ Preferably during off-peak hours
Considerations:
- Use asynchronous validation for large datasets to avoid timeouts
- Consider using script generation mode to run validations in parallel
- Monitor resource consumption on both source and target systems
- Plan for validation to run during maintenance windows
Example:
Performance Optimization Tips¶
1. Use Appropriate Validation Types¶
Not all tables need all validation types:
2. Prioritize Critical Columns¶
For large tables, validate critical business columns first:
3. Leverage Partitioning Metadata¶
If your tables are partitioned, validate partition by partition:
4. Use Asynchronous Validation for Production¶
For production environments, use asynchronous validation:
Cost and Resource Management¶
Estimate Query Costs¶
Before running validation on large tables:
-
Estimate row counts:
-
Estimate data volume:
-
Start with small partitions: If estimated size > 10 GB, break into smaller chunks
Compute Warehouse Sizing (Snowflake)¶
- Small tables (< 1M rows): XS or S warehouse
- Medium tables (1M - 10M rows): S or M warehouse
- Large tables (> 10M rows): M or L warehouse, use partitioning
- Very large tables (> 100M rows): L or XL warehouse, mandatory partitioning
Common Pitfalls to Avoid¶
| ❌ Don’t Do This | ✅ Do This Instead |
|---|---|
| Validate entire 1B row table at once | Validate in partitions of 10M-100M rows |
| Run validation during business hours on production | Schedule during off-peak hours or use read replicas |
| Skip sample testing and go straight to full validation | Always validate samples first |
| Use same configuration for all table sizes | Tailor validation strategy to table size |
| Validate all columns for all tables | Prioritize critical columns, especially for large tables |
| Ignore resource consumption | Monitor and set appropriate compute resources |
Validation Checklist¶
Use this checklist to ensure you’re following best practices:
Before Starting:
- ODBC drivers installed and tested
- Connectivity to source and target verified
- Configuration template generated
- Test credentials have appropriate read permissions
Initial Testing (Phase 1):
- Selected 1-2 small tables for initial test
- Configuration file created and reviewed
- Sample validation executed successfully
- Validation report reviewed and understood
Subset Validation (Phase 2):
- Identified subset of production data to validate
- Used
where_clauseandtarget_where_clauseto restrict rows - Validated 10,000 - 100,000 rows successfully
- Reviewed results for any data quality issues
Large Table Strategy (Phase 3):
- Identified tables > 10M rows
- Chosen partitioning strategy (date, ID range, modulo)
- Estimated compute costs for validation
- Tested validation on 1-2 partitions first
- Documented partition validation schedule
Production Validation (Phase 4):
- All subset validations passed
- Resource allocation planned (compute, time)
- Validation scheduled during maintenance window
- Using asynchronous or script generation mode
- Monitoring plan in place
Example: Complete Validation Strategy¶
Here’s a complete example of validating a large e-commerce database:
Day 1: Initial Setup and Small Tables
Day 2: Subset of Medium Tables
Day 3: Partition Strategy for Large Tables
Day 4-5: Progressive Partition Validation
Day 6: Full Validation (Off-Peak)
CLI Commands¶
Command Structure¶
All commands follow this consistent structure:
Where:
<source_dialect>is one of:sqlserver,teradata,redshift,snowflake.<command>is one of:run-validation- Run synchronous validationrun-async-validation- Run asynchronous validationgenerate-validation-scripts- Generate validation scriptsget-configuration-files- Get configuration templatesauto-generated-configuration-file- Interactive config generation
Global Options¶
These options can be used with the CLI without specifying a source dialect or command:
Check Version¶
Display the current installed version of the Snowflake Data Validation CLI:
Output Example:
Use Cases:
- Verify successful installation
- Check which version is currently installed
- Confirm version before reporting issues
- Ensure compatibility with documentation
Help¶
Display general help information:
Dialect-Specific Command References¶
For detailed command documentation specific to your source database, see the following pages:
- SQL Server Commands Reference - Complete command reference for SQL Server migrations
- Teradata Commands Reference - Complete command reference for Teradata migrations
- Amazon Redshift Commands Reference - Complete command reference for Redshift migrations
- Snowflake Commands Reference - Complete command reference for Snowflake-to-Snowflake migrations
Each page provides:
- Detailed syntax for all commands
- Complete option descriptions with examples
- Connection configuration specifics
- Dialect-specific examples
- Troubleshooting tips for that platform
- Best practices for that database type
SQL Server Commands¶
For complete SQL Server command documentation, see SQL Server Commands Reference.
Quick Links:
- Run Synchronous Validation
- Run Asynchronous Validation
- Generate Validation Scripts
- Get Configuration Templates
- Auto-Generate Configuration File
- Connection Configuration
- Troubleshooting
Common Commands¶
Run Validation:
Generate Scripts:
Get Templates:
For complete documentation, see SQL Server Commands Reference.
Teradata Commands¶
For complete Teradata command documentation, see Teradata Commands Reference.
Quick Links:
- Run Synchronous Validation
- Run Asynchronous Validation
- Generate Validation Scripts
- Get Configuration Templates
- Auto-Generate Configuration File
- Connection Configuration
- Troubleshooting
Common Commands¶
Run Validation:
Generate Scripts:
Get Templates:
For complete documentation, see Teradata Commands Reference.
Amazon Redshift Commands¶
For complete Amazon Redshift command documentation, see Redshift Commands Reference.
Quick Links:
- Run Synchronous Validation
- Run Asynchronous Validation
- Generate Validation Scripts
- Get Configuration Templates
- Auto-Generate Configuration File
- Connection Configuration
- Troubleshooting
Common Commands¶
Run Validation:
Generate Scripts:
Get Templates:
For complete documentation, see Redshift Commands Reference.
Snowflake Commands¶
For complete Snowflake-to-Snowflake command documentation, see Snowflake Commands Reference.
Quick Links:
- Run Synchronous Validation
- Run Asynchronous Validation
- Source Validate
- Generate Validation Scripts
- Get Configuration Templates
- Auto-Generate Configuration File
- Connection Configuration
- Troubleshooting
Common Commands¶
Run Validation:
Generate Scripts:
Get Templates:
For complete documentation, see Snowflake Commands Reference.
Configuration File Reference¶
Global Configuration¶
The global configuration section defines the overall behavior of the validation process.
Platform Configuration Options¶
source_platform (required)
- Type: String
- Valid Values:
SqlServer,Teradata,Redshift,Snowflake - Description: The source database platform for validation
- Example:
source_platform: SqlServer
target_platform (required)
- Type: String
- Valid Values:
Snowflake - Description: The target database platform (currently only Snowflake is supported)
- Example:
target_platform: Snowflake
output_directory_path (required)
- Type: String (path)
- Description: Directory where validation results, logs, and reports will be saved
- Example:
output_directory_path: /home/user/validation_output
max_threads (optional)
- Type: String or Integer
- Valid Values:
"auto"or positive integer (1-32) - Default:
"auto" - Description: Controls parallelization for validation operations
"auto": Automatically detects optimal thread count based on CPU cores- Integer value: Specifies exact number of threads to use
- Examples:
target_database (required for Teradata only)
- Type: String
- Description: Target database name in Snowflake for Teradata validations
- Example:
target_database: PROD_DB
source_validation_files_path (optional)
- Type: String (path)
- Description: Path to the directory containing the source validation files.
- Example:
source_validation_files_path: /path/to/source_validation_file/directory
target_validation_files_path (optional)
- Type: String (path)
- Description: Path to the directory containing the target validation files.
- Example:
target_validation_files_path: /path/to/targetvalidation_file/directory
Connection Configuration¶
Define how to connect to source and target databases.
Source Connection Configuration¶
SQL Server Source Connection¶
Connection Fields:
-
mode(required)- Type: String
- Valid Values:
credentials - Description: Connection mode for SQL Server
-
host(required)- Type: String
- Description: SQL Server hostname or IP address
- Example:
"sqlserver.company.com"or"192.168.1.100"
-
port(required)- Type: Integer
- Default: 1433
- Description: SQL Server port number
-
username(required)- Type: String
- Description: SQL Server authentication username
-
password(required)- Type: String
- Description: SQL Server authentication password
- Security Note: Consider using environment variables or secret management
-
database(required)- Type: String
- Description: SQL Server database name
-
trust_server_certificate(optional)- Type: String
- Valid Values:
"yes","no" - Default:
"no" - Description: Whether to trust the server certificate for SSL/TLS connections
-
encrypt(optional)- Type: String
- Valid Values:
"yes","no","optional" - Default:
"yes" - Description: Connection encryption setting
Teradata Source Connection¶
Connection Fields:
-
mode(required)- Type: String
- Valid Values:
credentials
-
host(required)- Type: String
- Description: Teradata hostname or IP address
-
username(required)- Type: String
- Description: Teradata authentication username
-
password(required)- Type: String
- Description: Teradata authentication password
-
database(required)- Type: String
- Description: Teradata database name
Amazon Redshift Source Connection¶
Connection Fields:
-
mode(required)- Type: String
- Valid Values:
credentials
-
host(required)- Type: String
- Description: Redshift cluster endpoint
-
port(required)- Type: Integer
- Default: 5439
- Description: Redshift port number
-
username(required)- Type: String
- Description: Redshift authentication username
-
password(required)- Type: String
- Description: Redshift authentication password
-
database(required)- Type: String
- Description: Redshift database name
Target Connection (Snowflake)¶
Snowflake connections support three modes: name, default, and credentials (IPC only and SnowConvert exclusive).
Option 1: Named Connection¶
Use a pre-configured Snowflake connection saved in your Snowflake connections file.
Fields:
mode(required): Must be"name"name(required): Name of the saved Snowflake connection
Option 2: Default Connection¶
Use the default Snowflake connection from your environment.
Fields:
mode(required): Must be"default"
Option 3: Credentials Mode (IPC Only)¶
Note: The
credentialsmode is only available when using IPC (Inter-Process Communication) commands directly via CLI parameters, not in YAML configuration files. This mode is exclusive to the SnowConvert UI.
Validation Configuration¶
Controls which validation levels are executed.
Validation Options:
-
schema_validation(optional)- Type: Boolean
- Default:
true - Description: Validates table and column schema consistency
- Checks:
- Column names match between source and target
- Data types are compatible
- Column nullability settings
- Primary key definitions
-
metrics_validation(optional)- Type: Boolean
- Default:
true - Description: Validates statistical metrics for each column
- Checks:
-
Row counts
-
Distinct value counts
-
Null value counts
-
Min/max values
-
Average, sum, standard deviation (for numeric columns)
-
-
row_validation(optional)- Type: Boolean
- Default:
false - Description: Validates data at the row level using hash-based comparison
- Note: Requires index columns for row identification. If not specified in the configuration, the tool attempts to auto-detect them from primary keys.
- Warning: This is the most resource-intensive validation level
- Checks:
- MD5 hash comparison of row chunks
- Identifies specific rows with differences
-
max_failed_rows_number(optional)- Type: Integer
- Default: 100
- Minimum: 1
- Description: Maximum number of failed rows to report per table
- Example:
max_failed_rows_number: 250
-
exclude_metrics(optional)- Type: Boolean
- Default:
false - Description: When
true, excludes certain statistical metrics (avg, sum, stddev, variance) from validation - Use Case: Useful for large tables where statistical calculations might cause an overflow.
-
apply_metric_column_modifier(optional)- Type: Boolean
- Default:
true - Description: Applies column modifiers defined in metric templates
- Use Case: Advanced users with custom metric calculations
-
custom_templates_path(optional)- Type: String (path)
- Description: Path to directory containing custom Jinja2 query templates
- Example:
custom_templates_path: /opt/validation/custom_templates
Table Configuration¶
Defines which tables to validate and how to validate them.
Table Configuration Fields:
-
fully_qualified_name(required)- Type: String
- Format:
database.schema.tableorschema.table - Description: Full table identifier in the source database
- Examples:
-
target_database(optional)- Type: String
- Default: Source database name from fully_qualified_name field
- Description: Target database name if different from source database name
- Example:
-
target_schema(optional)- Type: String
- Default: Source schema name from fully_qualified_name field
- Description: Target schema name if different from source schema name
- Example:
-
target_name(optional)- Type: String
- Default: Source table name from fully_qualified_name field
- Description: Target table name if different from source table name
- Example:
-
use_column_selection_as_exclude_list(required)-
Type: Boolean
-
Description: Determines how
column_selection_listis interpretedfalse: Include only the specified columnstrue: Exclude the specified columns (include all others)
-
Examples:
-
-
column_selection_list(required)- Type: List of strings
- Description: List of column names to include or exclude
- Note: Use an empty list
[]to include all columns
-
index_column_list(optional)- Type: List of strings
- Default: Auto-detected from primary keys
- Description: Columns to use as unique identifiers for row validation
- Use Case: Specify when the table doesn’t have a primary key or you want to use different columns
- Example:
-
target_index_column_list(optional)- Type: List of strings
- Description: Index columns in the target table (if different from source)
- Note: Automatically derived from
column_mappingsif not specified
-
where_clause(optional)- Type: String
- Default:
""(empty, no filter) - Description: SQL WHERE clause to filter source data (without “WHERE” keyword)
- Examples:
-
target_where_clause(optional)- Type: String
- Default:
""(empty, no filter) - Description: SQL WHERE clause to filter target data
- Best Practice: Should match
where_clauseto ensure consistent comparison - Example:
-
is_case_sensitive(optional)- Type: Boolean
- Default:
false - Description: Whether column name matching should be case-sensitive
-
chunk_number(optional)- Type: Integer
- Default: 0 (no chunking)
- Minimum: 0
- Description: Number of chunks to split row validation into
- Use Case: Large tables benefit from chunking for better performance
- Example:
-
max_failed_rows_number(optional)- Type: Integer
- Minimum: 1
- Description: Maximum failed rows to report for this specific table
- Note: Overrides the global
max_failed_rows_numbersetting
-
column_mappings(optional)- Type: Dictionary (key-value pairs)
- Description: Maps source column names to target column names when they differ
- Format:
source_column_name: target_column_name - Example:
-
exclude_metrics(optional)- Type: Boolean
- Description: Exclude metrics validation for this specific table
- Note: Overrides the global
exclude_metricssetting
-
apply_metric_column_modifier(optional)- Type: Boolean
- Description: Apply column modifiers for this specific table
- Note: Overrides the global
apply_metric_column_modifiersetting
View Configuration¶
Views are validated similarly to tables but are configured in a separate views: section. View validation creates temporary tables internally to materialize view schema for comparison between source and target systems.
View Configuration Fields:
View configuration uses the same fields as table configuration, with the following key points:
-
fully_qualified_name(required)- Type: String
- Format:
database.schema.vieworschema.view - Description: Full view identifier in the source database
- Examples:
-
target_database(optional)- Type: String
- Default: Source database name from fully_qualified_name field
- Description: Target database name if different from source database name
-
target_schema(optional)- Type: String
- Default: Source schema name from fully_qualified_name field
- Description: Target schema name if different from source schema name
-
target_name(optional)- Type: String
- Default: Source view name from fully_qualified_name field
- Description: Target view name if different from source view name
-
use_column_selection_as_exclude_list(required)- Type: Boolean
- Description: Determines how
column_selection_listis interpretedfalse: Include only the specified columnstrue: Exclude the specified columns (include all others)
-
column_selection_list(required)- Type: List of strings
- Description: List of column names to include or exclude
- Note: Use an empty list
[]to include all columns
-
index_column_list(optional)- Type: List of strings
- Description: Columns to use as unique identifiers for row validation
- Use Case: Required when row validation is enabled
-
where_clause(optional)- Type: String
- Default:
""(empty, no filter) - Description: SQL WHERE clause to filter source data (without “WHERE” keyword)
-
target_where_clause(optional)- Type: String
- Default:
""(empty, no filter) - Description: SQL WHERE clause to filter target data
-
column_mappings(optional)- Type: Dictionary (key-value pairs)
- Description: Maps source column names to target column names when they differ
-
chunk_number(optional)- Type: Integer
- Default: 0 (no chunking)
- Description: Number of chunks to split row validation into
-
max_failed_rows_number(optional)- Type: Integer
- Description: Maximum failed rows to report for this specific view
-
is_case_sensitive(optional)- Type: Boolean
- Default:
false - Description: Whether column name matching should be case-sensitive
How View Validation Works:
- The CLI creates temporary tables from view definitions in both source and target systems
- Data is extracted from the views into these temporary tables
- Validation is performed on the temporary tables
- Temporary tables are cleaned up after validation completes
Best Practices for View Validation:
- Use filtering for large views: Apply
where_clauseandtarget_where_clauseto limit data volume - Test with small subsets first: Start with filtered validation before full view validation
- Consider view complexity: Complex views with many joins may take longer to validate
- Monitor resource usage: Views that materialize large datasets consume significant memory
Comparison Configuration¶
Controls comparison behavior and tolerance levels.
Comparison Options:
tolerance(optional)- Type: Float
- Default: 0.001 (0.1%)
- Description: Acceptable tolerance for statistical metric differences
- Use Case: Allows for small differences due to rounding or data type conversions
- Examples:
Logging Configuration¶
Controls logging behavior for validation operations.
Logging Options:
-
level(optional)- Type: String
- Valid Values:
DEBUG,INFO,WARNING,ERROR,CRITICAL - Default:
INFO - Description: Default logging level for all loggers
- Level Descriptions:
DEBUG: Detailed diagnostic informationINFO: General informational messagesWARNING: Warning messages for potentially problematic situationsERROR: Error messages for serious issuesCRITICAL: Critical errors that may cause application failure
-
console_level(optional)- Type: String
- Valid Values:
DEBUG,INFO,WARNING,ERROR,CRITICAL - Default: Same as
level - Description: Logging level for console output
- Use Case: Set to
WARNINGorERRORto reduce console noise
-
file_level(optional)- Type: String
- Valid Values:
DEBUG,INFO,WARNING,ERROR,CRITICAL - Default: Same as
level - Description: Logging level for file output
- Use Case: Set to
DEBUGfor detailed file logs while keeping console clean
Example Configurations:
Note: CLI --log-level parameter overrides configuration file settings.
Database and Schema Mappings¶
Map source database/schema names to target names when they differ.
Mapping Options:
-
database_mappings(optional)- Type: Dictionary
- Description: Maps source database names to target database names
- Use Case: When database names differ between source and Snowflake
- Example:
-
schema_mappings(optional)- Type: Dictionary
- Description: Maps source schema names to target schema names
- Use Case: When schema names differ between source and Snowflake
- Example:
Complete Configuration Examples¶
Example 1: SQL Server to Snowflake - Basic Validation¶
Example 2: Teradata to Snowflake - Comprehensive Validation¶
Example 3: Redshift to Snowflake - Advanced Configuration¶
Example 4: Minimal Configuration¶
Example 5: View Validation Configuration¶
Example 6: Combined Tables and Views Validation¶
Advanced Usage¶
Working with Large Tables¶
For large tables, consider these optimization strategies:
Enable Chunking¶
Increase Thread Count¶
3. Filter Data¶
Selective Column Validation¶
Using Custom Query Templates¶
For advanced users, you can provide custom Jinja2 templates:
Custom template directory structure:
Asynchronous Validation Workflow¶
For environments with restricted database access or long-running validations:
Step 1: Generate Scripts¶
This generates SQL scripts in the output directory.
Step 2: Execute Scripts Manually¶
Execute the generated scripts on source and target databases, saving results to CSV files.
Step 3: Run Async Validation¶
This compares the pre-generated metadata files.
CI/CD Integration¶
Integrate validation into your deployment pipeline:
Handling Multiple Environments¶
Create separate configuration files for each environment:
Run validation for specific environment:
Validation Reports¶
Overview¶
The Snowflake Data Validation tool generates comprehensive CSV reports that document the results of data migration validations between source and target databases. These reports help identify discrepancies in schema, metrics, and row-level data.
Report Types¶
The validation tool generates different types of reports based on the validation levels configured:
1. Main Validation Report (validation_report.csv)¶
This consolidated report contains results from schema and metrics validations for all tables.
2. Row Validation Reports (per table)¶
Separate reports generated for each table when row validation is enabled, containing detailed row-level comparison results.
Main Validation Report Structure¶
The main validation report contains the following columns:
| Column Name | Description |
|---|---|
| VALIDATION_TYPE | Type of validation performed: SCHEMA VALIDATION or METRICS VALIDATION |
| TABLE | Fully qualified name of the table being validated (e.g., database.schema.table) |
| COLUMN_VALIDATED | Name of the column being validated (or table-level attribute for schema validation) |
| EVALUATION_CRITERIA | The specific property being compared (e.g., DATA_TYPE, NULLABLE, ROW_COUNT, min, max) |
| SOURCE_VALUE | The value from the source database |
| SNOWFLAKE_VALUE | The value from the target (Snowflake) database |
| STATUS | Validation result status (see Status Values section below) |
| COMMENTS | Additional context or explanation for the validation result |
Validation Types Explained¶
Schema Validation¶
Compares structural metadata between source and target tables:
- Column existence: Ensures columns present in source exist in target
- Data types: Validates column data types match (with configurable mappings)
- Nullability: Checks if NULL constraints match between source and target
- Primary keys: Verifies primary key definitions
- Column precision/scale: Validates numeric precision and scale values
- Character length: Compares VARCHAR/CHAR column lengths
Example Schema Validation Row:
Metrics Validation¶
Compares statistical metrics calculated on column data:
- Row count: Total number of rows in the table
- min: Minimum value in numeric/date columns
- max: Maximum value in numeric/date columns
- count: Count of non-null values
- count_distinct: Number of distinct values
- avg, sum, stddev, variance: Statistical measures (can be excluded via configuration)
Example Metrics Validation Row:
Row Validation¶
Generates separate per-table reports comparing actual row data using MD5 checksums to detect differences.
Row Validation Report Structure¶
Row validation reports have a different structure focused on identifying specific rows with differences:
| Column Name | Description |
|---|---|
| ROW_NUMBER | Sequential row number in the report |
| TABLE_NAME | Fully qualified table name |
| RESULT | Outcome of the row comparison (see Result Values below) |
| [INDEX_COLUMNS]_SOURCE | Primary key/index column values from source |
| [INDEX_COLUMNS]_TARGET | Primary key/index column values from target |
| SOURCE_QUERY | SQL query to retrieve the row from source database |
| TARGET_QUERY | SQL query to retrieve the row from target database |
Status Values¶
The STATUS column in the main validation report can have the following values:
| Status | Meaning |
|---|---|
| SUCCESS | Validation passed - values match between source and target |
| FAILURE | Validation failed - values differ between source and target |
| WARNING | Potential issue detected that may require attention |
| NOT_FOUND_SOURCE | Element exists in target but not in source |
| NOT_FOUND_TARGET | Element exists in source but not in target |
Result Values (Row Validation)¶
The RESULT column in row validation reports can have the following values:
| Result | Meaning |
|---|---|
| SUCCESS | Row data matches between source and target |
| FAILURE | Row data differs between source and target (MD5 checksum mismatch) |
| NOT_FOUND_SOURCE | Row exists in target but not in source |
| NOT_FOUND_TARGET | Row exists in source but not in target |
Understanding Validation Results¶
Interpreting Schema Validation Results¶
Success Scenario:
- All columns exist in both source and target
- Data types match (considering configured type mappings)
- Nullability constraints are consistent
- All structural attributes align
Common Failure Scenarios:
-
Data Type Mismatch
- Source:
VARCHAR(50), Target:VARCHAR(100) - Status: May be SUCCESS if within tolerance, or FAILURE if strict matching is required
- Source:
-
Missing Column
- Source has column
phone_number, target does not - Status: NOT_FOUND_TARGET
- Source has column
-
Nullability Difference
- Source:
NOT NULL, Target:NULL - Status: FAILURE
- Source:
Interpreting Metrics Validation Results¶
Success Scenario:
- Row counts match exactly
- Statistical metrics are within configured tolerance (default: 0.1%)
- All calculated metrics align between source and target
Common Failure Scenarios:
-
Row Count Mismatch
- Source: 10,000 rows, Target: 9,998 rows
- Status: FAILURE
- Action: Investigate missing rows
-
Min/Max Value Differences
- Source max date:
2024-12-31, Target max date:2024-12-30 - Status: FAILURE
- Action: Check for incomplete data migration
- Source max date:
-
Statistical Variance
- Source count_distinct: 1,000, Target count_distinct: 995
- Status: FAILURE (if beyond tolerance)
- Action: Investigate potential duplicates or missing values
Interpreting Row Validation Results¶
Success Scenario:
- All rows in source have matching rows in target (by MD5 checksum)
- No orphaned rows in either database
- Primary key values align correctly
Common Failure Scenarios:
-
Row Content Mismatch
- Same primary key, different column values
- Result: FAILURE
- Action: Use provided SQL queries to investigate specific differences
-
Missing Rows
- Row exists in source but not in target
- Result: NOT_FOUND_TARGET
- Action: Check migration completeness
-
Extra Rows
- Row exists in target but not in source
- Result: NOT_FOUND_SOURCE
- Action: Investigate unexpected data in target
Using the Reports¶
Quick Assessment¶
- Filter by STATUS column: Focus on
FAILURE,WARNING,NOT_FOUND_SOURCE, andNOT_FOUND_TARGETrows - Group by VALIDATION_TYPE: Assess schema issues separately from metrics issues
- Group by TABLE: Identify which tables have the most issues
Investigating Failures¶
For Schema Validation:
- Review the
EVALUATION_CRITERIAto understand what attribute failed - Compare
SOURCE_VALUEvsSNOWFLAKE_VALUE - Check if differences are acceptable (e.g., VARCHAR size increase)
- Update type mappings or schema definitions if needed
For Metrics Validation:
- Review the metric that failed (e.g.,
row_count,max,min) - Calculate the difference magnitude
- Determine if within acceptable business tolerance
- Use the detailed queries to investigate source of discrepancy
For Row Validation:
- Open the table-specific row validation report
- Identify rows with
FAILUREstatus - Use the provided
SOURCE_QUERYandTARGET_QUERYto retrieve actual row data - Compare column-by-column to identify specific field differences
- Investigate why values differ (data type conversion, truncation, transformation)
Configuration Options Affecting Reports¶
Tolerance Settings¶
The comparison_configuration.tolerance setting affects metrics validation:
- Values within tolerance are marked as SUCCESS
- Values beyond tolerance are marked as FAILURE
Validation Levels¶
Control which validations run and therefore which reports are generated:
Excluded Metrics¶
Exclude specific metrics from validation:
Maximum Failed Rows¶
Limit the number of failed rows reported in row validation:
Report File Locations¶
Reports are generated in the configured output directory:
File naming convention:
- Timestamp format:
YYYY-MM-DD_HH-MM-SS - Row validation reports include table name and unique ID to prevent collisions
Best Practices¶
- Start with Schema Validation: Ensure structural alignment before validating data
- Use Appropriate Tolerance: Set realistic tolerance thresholds for metrics validation
- Selective Row Validation: Enable row validation only for critical tables (resource intensive)
- Iterative Approach: Fix schema issues first, then metrics, then row-level differences
- Document Acceptable Differences: Some type conversions or value transformations may be expected
- Automate Report Analysis: Use scripts to parse CSV reports and flag critical issues
- Preserve Reports: Archive validation reports for audit trails and compliance
Troubleshooting Report Issues¶
All Validations Showing FAILURE¶
Possible Causes:
- Incorrect database/schema mappings in configuration
- Type mapping file not loaded correctly
- Connection to wrong target database
Solution: Verify database_mappings, schema_mappings, and connection settings
Row Validation Shows All NOT_ FOUND_ TARGET¶
Possible Causes:
- Target table empty or not migrated yet
- Incorrect target table name
- Primary key/index columns mismatch
Solution: Verify target table exists and contains data, check column mappings
Metrics Validation Shows Large Differences¶
Possible Causes:
- Incomplete data migration
- Data type conversion issues causing value changes
- Filter/WHERE clause differences between source and target queries
Solution: Review migration logs, verify row counts first, check data transformations
Report File Not Generated¶
Possible Causes:
- Output directory doesn’t exist or lacks write permissions
- Validation configuration has all levels set to false
- Application crashed before report generation
Solution: Check output path permissions, review logs for errors, enable at least one validation level
Troubleshooting¶
Common Issues and Solutions¶
Issue: “Configuration file not found”¶
Symptom:
Solution:
- Verify the file path is correct
- Use absolute paths:
/home/user/configs/validation.yaml - Check file permissions
Issue: “Connection error”¶
Symptom:
Solutions:
-
Verify connection parameters:
-
Check network connectivity:
-
Verify firewall rules allow connections from your machine
-
For SQL Server SSL issues:
Issue: “Invalid parameter”¶
Symptom:
Solution:
Issue: “Table not found”¶
Symptom:
Solutions:
-
Verify fully qualified name:
-
Check case sensitivity:
-
Verify table exists in source database
Issue: “YAML formatting error”¶
Symptom:
Solutions:
-
Check indentation (use spaces, not tabs)
Incorrect example (mixed indentation with tabs):
-
Quote special characters:
-
Validate YAML syntax using online validators or:
Issue: “Validation fails with tolerance errors”¶
Symptom:
Solution: Adjust tolerance in configuration:
Issue: “Out of memory errors with large tables”¶
Solutions:
-
Enable chunking:
-
Reduce thread count:
-
Filter data:
-
Exclude large columns:
Getting Help¶
- Check logs: Review log files in the output directory
- Enable debug logging:
- Review validation reports in the output directory
- Consult documentation: Full Documentation
- Report issues: Email us at:snowconvert-support@snowflake.com