Snowpark Checkpoints library: Hypothesis

Hypothesis unit testing

Hypothesis is a Python testing library that extends traditional unit testing by automatically generating a wide range of input data. It uses property-based testing: instead of specifying individual test cases, you describe the expected behavior of your code as properties or conditions, and Hypothesis generates examples that exercise those properties. This approach helps uncover edge cases and unexpected behaviors, which makes it especially effective for complex functions. For more information, see the Hypothesis documentation.
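As a minimal sketch of property-based testing with plain Hypothesis (no Snowpark involved), the following example states two properties of a small helper and lets Hypothesis generate the input strings. The normalize_whitespace function and the test names are hypothetical and exist only for illustration:

```python
from hypothesis import given, strategies as st


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())


@given(st.text())
def test_normalize_is_idempotent(text: str) -> None:
    # Property: normalizing an already-normalized string changes nothing.
    once = normalize_whitespace(text)
    assert normalize_whitespace(once) == once


@given(st.text())
def test_normalize_has_no_double_spaces(text: str) -> None:
    # Property: the result never contains two consecutive spaces.
    assert "  " not in normalize_whitespace(text)
```

Neither test lists concrete inputs. Each one describes a condition that should hold for any string, and Hypothesis searches for a counterexample when the test runs.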

The snowpark-checkpoints-hypothesis package extends the Hypothesis library to generate synthetic Snowpark DataFrames for testing purposes. Because Hypothesis can generate diverse, randomized test data, you can create Snowpark DataFrames with varying schemas and values to simulate real-world scenarios, which helps you verify the correctness of complex transformations and make your code more robust.

The Hypothesis strategy for Snowpark relies on pandera for generating synthetic data. The dataframe_strategy function uses the specified schema to generate a pandas DataFrame that conforms to it and then converts it into a Snowpark DataFrame.

Function signature:

def dataframe_strategy(
  schema: Union[str, DataFrameSchema],
  session: Session,
  size: Optional[int] = None
) -> SearchStrategy[DataFrame]

Function parameters:

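  • schema: The schema that defines the columns, data types, and checks that the generated Snowpark DataFrame should match

    The schema can be:

      • A path to a JSON schema file generated by the collect_dataframe_checkpoint function of the snowpark-checkpoints-collectors package

      • An instance of a pandera DataFrameSchema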

  • session: An instance of snowflake.snowpark.Session that will be used for creating the Snowpark DataFrames

  • size: The number of rows to generate for each Snowpark DataFrame

    If this parameter is not provided, the strategy will generate DataFrames of different sizes.

Function output:

Returns a Hypothesis SearchStrategy that generates Snowpark DataFrames

Supported and unsupported data types

The dataframe_strategy function supports the generation of Snowpark DataFrames with different data types, which vary depending on the type of the schema argument passed to the function. Note that the strategy will raise an exception if it finds an unsupported data type.

The following table lists the PySpark data types that the dataframe_strategy function does and does not support when a JSON file is passed as the schema argument:

PySpark data type                            Supported
-----------------                            ---------
Array                                        Yes
Boolean                                      Yes
Char                                         No
Date                                         Yes
DayTimeIntervalType                          No
Decimal                                      No
Map                                          No
Null                                         No
Byte, Short, Integer, Long, Float, Double    Yes
String                                       Yes
Struct                                       No
Timestamp                                    Yes
TimestampNTZ                                 Yes
Varchar                                      No
YearMonthIntervalType                        No
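Because the strategy raises an exception when it meets an unsupported data type, it can be useful to check the column types you expect before you write a test. The following sketch transcribes the table above into a dictionary and reports the columns that would be rejected. The find_unsupported helper and the column mapping are hypothetical, and the type names follow the spelling used in the table, which is an assumption about how the types appear in your own schema:

```python
# Support flags transcribed from the PySpark table above.
SUPPORTED_PYSPARK_TYPES = {
    "Array": True,
    "Boolean": True,
    "Char": False,
    "Date": True,
    "DayTimeIntervalType": False,
    "Decimal": False,
    "Map": False,
    "Null": False,
    "Byte": True,
    "Short": True,
    "Integer": True,
    "Long": True,
    "Float": True,
    "Double": True,
    "String": True,
    "Struct": False,
    "Timestamp": True,
    "TimestampNTZ": True,
    "Varchar": False,
    "YearMonthIntervalType": False,
}


def find_unsupported(column_types: dict[str, str]) -> dict[str, str]:
    """Return the columns whose type is listed as unsupported or is not listed at all."""
    return {
        column: type_name
        for column, type_name in column_types.items()
        if not SUPPORTED_PYSPARK_TYPES.get(type_name, False)
    }


# Hypothetical column-to-type mapping for illustration only.
columns = {"id": "Long", "amount": "Decimal", "tags": "Array", "meta": "Struct"}
print(find_unsupported(columns))
```

Type names that are missing from the table are treated as unsupported, which is the conservative choice for a pre-check.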

The following table shows the pandera data types supported by the dataframe_strategy function when a DataFrameSchema object is passed as the schema argument and the Snowpark data types they are mapped to:

Pandera data type       Snowpark data type
-----------------       ------------------
int8                    ByteType
int16                   ShortType
int32                   IntegerType
int64                   LongType
float32                 FloatType
float64                 DoubleType
string                  StringType
bool                    BooleanType
datetime64[ns, tz]      TimestampType(TZ)
datetime64[ns]          TimestampType(NTZ)
date                    DateType
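The mapping above can also be written as a lookup, for example to state which Snowpark types a test expects for a given set of pandera columns. This is a sketch only: the type names are plain strings copied from the table rather than Snowpark classes, and the expected_snowpark_types helper is hypothetical, not part of the package:

```python
# Pandera-to-Snowpark mapping transcribed from the table above (names as strings).
PANDERA_TO_SNOWPARK = {
    "int8": "ByteType",
    "int16": "ShortType",
    "int32": "IntegerType",
    "int64": "LongType",
    "float32": "FloatType",
    "float64": "DoubleType",
    "string": "StringType",
    "bool": "BooleanType",
    "datetime64[ns, tz]": "TimestampType(TZ)",
    "datetime64[ns]": "TimestampType(NTZ)",
    "date": "DateType",
}


def expected_snowpark_types(pandera_dtypes: dict[str, str]) -> dict[str, str]:
    """Map each column's pandera dtype name to the Snowpark type listed in the table."""
    unmapped = sorted(
        column
        for column, dtype in pandera_dtypes.items()
        if dtype not in PANDERA_TO_SNOWPARK
    )
    if unmapped:
        # Fail early, naming every column whose dtype is missing from the table.
        raise ValueError(f"No Snowpark mapping listed for columns: {unmapped}")
    return {column: PANDERA_TO_SNOWPARK[dtype] for column, dtype in pandera_dtypes.items()}
```

For example, calling expected_snowpark_types({"flag": "bool", "count": "int64"}) returns a dictionary that maps flag to BooleanType and count to LongType.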

Examples

The following procedure presents the typical workflow for using the Hypothesis library to generate Snowpark DataFrames:

  1. Create a standard Python test function with the different assertions or conditions your code should satisfy for all inputs.

  2. Add the Hypothesis @given decorator to your test function, and pass the dataframe_strategy function as an argument.

    For more information about the @given decorator, see hypothesis.given.

  3. Run the test function.

    Hypothesis automatically provides the generated inputs as arguments to the test.

Example 1: Generate Snowpark DataFrames from a JSON file

In this example, Snowpark DataFrames are generated from a JSON schema file generated by the collect_dataframe_checkpoint function of the snowpark-checkpoints-collectors package:

from hypothesis import given

from snowflake.hypothesis_snowpark import dataframe_strategy
from snowflake.snowpark import DataFrame, Session


@given(
    df=dataframe_strategy(
        schema="path/to/file.json",
        session=Session.builder.getOrCreate(),
        size=10,
    )
)
def test_my_function_from_json_file(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...

Example 2: Generate a Snowpark DataFrame from a pandera DataFrameSchema object

In this example, Snowpark DataFrames are generated from an instance of a pandera DataFrameSchema:

import pandera as pa

from hypothesis import given

from snowflake.hypothesis_snowpark import dataframe_strategy
from snowflake.snowpark import DataFrame, Session


@given(
    df=dataframe_strategy(
        schema=pa.DataFrameSchema(
            {
                "boolean_column": pa.Column(bool),
                "integer_column": pa.Column("int64", pa.Check.in_range(0, 9)),
                "float_column": pa.Column(pa.Float32, pa.Check.in_range(10.5, 20.5)),
            }
        ),
        session=Session.builder.getOrCreate(),
        size=10,
    )
)
def test_my_function_from_dataframeschema_object(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...

For more information, see Pandera DataFrameSchema.

Example 3: Customize the Hypothesis behavior

You can also customize how Hypothesis runs your test with the @settings decorator, which controls options such as the maximum number of generated examples, the deadline for each test execution, and the verbosity level.

The following example sets a deadline of 800 milliseconds for each execution and limits the run to 25 examples:

from datetime import timedelta

from hypothesis import given, settings
from snowflake.snowpark import DataFrame, Session

from snowflake.hypothesis_snowpark import dataframe_strategy


@given(
    df=dataframe_strategy(
        schema="path/to/file.json",
        session=Session.builder.getOrCreate(),
    )
)
@settings(
    deadline=timedelta(milliseconds=800),
    max_examples=25,
)
def test_my_function(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...

For more information, see Hypothesis settings.