Snowpark Checkpoints library: Hypothesis

Hypothesis unit testing

Hypothesis is a powerful testing library for Python that is designed to enhance traditional unit testing by generating a wide range of input data automatically. It uses property-based testing, where instead of specifying individual test cases, you can describe the expected behavior of your code with properties or conditions, and Hypothesis generates examples to test those properties thoroughly. This approach helps uncover edge cases and unexpected behaviors, making it especially effective for complex functions. For more information, see Hypothesis.
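
As a minimal sketch of the property-based style described above, the following test states three properties of a small function and lets Hypothesis generate the inputs. The normalize function is a hypothetical example introduced only for illustration; given and strategies are part of the Hypothesis API.

```python
from hypothesis import given, strategies as st


def normalize(values: list[int]) -> list[int]:
    """Hypothetical function under test: sort the values and drop duplicates."""
    return sorted(set(values))


@given(st.lists(st.integers()))
def test_normalize_properties(values: list[int]) -> None:
    result = normalize(values)
    # Property 1: normalizing an already normalized list changes nothing.
    assert normalize(result) == result
    # Property 2: the output is strictly ascending, so it has no duplicates.
    assert all(a < b for a, b in zip(result, result[1:]))
    # Property 3: no value is invented or lost.
    assert set(result) == set(values)


# Calling the decorated function runs it against many generated lists.
test_normalize_properties()
```

No individual input list is written by hand; if any property fails, Hypothesis reports a small failing example.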

The snowpark-checkpoints-hypothesis package extends the Hypothesis library to generate synthetic Snowpark DataFrames for testing. Because Hypothesis generates diverse, randomized test data, you can create Snowpark DataFrames with varying schemas and values to simulate real-world scenarios, which helps you find edge cases and verify the correctness of complex transformations.

The Hypothesis strategy for Snowpark relies on pandera for generating synthetic data. The dataframe_strategy function uses the specified schema to generate a pandas DataFrame that conforms to it and then converts it into a Snowpark DataFrame.

Function signature:

def dataframe_strategy(
  schema: Union[str, DataFrameSchema],
  session: Session,
  size: Optional[int] = None
) -> SearchStrategy[DataFrame]

Function parameters:

schema: The schema that defines the columns, data types, and checks that the generated Snowpark DataFrame should match. The schema can be:
- A path to a JSON schema file generated by the collect_dataframe_checkpoint function of the snowpark-checkpoints-collectors package
- An instance of pandera.api.pandas.container.DataFrameSchema

session: An instance of snowflake.snowpark.Session that will be used for creating the Snowpark DataFrames.

size: The number of rows to generate for each Snowpark DataFrame. If this parameter is not provided, the strategy generates DataFrames of different sizes.

Function output:

Returns a Hypothesis SearchStrategy that generates Snowpark DataFrames.

Supported and unsupported data types

The dataframe_strategy function supports the generation of Snowpark DataFrames with different data types, which vary depending on the type of the schema argument passed to the function. Note that the strategy will raise an exception if it finds an unsupported data type.

The following table shows which PySpark data types the dataframe_strategy function supports when a JSON file is passed as the schema argument:

PySpark data type	Supported
Array	Yes
Boolean	Yes
Char	No
Date	Yes
DayTimeIntervalType	No
Decimal	No
Map	No
Null	No
Byte, Short, Integer, Long, Float, Double	Yes
String	Yes
Struct	No
Timestamp	Yes
TimestampNTZ	Yes
Varchar	No
YearMonthIntervalType	No
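
Because the strategy raises an exception on an unsupported data type, it can be useful to check type names ahead of time. The following is a hypothetical helper, not part of the snowpark-checkpoints-hypothesis package: it copies the table above into a dictionary and reports which type names are unsupported. The column names and the way type names are obtained are assumptions made for the example.

```python
# Support status copied from the table above (JSON schema case).
PYSPARK_TYPE_SUPPORT = {
    "Array": True,
    "Boolean": True,
    "Char": False,
    "Date": True,
    "DayTimeIntervalType": False,
    "Decimal": False,
    "Map": False,
    "Null": False,
    "Byte": True,
    "Short": True,
    "Integer": True,
    "Long": True,
    "Float": True,
    "Double": True,
    "String": True,
    "Struct": False,
    "Timestamp": True,
    "TimestampNTZ": True,
    "Varchar": False,
    "YearMonthIntervalType": False,
}


def find_unsupported(type_names):
    """Return the names marked unsupported or missing from the table, in input order."""
    return [name for name in type_names if not PYSPARK_TYPE_SUPPORT.get(name, False)]


# Hypothetical column-to-type mapping for a DataFrame you plan to generate.
columns = {"id": "Long", "price": "Decimal", "tags": "Array", "attrs": "Map"}
problems = find_unsupported(columns.values())
```

Treating names that are absent from the table as unsupported is a conservative choice made for this sketch.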

The following table shows the pandera data types supported by the dataframe_strategy function when a DataFrameSchema object is passed as the schema argument and the Snowpark data types they are mapped to:

Pandera data type	Snowpark data type
int8	ByteType
int16	ShortType
int32	IntegerType
int64	LongType
float32	FloatType
float64	DoubleType
string	StringType
bool	BooleanType
datetime64[ns, tz]	TimestampType(TZ)
datetime64[ns]	TimestampType(NTZ)
date	DateType
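
The mapping in the table can be written down as a plain lookup, which is handy when you want to predict the Snowpark column types that a given DataFrameSchema will produce. This is a sketch under assumptions: snowpark_type_name is a hypothetical helper, the values are the type names from the table as strings rather than real Snowpark type objects, and the regular expression that matches timezone-aware dtypes is an assumption about how the "datetime64[ns, tz]" row should be read.

```python
import re

# Mapping copied from the table above; values are names, not Snowpark objects.
PANDERA_TO_SNOWPARK = {
    "int8": "ByteType",
    "int16": "ShortType",
    "int32": "IntegerType",
    "int64": "LongType",
    "float32": "FloatType",
    "float64": "DoubleType",
    "string": "StringType",
    "bool": "BooleanType",
    "datetime64[ns]": "TimestampType(NTZ)",
    "date": "DateType",
}

# Timezone-aware dtypes such as "datetime64[ns, UTC]" carry a zone name.
_TZ_AWARE = re.compile(r"^datetime64\[ns, .+\]$")


def snowpark_type_name(pandera_dtype: str) -> str:
    """Return the Snowpark type name for a pandera dtype string from the table."""
    if _TZ_AWARE.match(pandera_dtype):
        return "TimestampType(TZ)"
    try:
        return PANDERA_TO_SNOWPARK[pandera_dtype]
    except KeyError:
        raise ValueError(f"Unsupported pandera data type: {pandera_dtype}") from None
```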

Examples

The following procedure presents the typical workflow for using the Hypothesis library to generate Snowpark DataFrames:

1. Create a standard Python test function with the assertions or conditions that your code should satisfy for all inputs.
2. Add the Hypothesis @given decorator to your test function, and pass the strategy returned by the dataframe_strategy function as an argument. For more information about the @given decorator, see hypothesis.given.
3. Run the test function. Hypothesis automatically provides the generated inputs as arguments to the test.

Example 1: Generate Snowpark DataFrames from a JSON file

In this example, Snowpark DataFrames are generated from a JSON schema file generated by the collect_dataframe_checkpoint function of the snowpark-checkpoints-collectors package:

from hypothesis import given

from snowflake.hypothesis_snowpark import dataframe_strategy
from snowflake.snowpark import DataFrame, Session


@given(
    df=dataframe_strategy(
        schema="path/to/file.json",
        session=Session.builder.getOrCreate(),
        size=10,
    )
)
def test_my_function_from_json_file(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...

Example 2: Generate Snowpark DataFrames from a pandera DataFrameSchema object

In this example, Snowpark DataFrames are generated from an instance of a pandera DataFrameSchema:

import pandera as pa

from hypothesis import given

from snowflake.hypothesis_snowpark import dataframe_strategy
from snowflake.snowpark import DataFrame, Session


@given(
    df=dataframe_strategy(
        schema=pa.DataFrameSchema(
            {
                "boolean_column": pa.Column(bool),
                "integer_column": pa.Column("int64", pa.Check.in_range(0, 9)),
                "float_column": pa.Column(pa.Float32, pa.Check.in_range(10.5, 20.5)),
            }
        ),
        session=Session.builder.getOrCreate(),
        size=10,
    )
)
def test_my_function_from_dataframeschema_object(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...

For more information, see Pandera DataFrameSchema.

Example 3: Customize the Hypothesis behavior

You can also customize the behavior of your test with the Hypothesis @settings decorator, which controls configuration such as the maximum number of test cases, the deadline for each test execution, and the verbosity level:

from datetime import timedelta

from hypothesis import given, settings
from snowflake.snowpark import DataFrame, Session

from snowflake.hypothesis_snowpark import dataframe_strategy


@given(
    df=dataframe_strategy(
        schema="path/to/file.json",
        session=Session.builder.getOrCreate(),
    )
)
@settings(
    deadline=timedelta(milliseconds=800),
    max_examples=25,
)
def test_my_function(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...

For more information, see Hypothesis settings.