Snowpark Checkpoints library: Hypothesis¶

Hypothesis Unit Testing¶

Hypothesis is a powerful testing library for Python that is designed to enhance traditional unit testing by generating a wide range of input data automatically. It uses property-based testing, where instead of specifying individual test cases, you can describe the expected behavior of your code with properties or conditions and Hypothesis generates examples to test those properties thoroughly. This approach helps uncover edge cases and unexpected behaviors, making it especially effective for complex functions. For more information, see Hypothesis.
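The core idea is easy to see without any Snowpark machinery: a property is a condition that must hold for every valid input, and the framework supplies many randomized inputs instead of a handful of hand-written cases. The following stdlib-only sketch manually imitates what Hypothesis automates (Hypothesis additionally shrinks any failing input to a minimal reproduction); the function and property names are illustrative, not part of any library:

```python
import random

def sort_preserves_length(xs: list) -> bool:
    # Property: sorting never changes the number of elements.
    return len(sorted(xs)) == len(xs)

# Manual property check over randomized inputs. This loop is what
# Hypothesis automates for you, with smarter input generation and
# automatic shrinking of failing examples.
rng = random.Random(0)
for _ in range(100):
    xs = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]
    assert sort_preserves_length(xs), f"property failed for {xs}"
```

With Hypothesis, the loop and input generation disappear: you decorate the test with @given and a strategy, and the library drives the inputs.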

The snowpark-checkpoints-hypothesis package extends the Hypothesis library to generate synthetic Snowpark DataFrames for testing purposes. By leveraging Hypothesis’ ability to generate diverse and randomized test data, you can create Snowpark DataFrames with varying schemas and values to simulate real-world scenarios and uncover edge cases, ensuring robust code and verifying the correctness of complex transformations.

The Hypothesis strategy for Snowpark relies on Pandera for generating synthetic data. The dataframe_strategy function uses the specified schema to generate a Pandas DataFrame that conforms to it and then converts it into a Snowpark DataFrame.

Function signature:

def dataframe_strategy(
  schema: Union[str, DataFrameSchema],
  session: Session,
  size: Optional[int] = None
) -> SearchStrategy[DataFrame]

Function parameters:

  • schema: The schema that defines the columns, data types and checks that the generated Snowpark DataFrame should match. The schema can be either a path to a JSON schema file or a Pandera DataFrameSchema object.

  • session: An instance of snowflake.snowpark.Session that will be used for creating the Snowpark DataFrames.

  • size: The number of rows to generate for each Snowpark DataFrame. If this parameter is not provided, the strategy will generate DataFrames of different sizes.

Function output:

Returns a Hypothesis SearchStrategy that generates Snowpark DataFrames.

Supported and unsupported data types¶

The dataframe_strategy function supports the generation of Snowpark DataFrames with different data types. The data types supported by the strategy depend on the type of the schema argument passed to the function. If the strategy encounters an unsupported data type, it raises an exception.

The following table shows the supported and unsupported PySpark data types by the dataframe_strategy function when passing a JSON file as the schema argument.

PySpark data type                          Supported
-----------------------------------------  ---------
Array                                      Yes
Boolean                                    Yes
Char                                       No
Date                                       Yes
DayTimeIntervalType                        No
Decimal                                    No
Map                                        No
Null                                       No
Byte, Short, Integer, Long, Float, Double  Yes
String                                     Yes
Struct                                     No
Timestamp                                  Yes
TimestampNTZ                               Yes
Varchar                                    No
YearMonthIntervalType                      No

The following table shows the Pandera data types supported by the dataframe_strategy function when passing a DataFrameSchema object as the schema argument and the Snowpark data types they are mapped to.

Pandera data type    Snowpark data type
-------------------  ------------------
int8                 ByteType
int16                ShortType
int32                IntegerType
int64                LongType
float32              FloatType
float64              DoubleType
string               StringType
bool                 BooleanType
datetime64[ns, tz]   TimestampType(TZ)
datetime64[ns]       TimestampType(NTZ)
date                 DateType
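For reference in your own tests, the mapping above can be captured as a plain dictionary. Note that this dictionary is only a restatement of the table; it is not exported by the snowpark-checkpoints-hypothesis package:

```python
# Pandera dtype -> Snowpark type name, per the mapping table above.
# Illustrative only; not part of the library's API.
PANDERA_TO_SNOWPARK = {
    "int8": "ByteType",
    "int16": "ShortType",
    "int32": "IntegerType",
    "int64": "LongType",
    "float32": "FloatType",
    "float64": "DoubleType",
    "string": "StringType",
    "bool": "BooleanType",
    "datetime64[ns, tz]": "TimestampType(TZ)",
    "datetime64[ns]": "TimestampType(NTZ)",
    "date": "DateType",
}
```

A table like this can be handy when asserting that a generated DataFrame's schema matches what you expect for a given Pandera dtype.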

Examples¶

The typical workflow for using the Hypothesis library to generate Snowpark DataFrames is as follows:

  1. Create a standard Python test function with the different assertions or conditions your code should satisfy for all inputs.

  2. Add the Hypothesis @given decorator to your test function and pass a strategy created with the dataframe_strategy function as an argument. For more information about the @given decorator, see hypothesis.given.

  3. Run the test function. When the test is executed, Hypothesis will automatically provide the generated inputs as arguments to the test.

Example 1: Generate Snowpark DataFrames from a JSON file

Below is an example of how to generate Snowpark DataFrames from a JSON schema file generated by the collect_dataframe_checkpoint function of the snowpark-checkpoints-collectors package.

from hypothesis import given

from snowflake.hypothesis_snowpark import dataframe_strategy
from snowflake.snowpark import DataFrame, Session


@given(
    df=dataframe_strategy(
        schema="path/to/file.json",
        session=Session.builder.getOrCreate(),
        size=10,
    )
)
def test_my_function_from_json_file(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...

Example 2: Generate a Snowpark DataFrame from a Pandera DataFrameSchema object

Below is an example of how to generate Snowpark DataFrames from an instance of a Pandera DataFrameSchema. For more information, see Pandera DataFrameSchema.

import pandera as pa

from hypothesis import given

from snowflake.hypothesis_snowpark import dataframe_strategy
from snowflake.snowpark import DataFrame, Session


@given(
    df=dataframe_strategy(
        schema=pa.DataFrameSchema(
            {
                "boolean_column": pa.Column(bool),
                "integer_column": pa.Column("int64", pa.Check.in_range(0, 9)),
                "float_column": pa.Column(pa.Float32, pa.Check.in_range(10.5, 20.5)),
            }
        ),
        session=Session.builder.getOrCreate(),
        size=10,
    )
)
def test_my_function_from_dataframeschema_object(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...

Example 3: Customize the Hypothesis behavior

You can also customize the behavior of your test with the Hypothesis @settings decorator, which controls configuration such as the maximum number of test cases, the deadline for each test execution, and verbosity levels. For more information, see Hypothesis settings.

from datetime import timedelta

from hypothesis import given, settings
from snowflake.snowpark import DataFrame, Session

from snowflake.hypothesis_snowpark import dataframe_strategy


@given(
    df=dataframe_strategy(
        schema="path/to/file.json",
        session=Session.builder.getOrCreate(),
    )
)
@settings(
    deadline=timedelta(milliseconds=800),
    max_examples=25,
)
def test_my_function(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...