Snowpark Checkpoints library: Hypothesis¶

Hypothesis Unit Testing¶

Hypothesis is a powerful testing library for Python that is designed to enhance traditional unit testing by generating a wide range of input data automatically. It uses property-based testing, where instead of specifying individual test cases, you can describe the expected behavior of your code with properties or conditions and Hypothesis generates examples to test those properties thoroughly. This approach helps uncover edge cases and unexpected behaviors, making it especially effective for complex functions. For more information, see Hypothesis.
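The core idea is easy to see without any Snowpark machinery: a property is a condition that must hold for every valid input, and the framework supplies many randomized inputs instead of a handful of hand-written cases. The following stdlib-only sketch manually imitates what Hypothesis automates (Hypothesis additionally shrinks any failing input to a minimal reproduction); the function and property names are illustrative, not part of any library:

```python
import random

def sort_preserves_length(xs: list) -> bool:
    # Property: sorting never changes the number of elements.
    return len(sorted(xs)) == len(xs)

# Manual property check over randomized inputs. This loop is what
# Hypothesis automates for you, with smarter input generation and
# automatic shrinking of failing examples.
rng = random.Random(0)
for _ in range(100):
    xs = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]
    assert sort_preserves_length(xs), f"property failed for {xs}"
```

With Hypothesis, the loop and input generation disappear: you decorate the test with @given and a strategy, and the library drives the inputs.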

The snowpark-checkpoints-hypothesis package extends the Hypothesis library to generate synthetic Snowpark DataFrames for testing purposes. By leveraging Hypothesis’ ability to generate diverse and randomized test data, you can create Snowpark DataFrames with varying schemas and values to simulate real-world scenarios and uncover edge cases, ensuring robust code and verifying the correctness of complex transformations.

The Hypothesis strategy for Snowpark relies on Pandera for generating synthetic data. The dataframe_strategy function uses the specified schema to generate a Pandas DataFrame that conforms to it and then converts it into a Snowpark DataFrame.

Function signature:

def dataframe_strategy(
  schema: Union[str, DataFrameSchema],
  session: Session,
  size: Optional[int] = None
) -> SearchStrategy[DataFrame]

Function parameters:

  • schema: The schema that defines the columns, data types and checks that the generated Snowpark DataFrame should match. The schema can be either a path to a JSON schema file or a Pandera DataFrameSchema object.

  • session: An instance of snowflake.snowpark.Session that will be used for creating the Snowpark DataFrames.

  • size: The number of rows to generate for each Snowpark DataFrame. If this parameter is not provided, the strategy will generate DataFrames of different sizes.

Function output:

Returns a Hypothesis SearchStrategy that generates Snowpark DataFrames.

Supported and unsupported data types¶

The dataframe_strategy function supports the generation of Snowpark DataFrames with different data types. The data types supported by the strategy depend on the type of the schema argument passed to the function. If the strategy encounters an unsupported data type, it raises an exception.

The following table shows the supported and unsupported PySpark data types by the dataframe_strategy function when passing a JSON file as the schema argument.

PySpark data type                          Supported
-----------------------------------------  ---------
Array                                      Yes
Boolean                                    Yes
Char                                       No
Date                                       Yes
DayTimeIntervalType                        No
Decimal                                    No
Map                                        No
Null                                       No
Byte, Short, Integer, Long, Float, Double  Yes
String                                     Yes
Struct                                     No
Timestamp                                  Yes
TimestampNTZ                               Yes
Varchar                                    No
YearMonthIntervalType                      No

The following table shows the Pandera data types supported by the dataframe_strategy function when passing a DataFrameSchema object as the schema argument and the Snowpark data types they are mapped to.

Pandera data type    Snowpark data type
-------------------  ------------------
int8                 ByteType
int16                ShortType
int32                IntegerType
int64                LongType
float32              FloatType
float64              DoubleType
string               StringType
bool                 BooleanType
datetime64[ns, tz]   TimestampType(TZ)
datetime64[ns]       TimestampType(NTZ)
date                 DateType
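For reference in your own tests, the mapping above can be captured as a plain dictionary. Note that this dictionary is only a restatement of the table; it is not exported by the snowpark-checkpoints-hypothesis package:

```python
# Pandera dtype -> Snowpark type name, per the mapping table above.
# Illustrative only; not part of the library's API.
PANDERA_TO_SNOWPARK = {
    "int8": "ByteType",
    "int16": "ShortType",
    "int32": "IntegerType",
    "int64": "LongType",
    "float32": "FloatType",
    "float64": "DoubleType",
    "string": "StringType",
    "bool": "BooleanType",
    "datetime64[ns, tz]": "TimestampType(TZ)",
    "datetime64[ns]": "TimestampType(NTZ)",
    "date": "DateType",
}
```

A table like this can be handy when asserting that a generated DataFrame's schema matches what you expect for a given Pandera dtype.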

Examples¶

The typical workflow for using the Hypothesis library to generate Snowpark DataFrames is as follows:

  1. Create a standard Python test function with the different assertions or conditions your code should satisfy for all inputs.

  2. Add the Hypothesis @given decorator to your test function and pass a strategy created with the dataframe_strategy function as an argument. For more information about the @given decorator, see hypothesis.given.

  3. Run the test function. When the test is executed, Hypothesis will automatically provide the generated inputs as arguments to the test.

Example 1: Generate Snowpark DataFrames from a JSON file

Below is an example of how to generate Snowpark DataFrames from a JSON schema file generated by the collect_dataframe_checkpoint function of the snowpark-checkpoints-collectors package.

from hypothesis import given

from snowflake.hypothesis_snowpark import dataframe_strategy
from snowflake.snowpark import DataFrame, Session


@given(
    df=dataframe_strategy(
        schema="path/to/file.json",
        session=Session.builder.getOrCreate(),
        size=10,
    )
)
def test_my_function_from_json_file(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...

Example 2: Generate a Snowpark DataFrame from a Pandera DataFrameSchema object

Below is an example of how to generate Snowpark DataFrames from an instance of a Pandera DataFrameSchema. For more information, see Pandera DataFrameSchema.

import pandera as pa

from hypothesis import given

from snowflake.hypothesis_snowpark import dataframe_strategy
from snowflake.snowpark import DataFrame, Session


@given(
    df=dataframe_strategy(
        schema=pa.DataFrameSchema(
            {
                "boolean_column": pa.Column(bool),
                "integer_column": pa.Column("int64", pa.Check.in_range(0, 9)),
                "float_column": pa.Column(pa.Float32, pa.Check.in_range(10.5, 20.5)),
            }
        ),
        session=Session.builder.getOrCreate(),
        size=10,
    )
)
def test_my_function_from_dataframeschema_object(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...

Example 3: Customize the Hypothesis behavior

You can also customize the behavior of your test with the Hypothesis @settings decorator, which controls configuration such as the maximum number of test cases, the deadline for each test execution, and verbosity levels. For more information, see Hypothesis settings.

from datetime import timedelta

from hypothesis import given, settings
from snowflake.snowpark import DataFrame, Session

from snowflake.hypothesis_snowpark import dataframe_strategy


@given(
    df=dataframe_strategy(
        schema="path/to/file.json",
        session=Session.builder.getOrCreate(),
    )
)
@settings(
    deadline=timedelta(milliseconds=800),
    max_examples=25,
)
def test_my_function(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...