Snowpark Checkpoints library: Hypothesis¶
Hypothesis Unit Testing¶
Hypothesis is a powerful testing library for Python that is designed to enhance traditional unit testing by generating a wide range of input data automatically. It uses property-based testing, where instead of specifying individual test cases, you can describe the expected behavior of your code with properties or conditions and Hypothesis generates examples to test those properties thoroughly. This approach helps uncover edge cases and unexpected behaviors, making it especially effective for complex functions. For more information, see Hypothesis.
The snowpark-checkpoints-hypothesis package extends the Hypothesis library to generate synthetic Snowpark DataFrames for testing purposes. By leveraging Hypothesis' ability to generate diverse and randomized test data, you can create Snowpark DataFrames with varying schemas and values to simulate real-world scenarios and uncover edge cases, ensuring robust code and verifying the correctness of complex transformations.
The Hypothesis strategy for Snowpark relies on Pandera for generating synthetic data. The dataframe_strategy function uses the specified schema to generate a Pandas DataFrame that conforms to it and then converts it into a Snowpark DataFrame.
Function signature:
def dataframe_strategy(
    schema: Union[str, DataFrameSchema],
    session: Session,
    size: Optional[int] = None
) -> SearchStrategy[DataFrame]
Function parameters:
schema: The schema that defines the columns, data types, and checks that the generated Snowpark DataFrame should match. The schema can be either:

- A path to a JSON schema file generated by the collect_dataframe_checkpoint function of the snowpark-checkpoints-collectors package.
- An instance of pandera.api.pandas.container.DataFrameSchema.

session: An instance of snowflake.snowpark.Session that is used to create the Snowpark DataFrames.

size: The number of rows to generate for each Snowpark DataFrame. If this parameter is not provided, the strategy generates DataFrames of different sizes.
Function output:
Returns a Hypothesis SearchStrategy that generates Snowpark DataFrames.
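Because the return value is a regular Hypothesis SearchStrategy, it is normally consumed by the @given decorator, but any SearchStrategy can also be sampled interactively with .example(). The sketch below uses st.integers as a stand-in, since the actual Snowpark strategy needs a live Snowflake session:

```python
from hypothesis import strategies as st

# st.integers stands in here for the strategy returned by
# dataframe_strategy, which requires a live Snowflake session
strategy = st.integers(min_value=0, max_value=9)

# Draw a single example from the strategy interactively
sample = strategy.example()
assert 0 <= sample <= 9
```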
Supported and unsupported data types¶
The dataframe_strategy function supports the generation of Snowpark DataFrames with different data types. The data types supported by the strategy vary depending on the type of the schema argument passed to the function. If the strategy encounters an unsupported data type, it raises an exception.

The following table shows the PySpark data types that are supported and unsupported by the dataframe_strategy function when a JSON file is passed as the schema argument.
| PySpark data type | Supported |
| --- | --- |
|  | Yes |
|  | Yes |
|  | No |
|  | Yes |
|  | No |
|  | No |
|  | No |
|  | No |
|  | Yes |
|  | Yes |
|  | No |
|  | Yes |
|  | Yes |
|  | No |
|  | No |
The following table shows the Pandera data types supported by the dataframe_strategy function when passing a DataFrameSchema object as the schema argument, and the Snowpark data types they are mapped to.
| Pandera data type | Snowpark data type |
| --- | --- |
| int8 | ByteType |
| int16 | ShortType |
| int32 | IntegerType |
| int64 | LongType |
| float32 | FloatType |
| float64 | DoubleType |
| string | StringType |
| bool | BooleanType |
| datetime64[ns, tz] | TimestampType (TIMESTAMP_TZ) |
| datetime64[ns] | TimestampType (TIMESTAMP_NTZ) |
| date | DateType |
Examples¶
The typical workflow for using the Hypothesis library to generate Snowpark DataFrames is as follows:

1. Create a standard Python test function with the different assertions or conditions your code should satisfy for all inputs.
2. Add the Hypothesis @given decorator to your test function and pass the dataframe_strategy function as an argument. For more information about the @given decorator, see hypothesis.given.
3. Run the test function. When the test is executed, Hypothesis automatically provides the generated Snowpark DataFrames as arguments to the test.
Example 1: Generate Snowpark DataFrames from a JSON file
Below is an example of how to generate Snowpark DataFrames from a JSON schema file generated by the collect_dataframe_checkpoint function of the snowpark-checkpoints-collectors package.
from hypothesis import given
from snowflake.hypothesis_snowpark import dataframe_strategy
from snowflake.snowpark import DataFrame, Session
@given(
df=dataframe_strategy(
schema="path/to/file.json",
session=Session.builder.getOrCreate(),
size=10,
)
)
def test_my_function_from_json_file(df: DataFrame):
# Test a particular function using the generated Snowpark DataFrame
...
Example 2: Generate a Snowpark DataFrame from a Pandera DataFrameSchema object
Below is an example of how to generate Snowpark DataFrames from an instance of a Pandera DataFrameSchema. For more information, see Pandera DataFrameSchema.
import pandera as pa
from hypothesis import given
from snowflake.hypothesis_snowpark import dataframe_strategy
from snowflake.snowpark import DataFrame, Session
@given(
df=dataframe_strategy(
schema=pa.DataFrameSchema(
{
"boolean_column": pa.Column(bool),
"integer_column": pa.Column("int64", pa.Check.in_range(0, 9)),
"float_column": pa.Column(pa.Float32, pa.Check.in_range(10.5, 20.5)),
}
),
session=Session.builder.getOrCreate(),
size=10,
)
)
def test_my_function_from_dataframeschema_object(df: DataFrame):
# Test a particular function using the generated Snowpark DataFrame
...
Example 3: Customize the Hypothesis behavior
You can also customize the behavior of your test with the Hypothesis @settings decorator, which lets you tune configuration parameters such as the maximum number of test cases, the deadline for each test execution, and verbosity levels. For more information, see Hypothesis settings.
from datetime import timedelta
from hypothesis import given, settings
from snowflake.snowpark import DataFrame, Session
from snowflake.hypothesis_snowpark import dataframe_strategy
@given(
df=dataframe_strategy(
schema="path/to/file.json",
session=Session.builder.getOrCreate(),
)
)
@settings(
deadline=timedelta(milliseconds=800),
max_examples=25,
)
def test_my_function(df: DataFrame):
# Test a particular function using the generated Snowpark DataFrame
...