Hybrid Execution¶
Snowpark pandas supports running workloads on mixed underlying execution engines and automatically moves data to the most appropriate engine for a given dataset size and operation. Currently, either Snowflake or local pandas can back a DataFrame object. Decisions on when to move data are dominated by dataset size.
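The size-driven decision can be pictured with a small, purely illustrative sketch. Neither `choose_engine` nor the constant below is part of the Snowpark pandas API; the sketch only models the idea that row count dominates backend choice, using the 10M-row local pandas default described later in this page.

```python
# Hypothetical sketch of a size-based engine decision.
# This is NOT the Snowpark pandas implementation; it only
# illustrates that dataset size dominates the backend choice.

LOCAL_PANDAS_MAX_ROWS = 10_000_000  # default local pandas limit from this page

def choose_engine(row_count: int) -> str:
    """Pick a backend purely by row count."""
    if row_count <= LOCAL_PANDAS_MAX_ROWS:
        return "pandas"      # small data: keep execution local
    return "snowflake"       # large data: push work to Snowflake

print(choose_engine(1_000))        # pandas
print(choose_engine(50_000_000))   # snowflake
```

In practice the real decision also weighs the specific operation and transfer costs, as described below.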
For Snowflake, specific API calls trigger hybrid backend evaluation. These are registered as either pre-operation or post-operation switch points. The set of switch points may change over time as the feature matures and as APIs are updated.
Only DataFrames with data types that are compatible with Snowflake can be moved to the Snowflake backend. If a DataFrame backed by local pandas contains an incompatible type (‘categorical’, for example), astype should be used to coerce that type before the backend can be changed.
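As a concrete illustration of the coercion step, the sketch below uses plain pandas semantics (the column name and values are made up for the example; the subsequent backend switch itself requires a Snowflake session and is not shown):

```python
import pandas as pd  # plain pandas, to illustrate the dtype coercion

# A 'categorical' column has no Snowflake-compatible equivalent,
# so it must be coerced before the DataFrame can change backends.
df = pd.DataFrame({"color": pd.Categorical(["red", "green", "red"])})
print(df["color"].dtype)   # category

# Coerce to a Snowflake-compatible type (strings) with astype.
df["color"] = df["color"].astype(str)
print(df["color"].dtype)   # object (plain strings)
```

After the coercion, every column holds a type that Snowflake can represent, so the DataFrame is eligible to move to the Snowflake backend.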
Example Pre-Operation Switchpoints:¶
apply, iterrows, itertuples, items, plot, quantile, __init__, T, read_csv, read_json, concat, merge
Many methods that are not yet implemented in Snowpark pandas are also registered as pre-operation switch points, and will automatically move data to local pandas for execution when called. This includes most methods that are ordinarily unsupported by Snowpark pandas and are marked N in their implementation status in the DataFrame and Series supported API lists.
Applying Snowpark functions like floor and abs, or Snowflake Cortex functions like snowflake.cortex.sentiment, to a DataFrame or Series on the local pandas backend will automatically move data to the Snowflake backend for execution.
Post-Operation Switchpoints:¶
read_snowflake, value_counts, tail, var, std, sum, sem, max, min, mean, agg, aggregate, count, nunique, cummax, cummin, cumprod, cumsum
Examples¶
Disabling or Enabling Hybrid Execution¶
Manually Changing DataFrame Backends¶
Configuring Local Pandas Backend¶
Currently the auto switching behavior is dominated by dataset size, with some exceptions for specific operations. The default limit for running workloads on the local pandas backend is 10M rows. This can be configured through the modin environment variables:
Configuring Transfer Costs¶
Transfer costs are also considered for data moving between engines. For data moving from Snowflake, this threshold can be configured with the SnowflakePandasTransferThreshold environment variable. It is set to 100,000 rows by default, and the movement of data is increasingly penalized as the dataset nears this threshold. The default may change in the future.
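For example, the threshold could be raised before launching a workload. This is a sketch only: the variable name is taken from the text above, 500,000 is an arbitrary illustrative value, and `my_workload.py` is a hypothetical script name.

```shell
# Raise the Snowflake transfer threshold from the default 100k rows
# to 500k rows (illustrative value) before starting Python.
export SnowflakePandasTransferThreshold=500000
python my_workload.py  # hypothetical workload script
```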
Debugging Hybrid Execution¶
pd.explain_switch() provides information on how execution engine decisions are made. This method prints a simplified summary of those decisions unless simple=False is passed as an argument.
Performance Considerations¶
Hybrid mode generally performs well with small datasets and traditional notebook workloads, but merge-heavy workloads using a star schema can move data too often, particularly when tables in the star schema straddle the transfer-cost boundary. Since the Snowflake warehouse is designed for these SQL-like workloads, turning off hybrid mode may be desirable in such cases.