snowflake.snowpark.DataFrame.randomSplit

DataFrame.randomSplit(weights: List[float], seed: Optional[int] = None, *, statement_params: Optional[Dict[str, str]] = None) → List[DataFrame]

Randomly splits the current DataFrame into separate DataFrames, using the specified weights.

Parameters:

weights – Weights to use for splitting the DataFrame. If the weights don’t add up to 1, they are normalized. Every weight must be positive. If only one weight is specified, the returned list contains only the current DataFrame.
seed – The seed used by the random number generator for splitting.

Caution

By default, reusing a seed value doesn’t guarantee reproducible results.
statement_params – Dictionary of statement level parameters to be set while executing this action.

Example:

>>> df = session.range(10000)
>>> weights = [0.1, 0.2, 0.3]
>>> df_parts = df.random_split(weights)
>>> len(df_parts) == len(weights)
True
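
The weight handling described for the weights parameter can be illustrated in plain Python. This is a minimal sketch of the documented behavior, not Snowpark's implementation: the helper names normalize_weights and empty_split_flags and the constant EMPTY_SPLIT_THRESHOLD are hypothetical, and the 1e-6 cutoff is taken from the notes on this page.

```python
from typing import List, Sequence

# Cutoff quoted in the notes on this page; the constant name is hypothetical.
EMPTY_SPLIT_THRESHOLD = 1e-6


def normalize_weights(weights: Sequence[float]) -> List[float]:
    """Scale positive weights so that they sum to 1."""
    if not weights:
        raise ValueError("weights must not be empty")
    if any(w <= 0 for w in weights):
        raise ValueError("every weight must be positive")
    total = sum(weights)
    return [w / total for w in weights]


def empty_split_flags(weights: Sequence[float]) -> List[bool]:
    """Flag splits whose raw or normalized weight is below the cutoff."""
    normalized = normalize_weights(weights)
    return [
        raw < EMPTY_SPLIT_THRESHOLD or norm < EMPTY_SPLIT_THRESHOLD
        for raw, norm in zip(weights, normalized)
    ]


# The weights from the example sum to 0.6, so they are rescaled to
# roughly 1/6, 1/3 and 1/2.
normalized = normalize_weights([0.1, 0.2, 0.3])
print(normalized)

# A very small weight next to much larger ones falls below the cutoff.
flags = empty_split_flags([1e-7, 0.5, 0.5])
print(flags)
```

Checking weights this way before calling random_split may help catch splits that would come back empty.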

Note

1. When multiple weights are specified, the current DataFrame will be cached before being split.

2. When a weight or a normalized weight is less than 1e-6, the corresponding split DataFrame will be empty.

3. To get reproducible seeding behavior, configure the DataFrame’s Session to use simplified querying:

>>> session.conf.set("use_simplified_query_generation", True)
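
The configuration step and a seeded split can be combined in a small wrapper. This is a sketch under stated assumptions: split_with_seed is a hypothetical helper, not part of Snowpark, and it uses only the session.conf.set call and the random_split method with its seed parameter shown on this page. The session and df arguments are assumed to be an existing Snowpark Session and DataFrame.

```python
from typing import Any, List, Optional, Sequence


def split_with_seed(
    session: Any,
    df: Any,
    weights: Sequence[float],
    seed: Optional[int] = None,
) -> List[Any]:
    """Enable simplified query generation, then split df with a seed.

    Hypothetical helper: `session` is assumed to expose `conf.set` and
    `df` is assumed to expose `random_split`, as documented on this page.
    """
    # Note 3: seeding is only reproducible with this setting enabled.
    session.conf.set("use_simplified_query_generation", True)
    return df.random_split(list(weights), seed=seed)
```

With a live session, a call such as split_with_seed(session, df, [0.8, 0.2], seed=42) would return two DataFrames, one per weight.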