DataFrame.randomSplit(weights: List[float], seed: Optional[int] = None, *, statement_params: Optional[Dict[str, str]] = None) → List[DataFrame]

Randomly splits the current DataFrame into separate DataFrames, using the specified weights.

  • weights – Weights to use for splitting the DataFrame. If the weights don’t add up to 1, the weights will be normalized. Every number in weights has to be positive. If only one weight is specified, the returned DataFrame list only includes the current DataFrame.

  • seed – The seed for sampling.

  • statement_params – Dictionary of statement level parameters to be set while executing this action.
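The normalization described for weights can be illustrated with plain arithmetic. This is a minimal sketch, not Snowpark's implementation: normalize_weights is a hypothetical helper introduced here, and it assumes normalization means dividing each weight by the sum of all weights.

```python
from typing import List


def normalize_weights(weights: List[float]) -> List[float]:
    """Scale positive weights so they sum to 1.

    Hypothetical helper for illustration only; it is not part of Snowpark.
    """
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive numbers")
    total = sum(weights)
    # Assumption: each weight is divided by the total of all weights.
    return [w / total for w in weights]


# The weights 0.1, 0.2, 0.3 sum to 0.6, so they are rescaled.
fractions = normalize_weights([0.1, 0.2, 0.3])
```

Under that assumption, weights of [0.1, 0.2, 0.3] correspond to roughly one sixth, one third, and one half of the rows.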


>>> df = session.range(10000)
>>> weights = [0.1, 0.2, 0.3]
>>> df_parts = df.random_split(weights)
>>> len(df_parts) == len(weights)
True


1. When multiple weights are specified, the current DataFrame will be cached before being split.

2. When a weight or a normalized weight is less than 1e-6, the corresponding split DataFrame will be empty.
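
To check ahead of time which splits the second note would leave empty, the rule can be applied to the weights directly. This is a rough sketch under stated assumptions: splits_expected_empty and EMPTY_SPLIT_THRESHOLD are hypothetical names, the normalized weight is assumed to be the weight divided by the sum of all weights, and the exact comparison Snowpark performs internally may differ.

```python
from typing import List

# Threshold taken from the note above; the constant name is hypothetical.
EMPTY_SPLIT_THRESHOLD = 1e-6


def splits_expected_empty(weights: List[float]) -> List[bool]:
    """Flag each weight whose raw or normalized value is below the threshold.

    Hypothetical helper for illustration only; it is not part of Snowpark.
    """
    total = sum(weights)
    return [
        w < EMPTY_SPLIT_THRESHOLD or (w / total) < EMPTY_SPLIT_THRESHOLD
        for w in weights
    ]


# A raw weight below the threshold.
tiny_raw = splits_expected_empty([1e-7, 0.5, 0.5])

# A raw weight of 1.0 that becomes tiny only after normalization.
tiny_normalized = splits_expected_empty([1.0, 2_000_000.0])
```

The second call shows why the note mentions both cases: a weight that looks large on its own can still fall below the threshold once it is divided by a much larger total.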