snowflake.snowpark.Table.sample

Table.sample(frac: Optional[float] = None, n: Optional[int] = None, *, seed: Optional[int] = None, sampling_method: Optional[str] = None) → DataFrame

Samples rows based on either the number of rows to be returned or a percentage of rows to be returned.

Sampling with a seed is not supported on views or subqueries. Because this method operates directly on a table, it supports seed. This is the main difference between this method and DataFrame.sample().

Parameters:

frac – The fraction of rows to be sampled, given as a value in the range 0.0 to 1.0 (for example, 0.1 samples roughly 10% of the rows).
n – The fixed number of rows to sample in the range of 0 to 1,000,000 (inclusive). Either frac or n should be provided.
seed – Specifies a seed value to make the sampling deterministic. Can be any integer between 0 and 2147483647 inclusive. Default value is None.
sampling_method – Specifies the sampling method to use:
    - “BERNOULLI” (or “ROW”): Includes each row with a probability of p/100, where p is the requested sampling percentage. Similar to flipping a weighted coin for each row.
    - “SYSTEM” (or “BLOCK”): Includes each block of rows with a probability of p/100. Similar to flipping a weighted coin for each block of rows. This method does not support fixed-size sampling.
  Default value is None, in which case Snowflake uses “ROW”.
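The parameter constraints above can be checked before a query is issued. The following is a minimal sketch, not part of Snowpark: `sample_table` is a hypothetical helper, `table` is assumed to be a `snowflake.snowpark.Table` (for example the result of `session.table(...)`), and `frac` is assumed to be a fraction between 0.0 and 1.0. The sketch insists on exactly one of `frac` or `n`, which is a choice made here for clarity.

```python
from typing import Any, Optional

# Method names listed in the parameter description above.
ROW_METHODS = {"BERNOULLI", "ROW"}
BLOCK_METHODS = {"SYSTEM", "BLOCK"}


def sample_table(
    table: Any,
    frac: Optional[float] = None,
    n: Optional[int] = None,
    *,
    seed: Optional[int] = None,
    sampling_method: Optional[str] = None,
) -> Any:
    """Hypothetical helper: validate arguments, then call table.sample().

    `table` is assumed to be a snowflake.snowpark.Table.
    """
    # Assumption of this sketch: exactly one of frac / n is given.
    if (frac is None) == (n is None):
        raise ValueError("provide exactly one of frac or n")
    # Assumption: frac is a fraction, not a 0-100 percentage.
    if frac is not None and not 0.0 <= frac <= 1.0:
        raise ValueError("frac must be between 0.0 and 1.0")
    if n is not None and not 0 <= n <= 1_000_000:
        raise ValueError("n must be between 0 and 1,000,000")
    if seed is not None and not 0 <= seed <= 2_147_483_647:
        raise ValueError("seed must be between 0 and 2147483647")
    if sampling_method is not None:
        method = sampling_method.upper()
        if method not in ROW_METHODS | BLOCK_METHODS:
            raise ValueError(f"unknown sampling_method: {sampling_method!r}")
        # Block sampling does not support fixed-size sampling.
        if n is not None and method in BLOCK_METHODS:
            raise ValueError("SYSTEM | BLOCK cannot be combined with n")
    return table.sample(frac=frac, n=n, seed=seed, sampling_method=sampling_method)
```

With a live session, a call might look like `sample_table(session.table("MY_TABLE"), frac=0.1, seed=42, sampling_method="BERNOULLI")`, where `MY_TABLE` is a placeholder table name.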

Note

SYSTEM | BLOCK sampling is often faster than BERNOULLI | ROW sampling.
Sampling without a seed is often faster than sampling with a seed.
Fixed-size sampling can be slower than equivalent fraction-based sampling because fixed-size sampling prevents some query optimization.
Fixed-size sampling doesn’t work with SYSTEM | BLOCK sampling.
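To make the notes above concrete, the sketch below builds the kind of Snowflake SQL `SAMPLE` clause that these options correspond to. It is an illustration only: `sample_clause` is a hypothetical function, the output is not necessarily the exact SQL that Snowpark generates, and `frac` is assumed to be a fraction that maps to the percentage p as `frac * 100`.

```python
from typing import Optional


def sample_clause(
    frac: Optional[float] = None,
    n: Optional[int] = None,
    *,
    seed: Optional[int] = None,
    sampling_method: Optional[str] = None,
) -> str:
    """Hypothetical helper: render a Snowflake SAMPLE clause as a string."""
    if (frac is None) == (n is None):
        raise ValueError("provide exactly one of frac or n")
    method = (sampling_method or "").upper()
    if n is not None:
        # Fixed-size sampling is not available with block sampling.
        if method in ("SYSTEM", "BLOCK"):
            raise ValueError("SYSTEM | BLOCK cannot be combined with n")
        size = f"{n} ROWS"
    else:
        # Assumption: frac is a fraction, so the percentage p is frac * 100.
        size = f"{frac * 100:g}"
    parts = ["SAMPLE"]
    if method:
        parts.append(method)
    parts.append(f"({size})")
    if seed is not None:
        parts.append(f"SEED ({seed})")
    return " ".join(parts)


print(sample_clause(frac=0.25, seed=42, sampling_method="bernoulli"))
print(sample_clause(n=100))
```

Writing the clause out this way shows why fraction-based and fixed-size sampling are distinct requests: one passes a probability, the other a row count.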