You are viewing documentation about an older version (1.4.0).

snowflake.snowpark.Table.sample

Table.sample(frac: float | None = None, n: int | None = None, *, seed: float | None = None, sampling_method: str | None = None) → DataFrame

Samples rows based on either the number of rows to be returned or a percentage of rows to be returned.

Sampling with a seed is not supported on views or subqueries. Because this method operates on tables, it supports a seed; this is the main difference between this method and DataFrame.sample().

Parameters:
  • frac – The fraction of rows to be sampled, given as a value between 0.0 and 1.0 (for example, 0.1 samples about 10% of the rows).

  • n – The fixed number of rows to sample in the range of 0 to 1,000,000 (inclusive). Either frac or n should be provided.

  • seed – Specifies a seed value to make the sampling deterministic. Can be any integer between 0 and 2147483647 inclusive. Default value is None.

  • sampling_method – Specifies the sampling method to use, where p is the sampling percentage (frac × 100):

      - “BERNOULLI” (or “ROW”): Includes each row with a probability of p/100. Similar to flipping a weighted coin for each row.
      - “SYSTEM” (or “BLOCK”): Includes each block of rows with a probability of p/100. Similar to flipping a weighted coin for each block of rows. This method does not support fixed-size sampling.

    Default is None, in which case Snowflake uses “ROW”.
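The parameter rules above can be summarized in a small, pure-Python sketch. `build_sample_clause` is a hypothetical helper written for illustration only; it is not part of Snowpark and does not claim to mirror the library's internals. It assumes that exactly one of frac or n is passed, that frac lies between 0.0 and 1.0, and that the arguments correspond to a Snowflake SQL `SAMPLE` clause in which the percentage is frac × 100.

```python
def build_sample_clause(frac=None, n=None, *, seed=None, sampling_method=None):
    """Hypothetical helper: check the documented constraints and
    render the Snowflake SAMPLE clause the arguments correspond to."""
    # Assumption of this sketch: exactly one of frac / n is given.
    if (frac is None) == (n is None):
        raise ValueError("provide exactly one of frac or n")
    # Assumption of this sketch: frac is a fraction, not a percentage.
    if frac is not None and not 0.0 <= frac <= 1.0:
        raise ValueError("frac must be between 0.0 and 1.0")
    if n is not None and not 0 <= n <= 1_000_000:
        raise ValueError("n must be between 0 and 1,000,000 (inclusive)")
    if seed is not None and not 0 <= seed <= 2_147_483_647:
        raise ValueError("seed must be between 0 and 2147483647 (inclusive)")

    method = (sampling_method or "").upper()
    if method not in ("", "BERNOULLI", "ROW", "SYSTEM", "BLOCK"):
        raise ValueError(f"unknown sampling method: {sampling_method!r}")
    # Fixed-size sampling is documented as unsupported for SYSTEM | BLOCK.
    if n is not None and method in ("SYSTEM", "BLOCK"):
        raise ValueError("fixed-size sampling requires BERNOULLI | ROW")

    # The SQL clause takes a percentage p, so a fraction is scaled by 100.
    size = str(frac * 100.0) if frac is not None else f"{n} ROWS"
    parts = ["SAMPLE"]
    if method:
        parts.append(method)
    parts.append(f"({size})")
    if seed is not None:
        parts.append(f"SEED ({seed})")
    return " ".join(parts)


print(build_sample_clause(frac=0.5, seed=42, sampling_method="BERNOULLI"))
# SAMPLE BERNOULLI (50.0) SEED (42)
print(build_sample_clause(n=100))
# SAMPLE (100 ROWS)
```

Omitting the method name when sampling_method is None leaves the choice to the database, which matches the documented default of “ROW”.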

Note

  • SYSTEM | BLOCK sampling is often faster than BERNOULLI | ROW sampling.

  • Sampling without a seed is often faster than sampling with a seed.

  • Fixed-size sampling can be slower than equivalent fraction-based sampling because fixed-size sampling prevents some query optimization.

  • Fixed-size sampling doesn’t work with SYSTEM | BLOCK sampling.
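
As a usage sketch, the calls below exercise the three common combinations of arguments. It assumes an already-connected Snowpark Session is passed in as `session`; the function name `sample_orders` and the table name `orders` are hypothetical and stand in for your own objects.

```python
def sample_orders(session, table_name="orders"):
    """Return a few sampled DataFrames from one table.

    `session` is assumed to be a connected snowflake.snowpark.Session;
    `table_name` is a hypothetical table used for illustration.
    """
    table = session.table(table_name)
    return {
        # About 10% of the rows; the seed makes the sample repeatable.
        "fraction_seeded": table.sample(frac=0.1, seed=42),
        # Fixed-size sample of 100 rows using the default row-level method.
        "fixed_size": table.sample(n=100),
        # Block-level sampling, which is only valid together with frac.
        "block": table.sample(frac=0.1, sampling_method="SYSTEM"),
    }
```

Each value in the returned dictionary is a DataFrame, so no query runs until an action such as `collect()` or `show()` is called on it.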