snowflake.snowpark.Table.sample

Table.sample(frac: Optional[float] = None, n: Optional[int] = None, *, seed: Optional[int] = None, sampling_method: Optional[str] = None) → DataFrame

Samples rows based on either the number of rows to be returned or a percentage of rows to be returned.

Sampling with a seed is not supported on views or subqueries. Because this method operates on a table, it does accept a seed, which is the main difference between it and DataFrame.sample().
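A minimal usage sketch of seeded sampling on a table. It assumes an already configured Snowpark `session` and a table named `"my_table"`; both are hypothetical, as is the helper name `seeded_sample`.

```python
def seeded_sample(session, table_name, frac, seed):
    """Return a repeatable sample of a table.

    `session` is assumed to be an existing snowflake.snowpark.Session and
    `table_name` a table (not a view) that the session can read.
    """
    # session.table() returns a Table, so the seed argument is accepted here.
    table = session.table(table_name)
    return table.sample(frac=frac, seed=seed)


# Hypothetical usage, given a live session:
# df = seeded_sample(session, "my_table", frac=0.1, seed=42)
# df.show()
```

With the same seed and unchanged table contents, repeated calls should select the same rows.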

Parameters:
  • frac – The fraction of rows to be sampled, given as a value between 0.0 and 1.0 (for example, 0.1 samples roughly 10% of the rows).

  • n – The fixed number of rows to sample in the range of 0 to 1,000,000 (inclusive). Either frac or n should be provided.

  • seed – Specifies a seed value to make the sampling deterministic. Can be any integer between 0 and 2147483647 inclusive. Default value is None.

  • sampling_method – Specifies the sampling method to use:

    - "BERNOULLI" (or "ROW"): Includes each row with a probability of p/100, where p is the sampling percentage. Similar to flipping a weighted coin for each row.
    - "SYSTEM" (or "BLOCK"): Includes each block of rows with a probability of p/100. Similar to flipping a weighted coin for each block of rows. This method does not support fixed-size sampling.

    Default is None, in which case Snowflake uses "ROW".
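The two sampling methods can be illustrated with a plain-Python simulation. This is only a sketch of the weighted-coin idea, not how Snowflake implements sampling: the function names and the `block_size` parameter are made up for illustration, and the database decides for itself how rows are grouped into blocks.

```python
import random


def bernoulli_sample(rows, p, seed=None):
    """Keep each row independently with probability p/100."""
    rng = random.Random(seed)  # a fixed seed makes the result repeatable
    return [row for row in rows if rng.random() < p / 100]


def block_sample(rows, p, block_size, seed=None):
    """Keep or drop whole blocks of rows, each with probability p/100."""
    rng = random.Random(seed)
    kept = []
    for start in range(0, len(rows), block_size):
        # One coin flip decides the fate of the entire block.
        if rng.random() < p / 100:
            kept.extend(rows[start:start + block_size])
    return kept


rows = list(range(1000))
row_level = bernoulli_sample(rows, p=10, seed=7)
block_level = block_sample(rows, p=10, block_size=100, seed=7)
```

In this simulation the row-level sample is scattered across the input, while the block-level sample always consists of whole blocks, so its size is a multiple of `block_size`. Neither method returns an exact row count, which is consistent with block sampling not supporting fixed-size requests.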

Note

  • SYSTEM | BLOCK sampling is often faster than BERNOULLI | ROW sampling.

  • Sampling without a seed is often faster than sampling with a seed.

  • Fixed-size sampling can be slower than equivalent fraction-based sampling because fixed-size sampling prevents some query optimization.

  • Fixed-size sampling doesn't work with SYSTEM | BLOCK sampling.
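The constraints stated on this page can be collected into a small client-side check. The sketch below is hypothetical: `check_sample_args` is not part of Snowpark, it only encodes the limits listed above, and treating method names case-insensitively is an assumption.

```python
MAX_ROWS = 1_000_000
MAX_SEED = 2_147_483_647
ROW_METHODS = {"BERNOULLI", "ROW"}
BLOCK_METHODS = {"SYSTEM", "BLOCK"}


def check_sample_args(frac=None, n=None, *, seed=None, sampling_method=None):
    """Raise ValueError if the arguments contradict the documented limits."""
    if frac is None and n is None:
        raise ValueError("either frac or n should be provided")
    if n is not None and not 0 <= n <= MAX_ROWS:
        raise ValueError(f"n must be between 0 and {MAX_ROWS} inclusive")
    if seed is not None and not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be between 0 and {MAX_SEED} inclusive")
    if sampling_method is not None:
        # Assumption: method names are compared case-insensitively.
        method = sampling_method.upper()
        if method not in ROW_METHODS | BLOCK_METHODS:
            raise ValueError(f"unknown sampling method: {sampling_method!r}")
        if n is not None and method in BLOCK_METHODS:
            raise ValueError("fixed-size sampling doesn't work with SYSTEM | BLOCK")


# Passes: fraction-based block sampling with a seed.
check_sample_args(frac=0.1, seed=42, sampling_method="SYSTEM")
```

Running such a check before calling `Table.sample` surfaces an invalid combination, such as `n=100` with `sampling_method="BLOCK"`, before any query is sent.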