Dataset support for Snowpark Connect for Spark (Java/Scala)

class Dataset[T] extends Serializable

A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations.

Operations available on Datasets are divided into transformations and actions. Transformations produce new Datasets, and actions trigger computation and return results. Example transformations include map, filter, select, and groupBy. Example actions include count, show, and collect.
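As a brief illustration, the following Scala sketch builds a small Dataset, describes transformations lazily, and then triggers computation with actions. It assumes an active SparkSession named `spark` (how the session is created depends on your Snowpark Connect for Spark setup); the data and names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumes a session already configured for Snowpark Connect for Spark.
val spark: SparkSession = SparkSession.builder().getOrCreate()
import spark.implicits._

val words = Seq("spark", "snowpark", "dataset").toDS()

// Transformations: lazily describe the computation; nothing runs yet.
val longWords = words.filter(_.length > 5).map(_.toUpperCase)

// Actions: trigger execution and return results to the client.
println(longWords.count()) // number of matching rows
longWords.show()           // prints the rows in tabular form
```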

Snowpark Connect for Spark supports the Spark 3.5 Dataset API for Java and Scala. Because there is a single JVM client, support does not differ significantly between Java and Scala. This topic describes all supported and unsupported APIs.

For detailed DataFrame API support, see DataFrame support for Snowpark Connect for Spark.

Methods

The following table lists the Dataset methods that are supported in Snowpark Connect for Spark. Methods that are not supported are listed separately under Unsupported APIs.

| Method | Description |
| --- | --- |
| `agg(Column, Column*)` | Aggregates on the entire Dataset without groups. Shorthand for `groupBy().agg()`. |
| `agg(Map[String, String])` | (Scala-specific) Aggregates on the entire Dataset without groups, using a map of column names to aggregate functions. |
| `agg(java.util.Map[String, String])` | (Java-specific) Aggregates on the entire Dataset without groups, using a map of column names to aggregate functions. |
| `agg((String, String), (String, String)*)` | (Scala-specific) Aggregates on the entire Dataset without groups, using pairs of column names and aggregate function names. |
| `alias(String)` | Returns a new Dataset with an alias set. Same as `as`. |
| `alias(Symbol)` | (Scala-specific) Returns a new Dataset with an alias set. |
| `apply(String)` | Selects a column based on the column name and returns it as a Column. |
| `as(String)` | Returns a new Dataset with an alias set. |
| `as(Symbol)` | (Scala-specific) Returns a new Dataset with an alias set. |
| `as[U](Encoder[U])` | Returns a new Dataset where each record has been mapped to the specified type `U`. |
| `cache()` | Persists this Dataset with the default storage level (`MEMORY_AND_DISK`). |
| `coalesce(Int)` | Returns a new Dataset that has exactly `numPartitions` partitions, when fewer partitions are requested. |
| `col(String)` | Selects a column based on the column name and returns it as a Column. |
| `colRegex(String)` | Selects a column based on the column name specified as a regex and returns it as a Column. |
| `collect()` | Returns an `Array[T]` that contains all rows in this Dataset. |
| `collectAsList()` | Returns a `java.util.List[T]` that contains all rows in this Dataset. |
| `count()` | Returns the number of rows in the Dataset as a `Long`. |
| `createGlobalTempView(String)` | Creates a global temporary view using the given name. |
| `createOrReplaceGlobalTempView(String)` | Creates or replaces a global temporary view using the given name. |
| `createOrReplaceTempView(String)` | Creates or replaces a local temporary view using the given name. |
| `createTempView(String)` | Creates a local temporary view using the given name. |
| `crossJoin(Dataset[_])` | Explicit cartesian join with another DataFrame. |
| `cube(String, String*)` | Creates a multi-dimensional cube for the current Dataset using column names for running aggregations. |
| `cube(Column*)` | Creates a multi-dimensional cube for the current Dataset using Column expressions for running aggregations. |
| `describe(String*)` | Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. |
| `distinct()` | Returns a new Dataset that contains only the unique rows. This is an alias for `dropDuplicates`. |
| `drop(String)` | Returns a new Dataset with a column dropped by name. This is a no-op if the schema doesn’t contain the column name. |
| `drop(String*)` | Returns a new Dataset with multiple columns dropped by name. |
| `drop(Column)` | Returns a new Dataset with a column dropped. Accepts a Column rather than a name. |
| `drop(Column, Column*)` | Returns a new Dataset with multiple columns dropped using Column expressions. |
| `dropDuplicates()` | Returns a new Dataset with duplicate rows removed. |
| `dropDuplicates(Seq[String])` | (Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the given subset of columns. |
| `dropDuplicates(Array[String])` | (Java-specific) Returns a new Dataset with duplicate rows removed, considering only the given subset of columns. |
| `dropDuplicates(String, String*)` | Returns a new Dataset with duplicate rows removed, considering only the given subset of columns. |
| `except(Dataset[T])` | Returns a new Dataset containing rows in this Dataset but not in another Dataset. Equivalent to `EXCEPT DISTINCT` in SQL. |
| `exceptAll(Dataset[T])` | Returns a new Dataset containing rows in this Dataset but not in another Dataset, while preserving duplicates. Equivalent to `EXCEPT ALL` in SQL. |
| `explain()` | Prints the physical plan to the console for debugging purposes. |
| `explain(Boolean)` | Prints the plans (logical and physical) to the console for debugging purposes. |
| `explain(String)` | Prints the plans with a format specified by a given explain mode (`simple`, `extended`, `codegen`, `cost`, `formatted`). |
| `filter(Column)` | Filters rows using the given Column condition. |
| `filter(String)` | Filters rows using the given SQL expression string. |
| `filter(FilterFunction[T])` | (Java-specific) Returns a new Dataset that only contains elements where `func` returns true. |
| `filter(T => Boolean)` | (Scala-specific) Returns a new Dataset that only contains elements where `func` returns true. |
| `first()` | Returns the first row. Alias for `head()`. |
| `flatMap[U](FlatMapFunction[T, U], Encoder[U])` | (Java-specific) Returns a new Dataset by first applying a function to all elements and then flattening the results. |
| `flatMap[U](T => TraversableOnce[U])(Encoder[U])` | (Scala-specific) Returns a new Dataset by first applying a function to all elements and then flattening the results. |
| `foreach(ForeachFunction[T])` | (Java-specific) Runs `func` on each element of this Dataset. |
| `foreach(T => Unit)` | (Scala-specific) Applies a function to all rows. |
| `foreachPartition(ForeachPartitionFunction[T])` | (Java-specific) Runs `func` on each partition of this Dataset. |
| `foreachPartition(Iterator[T] => Unit)` | (Scala-specific) Applies a function to each partition of this Dataset. |
| `groupBy(Column*)` | Groups the Dataset using the specified Column expressions for running aggregations. |
| `groupBy(String, String*)` | Groups the Dataset using the specified column names for running aggregations. |
| `groupByKey[K](MapFunction[T, K], Encoder[K])` | (Java-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key function. |
| `groupByKey[K](T => K)(Encoder[K])` | (Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key function. |
| `head()` | Returns the first row. |
| `head(Int)` | Returns the first `n` rows as an `Array[T]`. |
| `hint(String, Any*)` | Specifies a hint on the current Dataset (for example, a broadcast hint for joins). |
| `intersect(Dataset[T])` | Returns a new Dataset containing rows only in both this Dataset and another Dataset. Equivalent to `INTERSECT` in SQL. |
| `intersectAll(Dataset[T])` | Returns a new Dataset containing rows only in both Datasets, while preserving duplicates. Equivalent to `INTERSECT ALL` in SQL. |
| `join(Dataset[_])` | Joins with another DataFrame. Behaves as an inner join and requires a subsequent join predicate. |
| `join(Dataset[_], String)` | Inner equi-join with another DataFrame using the given column name. |
| `join(Dataset[_], Seq[String])` | (Scala-specific) Inner equi-join with another DataFrame using the given column names. |
| `join(Dataset[_], Array[String])` | (Java-specific) Inner equi-join with another DataFrame using the given column names. |
| `join(Dataset[_], String, String)` | Equi-join with another DataFrame using the given column name and join type. |
| `join(Dataset[_], Seq[String], String)` | (Scala-specific) Equi-join with another DataFrame using the given column names and join type. |
| `join(Dataset[_], Array[String], String)` | (Java-specific) Equi-join with another DataFrame using the given column names and join type. |
| `join(Dataset[_], Column)` | Inner join with another DataFrame using the given join expression. |
| `join(Dataset[_], Column, String)` | Joins with another DataFrame using the given join expression and join type. |
| `joinWith[U](Dataset[U], Column)` | Inner equi-join with another Dataset, returning a `Dataset[(T, U)]` for each pair where the condition evaluates to true. |
| `joinWith[U](Dataset[U], Column, String)` | Joins with another Dataset, returning a `Dataset[(T, U)]` for each pair where the condition evaluates to true, using the specified join type. |
| `limit(Int)` | Returns a new Dataset by taking the first `n` rows. |
| `map[U](MapFunction[T, U], Encoder[U])` | (Java-specific) Returns a new Dataset that contains the result of applying `func` to each element. |
| `map[U](T => U)(Encoder[U])` | (Scala-specific) Returns a new Dataset that contains the result of applying `func` to each element. |
| `mapPartitions[U](MapPartitionsFunction[T, U], Encoder[U])` | (Java-specific) Returns a new Dataset that contains the result of applying `func` to each partition. |
| `mapPartitions[U](Iterator[T] => Iterator[U])(Encoder[U])` | (Scala-specific) Returns a new Dataset that contains the result of applying `func` to each partition. |
| `melt(Array[Column], Array[Column], String, String)` | Unpivots a DataFrame from wide format to long format. This is an alias for `unpivot`. |
| `melt(Array[Column], String, String)` | Unpivots a DataFrame from wide format to long format, where values are set to all non-id columns. |
| `metadataColumn(String)` | Selects a metadata column based on its logical column name and returns it as a Column. |
| `observe(String, Column, Column*)` | Defines named metrics to observe on the Dataset. |
| `observe(Observation, Column, Column*)` | Observes named metrics through an Observation instance. |
| `offset(Int)` | Returns a new Dataset by skipping the first `n` rows. |
| `orderBy(Column*)` | Returns a new Dataset sorted by the given Column expressions. This is an alias for `sort`. |
| `orderBy(String, String*)` | Returns a new Dataset sorted by the given column names. |
| `persist()` | Persists this Dataset with the default storage level (`MEMORY_AND_DISK`). |
| `persist(StorageLevel)` | Persists this Dataset with the given StorageLevel. |
| `printSchema()` | Prints the schema to the console in a nice tree format. |
| `printSchema(Int)` | Prints the schema up to the given level to the console in a nice tree format. |
| `repartition(Int)` | Returns a new Dataset that has exactly `numPartitions` partitions. |
| `repartition(Int, Column*)` | Returns a new Dataset hash-partitioned by the given Column expressions into `numPartitions` partitions. |
| `repartition(Column*)` | Returns a new Dataset hash-partitioned by the given partitioning Column expressions. |
| `repartitionByRange(Int, Column*)` | Returns a new Dataset range-partitioned by the given Column expressions into `numPartitions` partitions. |
| `repartitionByRange(Column*)` | Returns a new Dataset range-partitioned by the given partitioning Column expressions. |
| `rollup(String, String*)` | Creates a multi-dimensional rollup for the current Dataset using column names for running aggregations. |
| `rollup(Column*)` | Creates a multi-dimensional rollup for the current Dataset using Column expressions for running aggregations. |
| `sameSemantics(Dataset[T])` | Returns true when the logical query plans inside both Datasets are equal and therefore return the same results. |
| `sample(Double)` | Returns a new Dataset by sampling a fraction of rows (without replacement), using a random seed. |
| `sample(Double, Long)` | Returns a new Dataset by sampling a fraction of rows (without replacement), using a user-supplied seed. |
| `sample(Boolean, Double)` | Returns a new Dataset by sampling a fraction of rows, using a random seed. |
| `sample(Boolean, Double, Long)` | Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed. |
| `select(Column*)` | Selects a set of column-based expressions. |
| `select(String, String*)` | Selects a set of columns by name. |
| `select[U1](TypedColumn[T, U1])` | Returns a new Dataset by computing the given TypedColumn expression for each element. |
| `selectExpr(String*)` | Selects a set of SQL expressions. This is a variant of `select` that accepts SQL expression strings. |
| `semanticHash()` | Returns a hash code of the logical query plan of this Dataset. |
| `show()` | Displays the top 20 rows of the Dataset in tabular form. |
| `show(Int)` | Displays the Dataset in tabular form, showing `numRows` rows. |
| `show(Boolean)` | Displays the top 20 rows with truncation control. |
| `show(Int, Boolean)` | Displays the Dataset in tabular form with truncation control. |
| `show(Int, Int)` | Displays the Dataset in tabular form, truncating strings to the given character count. |
| `show(Int, Int, Boolean)` | Displays the Dataset in tabular form with truncation and vertical display options. |
| `sort(Column*)` | Returns a new Dataset sorted by the given Column expressions. |
| `sort(String, String*)` | Returns a new Dataset sorted by the specified column names, all in ascending order. |
| `summary(String*)` | Computes the specified statistics for numeric and string columns. Available statistics include count, mean, stddev, min, max, arbitrary percentiles, count_distinct, and approx_count_distinct. |
| `tail(Int)` | Returns the last `n` rows in the Dataset as an `Array[T]`. |
| `take(Int)` | Returns the first `n` rows in the Dataset as an `Array[T]`. |
| `takeAsList(Int)` | Returns the first `n` rows in the Dataset as a `java.util.List[T]`. |
| `to(StructType)` | Returns a new DataFrame where each row is reconciled to match the specified schema. |
| `toDF()` | Converts this strongly typed collection of data to a generic DataFrame. |
| `toDF(String*)` | Converts this strongly typed collection of data to a generic DataFrame with columns renamed. |
| `transform[U](Dataset[T] => Dataset[U])` | Concise syntax for chaining custom transformations. |
| `union(Dataset[T])` | Returns a new Dataset containing the union of rows in this Dataset and another Dataset. Equivalent to `UNION ALL` in SQL. Resolves columns by position. |
| `unionAll(Dataset[T])` | Returns a new Dataset containing the union of rows in this Dataset and another Dataset. This is an alias for `union`. |
| `unionByName(Dataset[T])` | Returns a new Dataset containing the union of rows in this Dataset and another Dataset. Resolves columns by name (not by position). |
| `unionByName(Dataset[T], Boolean)` | Returns a new Dataset containing the union of rows, with support for missing columns. Missing columns are filled with null. |
| `unpersist()` | Marks the Dataset as non-persistent and removes all blocks for it from memory and disk. |
| `unpersist(Boolean)` | Marks the Dataset as non-persistent, optionally blocking until all blocks are deleted. |
| `unpivot(Array[Column], Array[Column], String, String)` | Unpivots a DataFrame from wide format to long format, optionally leaving identifier columns set. |
| `unpivot(Array[Column], String, String)` | Unpivots a DataFrame from wide format to long format, where values are set to all non-id columns. |
| `where(Column)` | Filters rows using the given Column condition. This is an alias for `filter`. |
| `where(String)` | Filters rows using the given SQL expression string. |
| `withColumn(String, Column)` | Returns a new Dataset by adding a column or replacing the existing column that has the same name. |
| `withColumnRenamed(String, String)` | Returns a new Dataset with a column renamed. This is a no-op if the schema doesn’t contain the existing name. |
| `withColumns(Map[String, Column])` | (Scala-specific) Returns a new Dataset by adding columns or replacing existing columns that have the same names. |
| `withColumns(java.util.Map[String, Column])` | (Java-specific) Returns a new Dataset by adding columns or replacing existing columns that have the same names. |
| `withColumnsRenamed(Map[String, String])` | (Scala-specific) Returns a new Dataset with columns renamed. This is a no-op if the schema doesn’t contain the existing names. |
| `withColumnsRenamed(java.util.Map[String, String])` | (Java-specific) Returns a new Dataset with columns renamed. This is a no-op if the schema doesn’t contain the existing names. |
| `withMetadata(String, Metadata)` | Returns a new Dataset by updating an existing column with metadata. |
| `writeTo(String)` | Creates a write configuration builder for v2 sources. |
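To show how several of the supported methods compose, here is a hedged Scala sketch. The Datasets `orders` and `customers` and their columns (`customer_id`, `region`, `amount`) are hypothetical and assumed to have been loaded earlier in the session:

```scala
import org.apache.spark.sql.functions.{col, sum}

val regionTotals = orders
  .join(customers, Seq("customer_id"), "left")   // equi-join by column name and join type
  .groupBy(col("region"))                        // relational grouping
  .agg(sum(col("amount")).alias("total_amount")) // aggregate per group
  .orderBy(col("total_amount").desc)

regionTotals.show(10, truncate = false) // action: prints up to 10 untruncated rows
```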

Attributes

| Attribute | Description |
| --- | --- |
| `columns` | Returns all column names as an `Array[String]`. |
| `dtypes` | Returns all column names and their data types as an `Array[(String, String)]`. |
| `encoder` | The `Encoder[T]` for this Dataset’s type. |
| `inputFiles` | Returns a best-effort snapshot of the files that compose this Dataset as an `Array[String]`. |
| `isLocal` | Returns true if `collect` and `take` can be run locally (without any Spark executors). |
| `isStreaming` | Returns true if this Dataset contains one or more sources that continuously return data as it arrives. |
| `na` | Returns a DataFrameNaFunctions for working with missing data. |
| `schema` | Returns the schema of this Dataset as a StructType. |
| `sparkSession` | The SparkSession that created this Dataset. |
| `stat` | Returns a DataFrameStatFunctions for working with statistic functions. |
| `storageLevel` | Gets the Dataset’s current StorageLevel, or `StorageLevel.NONE` if not persisted. |
| `write` | Interface for saving the content of the non-streaming Dataset out into external storage. Returns a `DataFrameWriter[T]`. |

Unsupported APIs

The following Dataset APIs are not currently supported in Snowpark Connect for Spark.

| Method | Description |
| --- | --- |
| `checkpoint()` | Returns a checkpointed version of this Dataset. |
| `checkpoint(Boolean)` | Returns a checkpointed version of this Dataset, optionally eager. |
| `dropDuplicatesWithinWatermark()` | Returns a new Dataset with duplicate rows removed within the watermark. Streaming only. |
| `dropDuplicatesWithinWatermark(Seq[String])` | (Scala-specific) Returns a new Dataset with duplicate rows removed within the watermark, considering only a subset of columns. Streaming only. |
| `dropDuplicatesWithinWatermark(Array[String])` | (Java-specific) Returns a new Dataset with duplicate rows removed within the watermark, considering only a subset of columns. Streaming only. |
| `dropDuplicatesWithinWatermark(String, String*)` | Returns a new Dataset with duplicate rows removed within the watermark, considering only a subset of columns. Streaming only. |
| `isEmpty` | Returns true if the Dataset is empty. |
| `javaRDD` | Returns the content of the Dataset as a `JavaRDD[T]`. |
| `localCheckpoint()` | Locally checkpoints a Dataset and returns the new Dataset. |
| `localCheckpoint(Boolean)` | Locally checkpoints a Dataset, optionally eager. |
| `queryExecution` | The QueryExecution object for this Dataset. |
| `randomSplit(Array[Double])` | Randomly splits this Dataset with the provided weights. |
| `randomSplit(Array[Double], Long)` | Randomly splits this Dataset with the provided weights and seed. |
| `randomSplitAsList(Array[Double], Long)` | Returns a `java.util.List` that contains the randomly split Datasets with the provided weights. |
| `rdd` | Represents the content of the Dataset as an `RDD[T]`. |
| `reduce(ReduceFunction[T])` | (Java-specific) Reduces the elements of this Dataset using the specified binary function. |
| `reduce((T, T) => T)` | (Scala-specific) Reduces the elements of this Dataset using the specified binary function. |
| `sortWithinPartitions(Column*)` | Returns a new Dataset with each partition sorted by the given Column expressions. |
| `sortWithinPartitions(String, String*)` | Returns a new Dataset with each partition sorted by the given column names. |
| `sqlContext` | The legacy SQLContext for this Dataset. |
| `toJSON` | Returns the content of the Dataset as a `Dataset[String]` of JSON strings. |
| `toJavaRDD` | Returns the content of the Dataset as a `JavaRDD[T]`. |
| `toLocalIterator()` | Returns a `java.util.Iterator[T]` that contains all rows in this Dataset. |
| `withWatermark(String, String)` | Defines an event time watermark for this Dataset. |
| `writeStream` | Interface for saving the content of a streaming Dataset out into external storage. Returns a `DataStreamWriter[T]`. |
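For some unsupported members, a supported alternative may suffice. For example, this hedged Scala sketch approximates `isEmpty` and `toLocalIterator()` using supported actions; `ds` is an illustrative Dataset, and `collect()` is only advisable when the result fits in client memory:

```scala
// isEmpty is unsupported; limit(1).count() reads at most one row instead.
val isEmpty: Boolean = ds.limit(1).count() == 0

// toLocalIterator() is unsupported; collect() materializes all rows client-side.
val localIterator = ds.collect().iterator
```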