Dataset support for Snowpark Connect for Spark (Java/Scala)

class Dataset[T] extends Serializable

A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations.

Operations available on Datasets are divided into transformations and actions. Transformations produce new Datasets, and actions trigger computation and return results. Example transformations include map, filter, select, and groupBy. Example actions include count, show, and collect.
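As a brief illustration, the following Scala sketch builds a small Dataset, describes transformations lazily, and then triggers computation with actions. It assumes an active SparkSession named `spark` (how the session is created depends on your Snowpark Connect for Spark setup); the data and names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumes a session already configured for Snowpark Connect for Spark.
val spark: SparkSession = SparkSession.builder().getOrCreate()
import spark.implicits._

val words = Seq("spark", "snowpark", "dataset").toDS()

// Transformations: lazily describe the computation; nothing runs yet.
val longWords = words.filter(_.length > 5).map(_.toUpperCase)

// Actions: trigger execution and return results to the client.
println(longWords.count()) // number of matching rows
longWords.show()           // prints the rows in tabular form
```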

Snowpark Connect for Spark supports the Spark 3.5 Dataset API for Java and Scala. Because there is a single JVM client, support does not differ significantly between Java and Scala. This topic describes all supported and unsupported APIs.

For detailed DataFrame API support, see DataFrame support for Snowpark Connect for Spark.

Methods

The following table lists the Dataset methods that are supported in Snowpark Connect for Spark. Methods that are not supported are listed separately under Unsupported APIs.

| Method | Description |
| --- | --- |
| `agg(Column, Column*)` | Aggregates on the entire Dataset without groups. Shorthand for `groupBy().agg()`. |
| `agg(Map[String, String])` | (Scala-specific) Aggregates on the entire Dataset without groups, using a map of column names to aggregate functions. |
| `agg(java.util.Map[String, String])` | (Java-specific) Aggregates on the entire Dataset without groups, using a map of column names to aggregate functions. |
| `agg((String, String), (String, String)*)` | (Scala-specific) Aggregates on the entire Dataset without groups, using pairs of column names and aggregate function names. |
| `alias(String)` | Returns a new Dataset with an alias set. Same as `as`. |
| `alias(Symbol)` | (Scala-specific) Returns a new Dataset with an alias set. |
| `apply(String)` | Selects a column based on the column name and returns it as a Column. |
| `as(String)` | Returns a new Dataset with an alias set. |
| `as(Symbol)` | (Scala-specific) Returns a new Dataset with an alias set. |
| `as[U](Encoder[U])` | Returns a new Dataset where each record has been mapped to the specified type `U`. |
| `cache()` | Persists this Dataset with the default storage level (`MEMORY_AND_DISK`). |
| `coalesce(Int)` | Returns a new Dataset that has exactly `numPartitions` partitions, when fewer partitions are requested. |
| `col(String)` | Selects a column based on the column name and returns it as a Column. |
| `colRegex(String)` | Selects a column based on the column name specified as a regex and returns it as a Column. |
| `collect()` | Returns an `Array[T]` that contains all rows in this Dataset. |
| `collectAsList()` | Returns a `java.util.List[T]` that contains all rows in this Dataset. |
| `count()` | Returns the number of rows in the Dataset as a `Long`. |
| `createGlobalTempView(String)` | Creates a global temporary view using the given name. |
| `createOrReplaceGlobalTempView(String)` | Creates or replaces a global temporary view using the given name. |
| `createOrReplaceTempView(String)` | Creates or replaces a local temporary view using the given name. |
| `createTempView(String)` | Creates a local temporary view using the given name. |
| `crossJoin(Dataset[_])` | Explicit cartesian join with another DataFrame. |
| `cube(String, String*)` | Creates a multi-dimensional cube for the current Dataset using column names for running aggregations. |
| `cube(Column*)` | Creates a multi-dimensional cube for the current Dataset using Column expressions for running aggregations. |
| `describe(String*)` | Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. |
| `distinct()` | Returns a new Dataset that contains only the unique rows. This is an alias for `dropDuplicates`. |
| `drop(String)` | Returns a new Dataset with a column dropped by name. This is a no-op if the schema doesn’t contain the column name. |
| `drop(String*)` | Returns a new Dataset with multiple columns dropped by name. |
| `drop(Column)` | Returns a new Dataset with a column dropped. Accepts a Column rather than a name. |
| `drop(Column, Column*)` | Returns a new Dataset with multiple columns dropped using Column expressions. |
| `dropDuplicates()` | Returns a new Dataset with duplicate rows removed. |
| `dropDuplicates(Seq[String])` | (Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the given subset of columns. |
| `dropDuplicates(Array[String])` | (Java-specific) Returns a new Dataset with duplicate rows removed, considering only the given subset of columns. |
| `dropDuplicates(String, String*)` | Returns a new Dataset with duplicate rows removed, considering only the given subset of columns. |
| `except(Dataset[T])` | Returns a new Dataset containing rows in this Dataset but not in another Dataset. Equivalent to `EXCEPT DISTINCT` in SQL. |
| `exceptAll(Dataset[T])` | Returns a new Dataset containing rows in this Dataset but not in another Dataset, while preserving duplicates. Equivalent to `EXCEPT ALL` in SQL. |
| `explain()` | Prints the physical plan to the console for debugging purposes. |
| `explain(Boolean)` | Prints the plans (logical and physical) to the console for debugging purposes. |
| `explain(String)` | Prints the plans with a format specified by a given explain mode (`simple`, `extended`, `codegen`, `cost`, `formatted`). |
| `filter(Column)` | Filters rows using the given Column condition. |
| `filter(String)` | Filters rows using the given SQL expression string. |
| `filter(FilterFunction[T])` | (Java-specific) Returns a new Dataset that only contains elements where `func` returns true. |
| `filter(T => Boolean)` | (Scala-specific) Returns a new Dataset that only contains elements where `func` returns true. |
| `first()` | Returns the first row. Alias for `head()`. |
| `flatMap[U](FlatMapFunction[T, U], Encoder[U])` | (Java-specific) Returns a new Dataset by first applying a function to all elements and then flattening the results. |
| `flatMap[U](T => TraversableOnce[U])(Encoder[U])` | (Scala-specific) Returns a new Dataset by first applying a function to all elements and then flattening the results. |
| `foreach(ForeachFunction[T])` | (Java-specific) Runs `func` on each element of this Dataset. |
| `foreach(T => Unit)` | (Scala-specific) Applies a function to all rows. |
| `foreachPartition(ForeachPartitionFunction[T])` | (Java-specific) Runs `func` on each partition of this Dataset. |
| `foreachPartition(Iterator[T] => Unit)` | (Scala-specific) Applies a function to each partition of this Dataset. |
| `groupBy(Column*)` | Groups the Dataset using the specified Column expressions for running aggregations. |
| `groupBy(String, String*)` | Groups the Dataset using the specified column names for running aggregations. |
| `groupByKey[K](MapFunction[T, K], Encoder[K])` | (Java-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key function. |
| `groupByKey[K](T => K)(Encoder[K])` | (Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key function. |
| `head()` | Returns the first row. |
| `head(Int)` | Returns the first `n` rows as an `Array[T]`. |
| `hint(String, Any*)` | Specifies a hint on the current Dataset (for example, a broadcast hint for joins). |
| `intersect(Dataset[T])` | Returns a new Dataset containing rows only in both this Dataset and another Dataset. Equivalent to `INTERSECT` in SQL. |
| `intersectAll(Dataset[T])` | Returns a new Dataset containing rows only in both Datasets, while preserving duplicates. Equivalent to `INTERSECT ALL` in SQL. |
| `join(Dataset[_])` | Joins with another DataFrame. Behaves as an inner join and requires a subsequent join predicate. |
| `join(Dataset[_], String)` | Inner equi-join with another DataFrame using the given column name. |
| `join(Dataset[_], Seq[String])` | (Scala-specific) Inner equi-join with another DataFrame using the given column names. |
| `join(Dataset[_], Array[String])` | (Java-specific) Inner equi-join with another DataFrame using the given column names. |
| `join(Dataset[_], String, String)` | Equi-join with another DataFrame using the given column name and join type. |
| `join(Dataset[_], Seq[String], String)` | (Scala-specific) Equi-join with another DataFrame using the given column names and join type. |
| `join(Dataset[_], Array[String], String)` | (Java-specific) Equi-join with another DataFrame using the given column names and join type. |
| `join(Dataset[_], Column)` | Inner join with another DataFrame using the given join expression. |
| `join(Dataset[_], Column, String)` | Joins with another DataFrame using the given join expression and join type. |
| `joinWith[U](Dataset[U], Column)` | Inner equi-join with another Dataset, returning a `Dataset[(T, U)]` for each pair where the condition evaluates to true. |
| `joinWith[U](Dataset[U], Column, String)` | Joins with another Dataset, returning a `Dataset[(T, U)]` for each pair where the condition evaluates to true, using the specified join type. |
| `limit(Int)` | Returns a new Dataset by taking the first `n` rows. |
| `map[U](MapFunction[T, U], Encoder[U])` | (Java-specific) Returns a new Dataset that contains the result of applying `func` to each element. |
| `map[U](T => U)(Encoder[U])` | (Scala-specific) Returns a new Dataset that contains the result of applying `func` to each element. |
| `mapPartitions[U](MapPartitionsFunction[T, U], Encoder[U])` | (Java-specific) Returns a new Dataset that contains the result of applying `func` to each partition. |
| `mapPartitions[U](Iterator[T] => Iterator[U])(Encoder[U])` | (Scala-specific) Returns a new Dataset that contains the result of applying `func` to each partition. |
| `melt(Array[Column], Array[Column], String, String)` | Unpivots a DataFrame from wide format to long format. This is an alias for `unpivot`. |
| `melt(Array[Column], String, String)` | Unpivots a DataFrame from wide format to long format, where values are set to all non-id columns. |
| `metadataColumn(String)` | Selects a metadata column based on its logical column name and returns it as a Column. |
| `observe(String, Column, Column*)` | Defines named metrics to observe on the Dataset. |
| `observe(Observation, Column, Column*)` | Observes named metrics through an Observation instance. |
| `offset(Int)` | Returns a new Dataset by skipping the first `n` rows. |
| `orderBy(Column*)` | Returns a new Dataset sorted by the given Column expressions. This is an alias for `sort`. |
| `orderBy(String, String*)` | Returns a new Dataset sorted by the given column names. |
| `persist()` | Persists this Dataset with the default storage level (`MEMORY_AND_DISK`). |
| `persist(StorageLevel)` | Persists this Dataset with the given StorageLevel. |
| `printSchema()` | Prints the schema to the console in a nice tree format. |
| `printSchema(Int)` | Prints the schema up to the given level to the console in a nice tree format. |
| `repartition(Int)` | Returns a new Dataset that has exactly `numPartitions` partitions. |
| `repartition(Int, Column*)` | Returns a new Dataset hash-partitioned by the given Column expressions into `numPartitions` partitions. |
| `repartition(Column*)` | Returns a new Dataset hash-partitioned by the given partitioning Column expressions. |
| `repartitionByRange(Int, Column*)` | Returns a new Dataset range-partitioned by the given Column expressions into `numPartitions` partitions. |
| `repartitionByRange(Column*)` | Returns a new Dataset range-partitioned by the given partitioning Column expressions. |
| `rollup(String, String*)` | Creates a multi-dimensional rollup for the current Dataset using column names for running aggregations. |
| `rollup(Column*)` | Creates a multi-dimensional rollup for the current Dataset using Column expressions for running aggregations. |
| `sameSemantics(Dataset[T])` | Returns true when the logical query plans inside both Datasets are equal and therefore return the same results. |
| `sample(Double)` | Returns a new Dataset by sampling a fraction of rows (without replacement), using a random seed. |
| `sample(Double, Long)` | Returns a new Dataset by sampling a fraction of rows (without replacement), using a user-supplied seed. |
| `sample(Boolean, Double)` | Returns a new Dataset by sampling a fraction of rows, using a random seed. |
| `sample(Boolean, Double, Long)` | Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed. |
| `select(Column*)` | Selects a set of column-based expressions. |
| `select(String, String*)` | Selects a set of columns by name. |
| `select[U1](TypedColumn[T, U1])` | Returns a new Dataset by computing the given TypedColumn expression for each element. |
| `selectExpr(String*)` | Selects a set of SQL expressions. This is a variant of `select` that accepts SQL expression strings. |
| `semanticHash()` | Returns a hash code of the logical query plan of this Dataset. |
| `show()` | Displays the top 20 rows of the Dataset in tabular form. |
| `show(Int)` | Displays the Dataset in tabular form, showing `numRows` rows. |
| `show(Boolean)` | Displays the top 20 rows with truncation control. |
| `show(Int, Boolean)` | Displays the Dataset in tabular form with truncation control. |
| `show(Int, Int)` | Displays the Dataset in tabular form, truncating strings to the given character count. |
| `show(Int, Int, Boolean)` | Displays the Dataset in tabular form with truncation and vertical display options. |
| `sort(Column*)` | Returns a new Dataset sorted by the given Column expressions. |
| `sort(String, String*)` | Returns a new Dataset sorted by the specified column names, all in ascending order. |
| `summary(String*)` | Computes the specified statistics for numeric and string columns. Available statistics include count, mean, stddev, min, max, arbitrary percentiles, count_distinct, and approx_count_distinct. |
| `tail(Int)` | Returns the last `n` rows in the Dataset as an `Array[T]`. |
| `take(Int)` | Returns the first `n` rows in the Dataset as an `Array[T]`. |
| `takeAsList(Int)` | Returns the first `n` rows in the Dataset as a `java.util.List[T]`. |
| `to(StructType)` | Returns a new DataFrame where each row is reconciled to match the specified schema. |
| `toDF()` | Converts this strongly typed collection of data to a generic DataFrame. |
| `toDF(String*)` | Converts this strongly typed collection of data to a generic DataFrame with columns renamed. |
| `transform[U](Dataset[T] => Dataset[U])` | Concise syntax for chaining custom transformations. |
| `union(Dataset[T])` | Returns a new Dataset containing the union of rows in this Dataset and another Dataset. Equivalent to `UNION ALL` in SQL. Resolves columns by position. |
| `unionAll(Dataset[T])` | Returns a new Dataset containing the union of rows in this Dataset and another Dataset. This is an alias for `union`. |
| `unionByName(Dataset[T])` | Returns a new Dataset containing the union of rows in this Dataset and another Dataset. Resolves columns by name (not by position). |
| `unionByName(Dataset[T], Boolean)` | Returns a new Dataset containing the union of rows, with support for missing columns. Missing columns are filled with null. |
| `unpersist()` | Marks the Dataset as non-persistent and removes all blocks for it from memory and disk. |
| `unpersist(Boolean)` | Marks the Dataset as non-persistent, optionally blocking until all blocks are deleted. |
| `unpivot(Array[Column], Array[Column], String, String)` | Unpivots a DataFrame from wide format to long format, optionally leaving identifier columns set. |
| `unpivot(Array[Column], String, String)` | Unpivots a DataFrame from wide format to long format, where values are set to all non-id columns. |
| `where(Column)` | Filters rows using the given Column condition. This is an alias for `filter`. |
| `where(String)` | Filters rows using the given SQL expression string. |
| `withColumn(String, Column)` | Returns a new Dataset by adding a column or replacing the existing column that has the same name. |
| `withColumnRenamed(String, String)` | Returns a new Dataset with a column renamed. This is a no-op if the schema doesn’t contain the existing name. |
| `withColumns(Map[String, Column])` | (Scala-specific) Returns a new Dataset by adding columns or replacing existing columns that have the same names. |
| `withColumns(java.util.Map[String, Column])` | (Java-specific) Returns a new Dataset by adding columns or replacing existing columns that have the same names. |
| `withColumnsRenamed(Map[String, String])` | (Scala-specific) Returns a new Dataset with columns renamed. This is a no-op if the schema doesn’t contain the existing names. |
| `withColumnsRenamed(java.util.Map[String, String])` | (Java-specific) Returns a new Dataset with columns renamed. This is a no-op if the schema doesn’t contain the existing names. |
| `withMetadata(String, Metadata)` | Returns a new Dataset by updating an existing column with metadata. |
| `writeTo(String)` | Creates a write configuration builder for v2 sources. |
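To show how several of the supported methods compose, here is a hedged Scala sketch. The Datasets `orders` and `customers` and their columns (`customer_id`, `region`, `amount`) are hypothetical and assumed to have been loaded earlier in the session:

```scala
import org.apache.spark.sql.functions.{col, sum}

val regionTotals = orders
  .join(customers, Seq("customer_id"), "left")   // equi-join by column name and join type
  .groupBy(col("region"))                        // relational grouping
  .agg(sum(col("amount")).alias("total_amount")) // aggregate per group
  .orderBy(col("total_amount").desc)

regionTotals.show(10, truncate = false) // action: prints up to 10 untruncated rows
```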

Attributes

| Attribute | Description |
| --- | --- |
| `columns` | Returns all column names as an `Array[String]`. |
| `dtypes` | Returns all column names and their data types as an `Array[(String, String)]`. |
| `encoder` | The `Encoder[T]` for this Dataset’s type. |
| `inputFiles` | Returns a best-effort snapshot of the files that compose this Dataset as an `Array[String]`. |
| `isLocal` | Returns true if `collect` and `take` can be run locally (without any Spark executors). |
| `isStreaming` | Returns true if this Dataset contains one or more sources that continuously return data as it arrives. |
| `na` | Returns a DataFrameNaFunctions for working with missing data. |
| `schema` | Returns the schema of this Dataset as a StructType. |
| `sparkSession` | The SparkSession that created this Dataset. |
| `stat` | Returns a DataFrameStatFunctions for working with statistic functions. |
| `storageLevel` | Gets the Dataset’s current StorageLevel, or `StorageLevel.NONE` if not persisted. |
| `write` | Interface for saving the content of the non-streaming Dataset out into external storage. Returns a `DataFrameWriter[T]`. |

Unsupported APIs

The following Dataset APIs are not currently supported in Snowpark Connect for Spark.

| Method | Description |
| --- | --- |
| `checkpoint()` | Returns a checkpointed version of this Dataset. |
| `checkpoint(Boolean)` | Returns a checkpointed version of this Dataset, optionally eager. |
| `dropDuplicatesWithinWatermark()` | Returns a new Dataset with duplicate rows removed within the watermark. Streaming only. |
| `dropDuplicatesWithinWatermark(Seq[String])` | (Scala-specific) Returns a new Dataset with duplicate rows removed within the watermark, considering only a subset of columns. Streaming only. |
| `dropDuplicatesWithinWatermark(Array[String])` | (Java-specific) Returns a new Dataset with duplicate rows removed within the watermark, considering only a subset of columns. Streaming only. |
| `dropDuplicatesWithinWatermark(String, String*)` | Returns a new Dataset with duplicate rows removed within the watermark, considering only a subset of columns. Streaming only. |
| `isEmpty` | Returns true if the Dataset is empty. |
| `javaRDD` | Returns the content of the Dataset as a `JavaRDD[T]`. |
| `localCheckpoint()` | Locally checkpoints a Dataset and returns the new Dataset. |
| `localCheckpoint(Boolean)` | Locally checkpoints a Dataset, optionally eager. |
| `queryExecution` | The QueryExecution object for this Dataset. |
| `randomSplit(Array[Double])` | Randomly splits this Dataset with the provided weights. |
| `randomSplit(Array[Double], Long)` | Randomly splits this Dataset with the provided weights and seed. |
| `randomSplitAsList(Array[Double], Long)` | Returns a `java.util.List` that contains the randomly split Datasets with the provided weights. |
| `rdd` | Represents the content of the Dataset as an `RDD[T]`. |
| `reduce(ReduceFunction[T])` | (Java-specific) Reduces the elements of this Dataset using the specified binary function. |
| `reduce((T, T) => T)` | (Scala-specific) Reduces the elements of this Dataset using the specified binary function. |
| `sortWithinPartitions(Column*)` | Returns a new Dataset with each partition sorted by the given Column expressions. |
| `sortWithinPartitions(String, String*)` | Returns a new Dataset with each partition sorted by the given column names. |
| `sqlContext` | The legacy SQLContext for this Dataset. |
| `toJSON` | Returns the content of the Dataset as a `Dataset[String]` of JSON strings. |
| `toJavaRDD` | Returns the content of the Dataset as a `JavaRDD[T]`. |
| `toLocalIterator()` | Returns a `java.util.Iterator[T]` that contains all rows in this Dataset. |
| `withWatermark(String, String)` | Defines an event time watermark for this Dataset. |
| `writeStream` | Interface for saving the content of a streaming Dataset out into external storage. Returns a `DataStreamWriter[T]`. |
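For some unsupported members, a supported alternative may suffice. For example, this hedged Scala sketch approximates `isEmpty` and `toLocalIterator()` using supported actions; `ds` is an illustrative Dataset, and `collect()` is only advisable when the result fits in client memory:

```scala
// isEmpty is unsupported; limit(1).count() reads at most one row instead.
val isEmpty: Boolean = ds.limit(1).count() == 0

// toLocalIterator() is unsupported; collect() materializes all rows client-side.
val localIterator = ds.collect().iterator
```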