DataFrame support for Snowpark Connect for Spark

Snowpark Connect for Spark provides compatibility with the PySpark 3.5.3 Spark Connect DataFrame API, allowing you to run Spark workloads on Snowflake. This page details which APIs are supported and their compatibility levels. The DataFrame API is shared across PySpark, Java, and Scala clients.
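
For orientation, here is a minimal sketch of the kind of workload this page covers. Session setup details depend on your Snowpark Connect for Spark deployment; the `sc://` endpoint below is a placeholder, and everything after session creation is ordinary PySpark DataFrame code.

```python
from pyspark.sql import SparkSession

# Placeholder endpoint -- substitute the connection details for your
# Snowpark Connect for Spark environment.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], schema=["name", "age"])

# select(), where(), and show() all appear in the compatibility tables below.
df.select("name", "age").where(df.age > 30).show()
```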

Compatibility level definitions

Full compatibility

APIs with full compatibility behave identically to native PySpark. You can use these APIs with confidence that results will match exactly.

High compatibility

APIs with high compatibility work correctly but might have minor differences:

  • Error message formatting might differ.

  • Output display format might vary (such as decimal precision, column name casing).

  • Edge cases might produce slightly different results.

Partial compatibility

APIs with partial compatibility are functional but have notable limitations:

  • Only a subset of functionality might be available.

  • Behavior might differ from PySpark in specific scenarios.

  • Additional configuration might be required.

  • Performance characteristics might differ.

Unsupported

APIs that are not currently implemented or cannot be supported on Snowflake.

Full compatibility APIs

| Group | Method | Description |
| --- | --- | --- |
| DataFrame | cache() | Persists the DataFrame with the default storage level (MEMORY_AND_DISK). |
| DataFrame | coalesce(numPartitions) | Returns a new DataFrame that has exactly numPartitions partitions. |
| DataFrame | collect() | Returns all the records as a list of Row. |
| DataFrame | count() | Returns the number of rows in this DataFrame. |
| DataFrame | crossJoin(other) | Returns the cartesian product with another DataFrame. |
| DataFrame | dropDuplicates([subset]) | Returns a new DataFrame with duplicate rows removed, optionally considering only a subset of columns. |
| DataFrame | drop_duplicates([subset]) | Alias for dropDuplicates. |
| DataFrame | dropna([how, thresh, subset]) | Returns a new DataFrame omitting rows with null values. |
| DataFrame | fillna(value[, subset]) | Replaces null values with the specified value. |
| DataFrame | first() | Returns the first row as a Row. |
| DataFrame | head([n]) | Returns the first n rows. |
| DataFrame | isEmpty() | Returns True if this DataFrame is empty. |
| DataFrame | join(other[, on, how]) | Joins with another DataFrame, using the given join expression. |
| DataFrame | limit(num) | Limits the result count to the number specified. |
| DataFrame | melt(ids, values, …) | Unpivots a DataFrame from wide format to long format. Alias for unpivot. |
| DataFrame | offset(num) | Returns a new DataFrame by skipping the first n rows. |
| DataFrame | persist([storageLevel]) | Sets the storage level to persist the contents of the DataFrame across operations. |
| DataFrame | repartitionByRange(numPartitions, …) | Returns a new DataFrame partitioned by the given partitioning expressions into numPartitions partitions, using range partitioning. |
| DataFrame | replace(to_replace[, value, subset]) | Returns a new DataFrame replacing a value with another value. |
| DataFrame | select(*cols) | Projects a set of expressions and returns a new DataFrame. |
| DataFrame | show([n, truncate, vertical]) | Prints the first n rows to the console. |
| DataFrame | tail(num) | Returns the last num rows as a list of Row. |
| DataFrame | take(num) | Returns the first num rows as a list of Row. |
| DataFrame | toDF(*cols) | Returns a new DataFrame with new column names. |
| DataFrame | toLocalIterator([prefetchPartitions]) | Returns an iterator that contains all of the rows in this DataFrame. |
| DataFrame | toPandas() | Returns the contents of this DataFrame as a pandas.DataFrame. |
| DataFrame | unionAll(other) | Returns a new DataFrame containing the union of rows in this and another DataFrame. Alias for union. |
| DataFrame | unpersist([blocking]) | Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk. |
| DataFrame | unpivot(ids, values, …) | Unpivots a DataFrame from wide format to long format, optionally leaving identifier columns set. |
| DataFrame | where(condition) | Alias for filter. |
| DataFrame | withColumnsRenamed(colsMap) | Returns a new DataFrame by renaming multiple columns. This is a no-op if the schema doesn’t contain the given column names. |
| Column | asc() | Returns a sort expression based on ascending order of the column. |
| Column | between(lowerBound, upperBound) | Checks if values of this expression are between the given columns. |
| Column | contains(other) | Contains the other element. |
| Column | desc() | Returns a sort expression based on descending order of the column. |
| Column | eqNullSafe(other) | Equality test that is safe for null values. |
| Column | getItem(key) | Gets an item at position key out of a list, or an item by key out of a dict. |
| Column | isNull() | Returns True if the current expression is null. |
| Column | isin(*cols) | Returns a boolean Column based on a match against the given values. |
| Column | like(other) | SQL LIKE expression. |
| Column | otherwise(value) | Evaluates a list of conditions and returns one of multiple possible result expressions. |
| Column | startswith(other) | String starts with. |
| Column | substr(startPos, length) | Returns a Column which is a substring of the column. |
| Column | when(condition, value) | Evaluates a list of conditions and returns one of multiple possible result expressions. |
| SparkSession | range(start[, end, step, numPartitions]) | Creates a DataFrame with a single column named id. |
| SparkSession | sql(sqlQuery, args, **kwargs) | Returns a DataFrame representing the result of the given query. |
| SparkSession | table(tableName) | Returns the specified table as a DataFrame. |
| GroupedData | agg(*exprs) | Computes aggregates and returns the result as a DataFrame. |
| GroupedData | mean(*cols) | Computes average values for each numeric column for each group. |
| GroupedData | pivot(pivot_col[, values]) | Pivots a column of the current DataFrame and performs the specified aggregation. |
| DataFrameReader | table(tableName) | Returns the specified table as a DataFrame. |
| DataFrameWriter | mode(saveMode) | Specifies the behavior when data or table already exists. |
| DataFrameWriter | saveAsTable(name[, format, mode, …]) | Saves the content of the DataFrame as the specified table. |
| DataFrameWriter | text(path[, compression, lineSep]) | Saves the content of the DataFrame in a text file at the specified path. |
| DataFrameWriterV2 | replace() | Replaces data in the existing table. |
| Catalog | cacheTable(tableName[, storageLevel]) | Caches the specified table in memory. |
| Catalog | clearCache() | Removes all cached tables from the in-memory cache. |
| Catalog | dropGlobalTempView(viewName) | Drops the global temporary view with the given view name. |
| Catalog | dropTempView(viewName) | Drops the local temporary view with the given view name. |
| Catalog | isCached(tableName) | Returns True if the table is currently cached in memory. |
| Catalog | refreshByPath(path) | Invalidates and refreshes all the cached data for any DataFrame that contains the given data source path. |
| Catalog | refreshTable(tableName) | Invalidates and refreshes all the cached data and metadata of the given table. |
| Catalog | uncacheTable(tableName) | Removes the specified table from the in-memory cache. |
| Window | partitionBy(*cols) | Creates a WindowSpec with the partitioning defined. |
| Window | orderBy(*cols) | Creates a WindowSpec with the ordering defined. |
| Window | rangeBetween(start, end) | Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive). |
| Window | rowsBetween(start, end) | Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive). |
| Window | unboundedPreceding | Value representing the first row in the partition, for use in frame boundary definition. |
| Window | unboundedFollowing | Value representing the last row in the partition, for use in frame boundary definition. |
| Window | currentRow | Value representing the current row, for use in frame boundary definition. |
| WindowSpec | partitionBy(*cols) | Defines the partitioning columns in a WindowSpec. |
| WindowSpec | orderBy(*cols) | Defines the ordering columns in a WindowSpec. |
| WindowSpec | rangeBetween(start, end) | Defines the frame boundaries, from start (inclusive) to end (inclusive). |
| WindowSpec | rowsBetween(start, end) | Defines the frame boundaries, from start (inclusive) to end (inclusive). |
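
Because every call in this table behaves identically to native PySpark, chains built only from these APIs can be ported unchanged. A small sketch, assuming two hypothetical DataFrames named customers and orders that share a customer_id column:

```python
# join(), dropDuplicates(), fillna(), select(), and show() are all
# listed above as fully compatible, so results match native PySpark.
result = (
    customers.join(orders, on="customer_id", how="left")
    .dropDuplicates(["customer_id", "order_id"])
    .fillna(0, subset=["amount"])
    .select("customer_id", "order_id", "amount")
)
result.show(10)
```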

High compatibility APIs

APIs with high compatibility work correctly but might have minor differences in error messages, output format, or edge cases.

| Group | Method | Description |
| --- | --- | --- |
| DataFrame | agg(*exprs) | Aggregates on the entire DataFrame without groups (shorthand for df.groupBy().agg()). |
| DataFrame | colRegex(colName) | Selects column based on the column name specified as a regex and returns it as a Column. |
| DataFrame | corr(col1, col2[, method]) | Calculates the correlation of two columns as a float. |
| DataFrame | cov(col1, col2) | Calculates the sample covariance for the given columns as a float. |
| DataFrame | crosstab(col1, col2) | Computes a pair-wise frequency table of the given columns. |
| DataFrame | cube(*cols) | Creates a multi-dimensional cube for the current DataFrame using the specified columns, for running aggregations. |
| DataFrame | describe(*cols) | Computes basic statistics for numeric and string columns. |
| DataFrame | distinct() | Returns a new DataFrame containing the distinct rows in this DataFrame. |
| DataFrame | drop(*cols) | Returns a new DataFrame without the specified columns. |
| DataFrame | exceptAll(other) | Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. |
| DataFrame | groupBy(*cols) | Groups the DataFrame using the specified columns, returning a GroupedData object for running aggregations. |
| DataFrame | groupby(*cols) | Alias for groupBy. |
| DataFrame | intersect(other) | Returns a new DataFrame containing rows only in both this DataFrame and another DataFrame. |
| DataFrame | intersectAll(other) | Returns a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates. |
| DataFrame | isLocal() | Returns True if collect and take can be run locally. |
| DataFrame | mapInPandas(func, schema) | Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas.DataFrame. |
| DataFrame | orderBy(*cols, **kwargs) | Returns a new DataFrame sorted by the specified columns. |
| DataFrame | rollup(*cols) | Creates a multi-dimensional rollup for the current DataFrame using the specified columns. |
| DataFrame | sort(*cols, **kwargs) | Returns a new DataFrame sorted by the specified columns. |
| DataFrame | union(other) | Returns a new DataFrame containing the union of rows in this and another DataFrame. |
| DataFrame | unionByName(other[, allowMissingColumns]) | Returns a new DataFrame containing the union of rows, resolving columns by name. |
| DataFrame | withColumn(colName, col) | Returns a new DataFrame by adding a column or replacing the existing column that has the same name. |
| Column | alias(*alias, **kwargs) | Returns this column aliased with a new name or names. |
| Column | asc_nulls_first() | Returns a sort expression based on ascending order with null values returned before non-null values. |
| Column | asc_nulls_last() | Returns a sort expression based on ascending order with null values returned after non-null values. |
| Column | astype(dataType) | Casts the column into the specified type. Alias for cast. |
| Column | bitwiseAND(other) | Computes bitwise AND of this expression with another expression. |
| Column | bitwiseOR(other) | Computes bitwise OR of this expression with another expression. |
| Column | bitwiseXOR(other) | Computes bitwise XOR of this expression with another expression. |
| Column | cast(dataType) | Casts the column into the specified type. |
| Column | desc_nulls_first() | Returns a sort expression based on descending order with null values returned before non-null values. |
| Column | desc_nulls_last() | Returns a sort expression based on descending order with null values returned after non-null values. |
| Column | endswith(other) | String ends with. |
| Column | isNotNull() | Returns True if the current expression is not null. |
| SparkSession | createDataFrame(data[, schema, …]) | Creates a DataFrame from an RDD, a list, or a pandas.DataFrame. |
| DataFrameReader | csv(path[, schema, sep, …]) | Loads a CSV file and returns the result as a DataFrame. |
| Catalog | currentCatalog() | Returns the current default catalog. |
| Catalog | listCatalogs([pattern]) | Returns a list of catalogs available. |
| Catalog | listColumns(tableName[, dbName]) | Returns a list of columns for the given table/view. |
| Catalog | recoverPartitions(tableName) | Recovers all the partitions of the given table. |
| Catalog | setCurrentCatalog(catalogName) | Sets the current default catalog. |

Note

  • DataFrame: orderBy / sort: Column ordering is inferred from the last DataFrame in the chain.

  • DataFrame: union / unionByName: Type widening behavior might differ slightly.

  • DataFrame: describe: Statistical output format might vary.

  • Column: cast: Some invalid casts return NULL in Spark but error in Snowpark.

  • Column: alias: Struct field display format might differ.

  • SparkSession: createDataFrame: Schema inference might produce different types (such as NUMBER(38,0) vs LONG).

  • Catalog: listColumns: Column names are uppercase, types are Snowflake-specific. Error messages might differ in format.
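
The createDataFrame and cast notes are straightforward to observe directly. A sketch, assuming a live session named spark:

```python
# Schema inference can surface Snowflake types (such as NUMBER(38,0))
# where native Spark would infer LONG; printSchema() makes this visible.
df = spark.createDataFrame([(1, "a"), (2, "b")], schema=["id", "label"])
df.printSchema()

# If downstream code depends on an exact type, cast explicitly -- but note
# that some casts that return NULL in Spark raise errors in Snowpark.
df = df.withColumn("id", df.id.cast("long"))
```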

Partial compatibility APIs

APIs with partial compatibility are functional but have notable limitations. Behavior might differ from PySpark in specific scenarios.

| Group | Method | Description |
| --- | --- | --- |
| DataFrame | alias(alias) | Returns a new DataFrame with an alias set. |
| DataFrame | approxQuantile(col, probabilities, relativeError) | Calculates the approximate quantiles of numerical columns of a DataFrame. |
| DataFrame | createGlobalTempView(name) | Creates a global temporary view with this DataFrame. |
| DataFrame | createOrReplaceGlobalTempView(name) | Creates or replaces a global temporary view using this DataFrame. |
| DataFrame | createOrReplaceTempView(name) | Creates or replaces a local temporary view with this DataFrame. |
| DataFrame | createTempView(name) | Creates a local temporary view with this DataFrame. |
| DataFrame | explain([extended, mode]) | Prints the (logical and physical) plans to the console for debugging purposes. |
| DataFrame | filter(condition) | Filters rows using the given condition. |
| DataFrame | freqItems(cols[, support]) | Finds all items which have a frequency greater than or equal to a fraction of the total number of rows. |
| DataFrame | hint(name, *parameters) | Specifies some hint on the current DataFrame. |
| DataFrame | inputFiles() | Returns a best-effort snapshot of the files that compose this DataFrame. |
| DataFrame | printSchema([level]) | Prints out the schema in tree format. |
| DataFrame | randomSplit(weights[, seed]) | Randomly splits this DataFrame into separate DataFrames with the given weights. |
| DataFrame | repartition(numPartitions, *cols) | Returns a new DataFrame partitioned by the given partitioning expressions. |
| DataFrame | sameSemantics(other) | Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results. |
| DataFrame | sample([withReplacement, …]) | Returns a sampled subset of this DataFrame. |
| DataFrame | sampleBy(col, fractions[, seed]) | Returns a stratified sample without replacement based on the fraction given on each stratum. |
| DataFrame | selectExpr(*expr) | Projects a set of SQL expressions and returns a new DataFrame. |
| DataFrame | semanticHash() | Returns a hash code of the logical query plan against this DataFrame. |
| DataFrame | sortWithinPartitions(*cols, …) | Returns a new DataFrame with each partition sorted by the specified columns. |
| DataFrame | subtract(other) | Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame. |
| DataFrame | summary(*statistics) | Computes specified statistics for numeric and string columns. |
| DataFrame | transform(func, *args, **kwargs) | Returns a new DataFrame by applying a chain of custom transformations. |
| DataFrame | withColumns(*colsMap) | Returns a new DataFrame by adding multiple columns or replacing existing columns that have the same names. |
| DataFrame | withMetadata(columnName, metadata) | Returns a new DataFrame by updating an existing column with metadata. |
| Column | dropFields(*fieldNames) | Returns a new Column with the specified nested fields dropped. |
| Column | ilike(other) | SQL ILIKE expression (case-insensitive LIKE). |
| Column | over(window) | Defines a windowing column. |
| Column | rlike(other) | SQL RLIKE expression (regex match). |
| Column | withField(fieldName, col) | Returns a new Column with a field added or replaced in a StructType. |
| SparkSession | addArtifact(*path) | Adds an artifact to the session. |
| SparkSession | addArtifacts(*path) | Adds artifacts to the session. |
| SparkSession | addTag(tag) | Adds a tag to be assigned to all operations started by this thread in this session. |
| SparkSession | clearTags() | Clears the current thread’s operation tags. |
| SparkSession | getTags() | Returns the operation tags that are currently set to be assigned to all operations started by this thread. |
| SparkSession | interruptAll() | Interrupts all operations of this session currently running on the connected server. |
| SparkSession | interruptOperation(op_id) | Interrupts an operation of this session with the given operation ID. |
| SparkSession | interruptTag(tag) | Interrupts all operations of this session with the given operation tag. |
| SparkSession | removeTag(tag) | Removes a tag previously added to be assigned to all operations started by this thread in this session. |
| GroupedData | apply(udf) | Maps each group of the current DataFrame using a pandas UDF. |
| GroupedData | avg(*cols) | Computes average values for each numeric column for each group. |
| GroupedData | sum(*cols) | Computes the sum for each numeric column for each group. |
| DataFrameReader | json(path[, schema, …]) | Loads JSON files and returns the result as a DataFrame. |
| DataFrameReader | load([path, format, schema, …]) | Loads data from a data source and returns it as a DataFrame. |
| DataFrameReader | parquet(*paths, **options) | Loads Parquet files, returning the result as a DataFrame. |
| DataFrameReader | jdbc(url, table[, column, …]) | Constructs a DataFrame representing the database table accessible via JDBC URL. |
| DataFrameWriter | csv(path, mode, …) | Saves the content of the DataFrame in CSV format at the specified path. |
| DataFrameWriter | json(path, mode, …) | Saves the content of the DataFrame in JSON format at the specified path. |
| DataFrameWriter | options(**options) | Adds output options for the underlying data source. |
| DataFrameWriter | parquet(path, mode, …) | Saves the content of the DataFrame in Parquet format at the specified path. |
| DataFrameWriterV2 | append() | Appends the contents of the DataFrame to the output table. |
| DataFrameWriterV2 | create() | Creates a new table from the contents of the DataFrame. |
| DataFrameWriterV2 | createOrReplace() | Creates a new table or replaces an existing table with the contents of the DataFrame. |
| DataFrameWriterV2 | option(key, value) | Adds a write option. |
| DataFrameWriterV2 | options(**options) | Adds write options. |
| DataFrameWriterV2 | partitionedBy(col, *cols) | Specifies a partitioning column. |
| DataFrameWriterV2 | tableProperty(property, value) | Adds a table property. |
| DataFrameWriterV2 | using(provider) | Specifies a provider for the underlying output data source. |

Note

  • DataFrame: explain: Query plan format differs from Spark.

  • DataFrame: repartition: Partition count might not be exact.

  • DataFrame: sample: Random sampling implementation differs.

  • DataFrame: createTempView: View lifecycle might differ.

  • Column: over: Window frame specifications might have subtle differences.

  • Column: rlike: Regex syntax follows Snowflake conventions.

  • SparkSession: Tags are mapped to Snowflake query tags. Interrupt operations use Snowflake query IDs instead of operation IDs.

  • DataFrameReader: File paths use Snowflake stages or cloud storage (S3, GCS, Azure). Schema inference might differ from native Spark. Some format-specific options might not be supported.

  • DataFrameWriter: Writes go to Snowflake stages or cloud storage. Partitioning behavior might differ.
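
In practice, the reader and writer notes mostly mean pointing paths at Snowflake stages or cloud storage instead of HDFS. A sketch, where @my_stage is a hypothetical stage and spark is a live session:

```python
# Load Parquet from a stage path; schema inference and some
# format-specific options might differ from native Spark.
df = spark.read.parquet("@my_stage/input/")

# Write CSV back to the stage; partitioning behavior might differ.
df.write.mode("overwrite").csv("@my_stage/output/")
```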

Unsupported APIs

The following APIs are not currently supported in Snowpark Connect for Spark.

| Group | Method | Description |
| --- | --- | --- |
| DataFrame | dropDuplicatesWithinWatermark([subset]) | Returns a new DataFrame with duplicate rows removed within watermark. Streaming only. |
| DataFrame | observe(observation, *exprs) | Defines (named) metrics to observe on the DataFrame. |
| DataFrame | pandas_api([index_col]) | Converts the existing DataFrame into a pandas-on-Spark DataFrame. |
| DataFrame | registerTempTable(name) | Registers this DataFrame as a temporary table. Deprecated since 2.0; use createOrReplaceTempView instead. |
| DataFrame | to_pandas_on_spark([index_col]) | Converts the existing DataFrame into a pandas-on-Spark DataFrame. Alias for pandas_api. |
| DataFrame | withWatermark(eventTime, delayThreshold) | Defines an event time watermark for this DataFrame. |
| SparkSession | copyFromLocalToFs(local_path, dest_path) | Copies a local file to a remote filesystem. |
| SparkSession | stop() | Stops the underlying SparkContext. |
| GroupedData | applyInPandasWithState(func, …) | Applies a function to each group of data using pandas with state. |
| GroupedData | cogroup(other) | Cogroups this group with another group. |
| DataFrameReader | orc(path, …) | Loads ORC files, returning the result as a DataFrame. |
| DataFrameWriter | bucketBy(numBuckets, col, *cols) | Buckets the output by the given columns. |
| DataFrameWriter | insertInto(tableName[, overwrite]) | Inserts the content of the DataFrame to the specified table. |
| DataFrameWriter | jdbc(url, table[, mode, properties]) | Saves the content of the DataFrame to an external database table via JDBC. |
| DataFrameWriter | orc(path, mode, …) | Saves the content of the DataFrame in ORC format at the specified path. |
| DataFrameWriter | sortBy(col, *cols) | Specifies sorting columns for each output partition. |
| Catalog | createExternalTable(tableName, …) | Creates a table based on the dataset in a data source. |
| Catalog | createTable(tableName, …) | Creates a table based on the dataset in a data source. |
| Catalog | functionExists(functionName[, dbName]) | Checks if the function with the specified name exists. |
| Catalog | getFunction(functionName) | Gets the function with the specified name. |
| Catalog | listFunctions([dbName]) | Returns a list of functions registered in the specified database. |
| Catalog | registerFunction(name, f, returnType) | Registers a Python function as a UDF. |
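
Several unsupported entries have supported counterparts elsewhere on this page. For example, registerTempTable (deprecated since Spark 2.0) maps directly onto createOrReplaceTempView, which is listed under partial compatibility. A sketch, assuming a DataFrame df with name and age columns:

```python
# Unsupported in Snowpark Connect for Spark:
# df.registerTempTable("people")

# Supported alternative (partial compatibility; view lifecycle
# might differ, per the earlier note):
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```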