DataFrame support for Snowpark Connect for Spark

Snowpark Connect for Spark provides compatibility with the PySpark 3.5.3 Spark Connect DataFrame API, allowing you to run Spark workloads on Snowflake. This page details which APIs are supported and their compatibility levels. The DataFrame API is shared across PySpark, Java, and Scala clients.
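
For orientation, here is a minimal sketch of the kind of workload this page covers. Session setup details depend on your Snowpark Connect for Spark deployment; the `sc://` endpoint below is a placeholder, and everything after session creation is ordinary PySpark DataFrame code.

```python
from pyspark.sql import SparkSession

# Placeholder endpoint -- substitute the connection details for your
# Snowpark Connect for Spark environment.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], schema=["name", "age"])

# select(), where(), and show() all appear in the compatibility tables below.
df.select("name", "age").where(df.age > 30).show()
```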

Compatibility level definitions

Full compatibility

APIs with full compatibility behave identically to native PySpark. You can use these APIs with confidence that results will match exactly.

High compatibility

APIs with high compatibility work correctly but might have minor differences:

  • Error message formatting might differ.

  • Output display format might vary (such as decimal precision, column name casing).

  • Edge cases might produce slightly different results.

Partial compatibility

APIs with partial compatibility are functional but have notable limitations:

  • Only a subset of functionality might be available.

  • Behavior might differ from PySpark in specific scenarios.

  • Additional configuration might be required.

  • Performance characteristics might differ.

Unsupported

APIs that are not currently implemented or cannot be supported on Snowflake.

Full compatibility APIs

| Group | Method | Description |
| --- | --- | --- |
| DataFrame | cache() | Persists the DataFrame with the default storage level (MEMORY_AND_DISK). |
| DataFrame | coalesce(numPartitions) | Returns a new DataFrame that has exactly numPartitions partitions. |
| DataFrame | collect() | Returns all the records as a list of Row. |
| DataFrame | count() | Returns the number of rows in this DataFrame. |
| DataFrame | crossJoin(other) | Returns the cartesian product with another DataFrame. |
| DataFrame | dropDuplicates([subset]) | Returns a new DataFrame with duplicate rows removed, optionally considering only a subset of columns. |
| DataFrame | drop_duplicates([subset]) | Alias for dropDuplicates. |
| DataFrame | dropna([how, thresh, subset]) | Returns a new DataFrame omitting rows with null values. |
| DataFrame | fillna(value[, subset]) | Replaces null values with the specified value. |
| DataFrame | first() | Returns the first row as a Row. |
| DataFrame | head([n]) | Returns the first n rows. |
| DataFrame | isEmpty() | Returns True if this DataFrame is empty. |
| DataFrame | join(other[, on, how]) | Joins with another DataFrame, using the given join expression. |
| DataFrame | limit(num) | Limits the result count to the number specified. |
| DataFrame | melt(ids, values, …) | Unpivots a DataFrame from wide format to long format. Alias for unpivot. |
| DataFrame | offset(num) | Returns a new DataFrame by skipping the first n rows. |
| DataFrame | persist([storageLevel]) | Sets the storage level to persist the contents of the DataFrame across operations. |
| DataFrame | repartitionByRange(numPartitions, …) | Returns a new DataFrame partitioned by the given partitioning expressions into numPartitions partitions, using range partitioning. |
| DataFrame | replace(to_replace[, value, subset]) | Returns a new DataFrame replacing a value with another value. |
| DataFrame | select(*cols) | Projects a set of expressions and returns a new DataFrame. |
| DataFrame | show([n, truncate, vertical]) | Prints the first n rows to the console. |
| DataFrame | tail(num) | Returns the last num rows as a list of Row. |
| DataFrame | take(num) | Returns the first num rows as a list of Row. |
| DataFrame | toDF(*cols) | Returns a new DataFrame with new column names. |
| DataFrame | toLocalIterator([prefetchPartitions]) | Returns an iterator that contains all of the rows in this DataFrame. |
| DataFrame | toPandas() | Returns the contents of this DataFrame as a pandas.DataFrame. |
| DataFrame | unionAll(other) | Returns a new DataFrame containing the union of rows in this and another DataFrame. Alias for union. |
| DataFrame | unpersist([blocking]) | Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk. |
| DataFrame | unpivot(ids, values, …) | Unpivots a DataFrame from wide format to long format, optionally leaving identifier columns set. |
| DataFrame | where(condition) | Alias for filter. |
| DataFrame | withColumnsRenamed(colsMap) | Returns a new DataFrame by renaming multiple columns. This is a no-op if the schema doesn’t contain the given column names. |
| Column | asc() | Returns a sort expression based on ascending order of the column. |
| Column | between(lowerBound, upperBound) | Checks if values of this expression are between the given columns. |
| Column | contains(other) | Contains the other element. |
| Column | desc() | Returns a sort expression based on descending order of the column. |
| Column | eqNullSafe(other) | Equality test that is safe for null values. |
| Column | getItem(key) | Gets an item at position key out of a list, or an item by key out of a dict. |
| Column | isNull() | Returns True if the current expression is null. |
| Column | isin(*cols) | Returns a boolean Column based on a match against the given values. |
| Column | like(other) | SQL LIKE expression. |
| Column | otherwise(value) | Evaluates a list of conditions and returns one of multiple possible result expressions. |
| Column | startswith(other) | String starts with. |
| Column | substr(startPos, length) | Returns a Column which is a substring of the column. |
| Column | when(condition, value) | Evaluates a list of conditions and returns one of multiple possible result expressions. |
| SparkSession | range(start[, end, step, numPartitions]) | Creates a DataFrame with a single column named id. |
| SparkSession | sql(sqlQuery, args, **kwargs) | Returns a DataFrame representing the result of the given query. |
| SparkSession | table(tableName) | Returns the specified table as a DataFrame. |
| GroupedData | agg(*exprs) | Computes aggregates and returns the result as a DataFrame. |
| GroupedData | mean(*cols) | Computes average values for each numeric column for each group. |
| GroupedData | pivot(pivot_col[, values]) | Pivots a column of the current DataFrame and performs the specified aggregation. |
| DataFrameReader | table(tableName) | Returns the specified table as a DataFrame. |
| DataFrameWriter | mode(saveMode) | Specifies the behavior when data or table already exists. |
| DataFrameWriter | saveAsTable(name[, format, mode, …]) | Saves the content of the DataFrame as the specified table. |
| DataFrameWriter | text(path[, compression, lineSep]) | Saves the content of the DataFrame in a text file at the specified path. |
| DataFrameWriterV2 | replace() | Replaces data in the existing table. |
| Catalog | cacheTable(tableName[, storageLevel]) | Caches the specified table in memory. |
| Catalog | clearCache() | Removes all cached tables from the in-memory cache. |
| Catalog | dropGlobalTempView(viewName) | Drops the global temporary view with the given view name. |
| Catalog | dropTempView(viewName) | Drops the local temporary view with the given view name. |
| Catalog | isCached(tableName) | Returns True if the table is currently cached in memory. |
| Catalog | refreshByPath(path) | Invalidates and refreshes all the cached data for any DataFrame that contains the given data source path. |
| Catalog | refreshTable(tableName) | Invalidates and refreshes all the cached data and metadata of the given table. |
| Catalog | uncacheTable(tableName) | Removes the specified table from the in-memory cache. |
| Window | partitionBy(*cols) | Creates a WindowSpec with the partitioning defined. |
| Window | orderBy(*cols) | Creates a WindowSpec with the ordering defined. |
| Window | rangeBetween(start, end) | Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive). |
| Window | rowsBetween(start, end) | Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive). |
| Window | unboundedPreceding | Value representing the first row in the partition, for use in frame boundary definition. |
| Window | unboundedFollowing | Value representing the last row in the partition, for use in frame boundary definition. |
| Window | currentRow | Value representing the current row, for use in frame boundary definition. |
| WindowSpec | partitionBy(*cols) | Defines the partitioning columns in a WindowSpec. |
| WindowSpec | orderBy(*cols) | Defines the ordering columns in a WindowSpec. |
| WindowSpec | rangeBetween(start, end) | Defines the frame boundaries, from start (inclusive) to end (inclusive). |
| WindowSpec | rowsBetween(start, end) | Defines the frame boundaries, from start (inclusive) to end (inclusive). |
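
Because every call in this table behaves identically to native PySpark, chains built only from these APIs can be ported unchanged. A small sketch, assuming two hypothetical DataFrames named customers and orders that share a customer_id column:

```python
# join(), dropDuplicates(), fillna(), select(), and show() are all
# listed above as fully compatible, so results match native PySpark.
result = (
    customers.join(orders, on="customer_id", how="left")
    .dropDuplicates(["customer_id", "order_id"])
    .fillna(0, subset=["amount"])
    .select("customer_id", "order_id", "amount")
)
result.show(10)
```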

High compatibility APIs

APIs with high compatibility work correctly but might have minor differences in error messages, output format, or edge cases.

| Group | Method | Description |
| --- | --- | --- |
| DataFrame | agg(*exprs) | Aggregates on the entire DataFrame without groups (shorthand for df.groupBy().agg()). |
| DataFrame | colRegex(colName) | Selects column based on the column name specified as a regex and returns it as a Column. |
| DataFrame | corr(col1, col2[, method]) | Calculates the correlation of two columns as a float. |
| DataFrame | cov(col1, col2) | Calculates the sample covariance for the given columns as a float. |
| DataFrame | crosstab(col1, col2) | Computes a pair-wise frequency table of the given columns. |
| DataFrame | cube(*cols) | Creates a multi-dimensional cube for the current DataFrame using the specified columns, for running aggregations. |
| DataFrame | describe(*cols) | Computes basic statistics for numeric and string columns. |
| DataFrame | distinct() | Returns a new DataFrame containing the distinct rows in this DataFrame. |
| DataFrame | drop(*cols) | Returns a new DataFrame without the specified columns. |
| DataFrame | exceptAll(other) | Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. |
| DataFrame | groupBy(*cols) | Groups the DataFrame using the specified columns, returning a GroupedData object for running aggregations. |
| DataFrame | groupby(*cols) | Alias for groupBy. |
| DataFrame | intersect(other) | Returns a new DataFrame containing rows only in both this DataFrame and another DataFrame. |
| DataFrame | intersectAll(other) | Returns a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates. |
| DataFrame | isLocal() | Returns True if collect and take can be run locally. |
| DataFrame | mapInPandas(func, schema) | Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas.DataFrame. |
| DataFrame | orderBy(*cols, **kwargs) | Returns a new DataFrame sorted by the specified columns. |
| DataFrame | rollup(*cols) | Creates a multi-dimensional rollup for the current DataFrame using the specified columns. |
| DataFrame | sort(*cols, **kwargs) | Returns a new DataFrame sorted by the specified columns. |
| DataFrame | union(other) | Returns a new DataFrame containing the union of rows in this and another DataFrame. |
| DataFrame | unionByName(other[, allowMissingColumns]) | Returns a new DataFrame containing the union of rows, resolving columns by name. |
| DataFrame | withColumn(colName, col) | Returns a new DataFrame by adding a column or replacing the existing column that has the same name. |
| Column | alias(*alias, **kwargs) | Returns this column aliased with a new name or names. |
| Column | asc_nulls_first() | Returns a sort expression based on ascending order with null values returned before non-null values. |
| Column | asc_nulls_last() | Returns a sort expression based on ascending order with null values returned after non-null values. |
| Column | astype(dataType) | Casts the column into the specified type. Alias for cast. |
| Column | bitwiseAND(other) | Computes bitwise AND of this expression with another expression. |
| Column | bitwiseOR(other) | Computes bitwise OR of this expression with another expression. |
| Column | bitwiseXOR(other) | Computes bitwise XOR of this expression with another expression. |
| Column | cast(dataType) | Casts the column into the specified type. |
| Column | desc_nulls_first() | Returns a sort expression based on descending order with null values returned before non-null values. |
| Column | desc_nulls_last() | Returns a sort expression based on descending order with null values returned after non-null values. |
| Column | endswith(other) | String ends with. |
| Column | isNotNull() | Returns True if the current expression is not null. |
| SparkSession | createDataFrame(data[, schema, …]) | Creates a DataFrame from an RDD, a list, or a pandas.DataFrame. |
| DataFrameReader | csv(path[, schema, sep, …]) | Loads a CSV file and returns the result as a DataFrame. |
| Catalog | currentCatalog() | Returns the current default catalog. |
| Catalog | listCatalogs([pattern]) | Returns a list of catalogs available. |
| Catalog | listColumns(tableName[, dbName]) | Returns a list of columns for the given table/view. |
| Catalog | recoverPartitions(tableName) | Recovers all the partitions of the given table. |
| Catalog | setCurrentCatalog(catalogName) | Sets the current default catalog. |

Note

  • DataFrame: orderBy / sort: Column ordering is inferred from the last DataFrame in the chain.

  • DataFrame: union / unionByName: Type widening behavior might differ slightly.

  • DataFrame: describe: Statistical output format might vary.

  • Column: cast: Some invalid casts return NULL in Spark but error in Snowpark.

  • Column: alias: Struct field display format might differ.

  • SparkSession: createDataFrame: Schema inference might produce different types (such as NUMBER(38,0) vs LONG).

  • Catalog: listColumns: Column names are uppercase, types are Snowflake-specific. Error messages might differ in format.
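
The createDataFrame and cast notes are straightforward to observe directly. A sketch, assuming a live session named spark:

```python
# Schema inference can surface Snowflake types (such as NUMBER(38,0))
# where native Spark would infer LONG; printSchema() makes this visible.
df = spark.createDataFrame([(1, "a"), (2, "b")], schema=["id", "label"])
df.printSchema()

# If downstream code depends on an exact type, cast explicitly -- but note
# that some casts that return NULL in Spark raise errors in Snowpark.
df = df.withColumn("id", df.id.cast("long"))
```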

Partial compatibility APIs

APIs with partial compatibility are functional but have notable limitations. Behavior might differ from PySpark in specific scenarios.

| Group | Method | Description |
| --- | --- | --- |
| DataFrame | alias(alias) | Returns a new DataFrame with an alias set. |
| DataFrame | approxQuantile(col, probabilities, relativeError) | Calculates the approximate quantiles of numerical columns of a DataFrame. |
| DataFrame | createGlobalTempView(name) | Creates a global temporary view with this DataFrame. |
| DataFrame | createOrReplaceGlobalTempView(name) | Creates or replaces a global temporary view using this DataFrame. |
| DataFrame | createOrReplaceTempView(name) | Creates or replaces a local temporary view with this DataFrame. |
| DataFrame | createTempView(name) | Creates a local temporary view with this DataFrame. |
| DataFrame | explain([extended, mode]) | Prints the (logical and physical) plans to the console for debugging purposes. |
| DataFrame | filter(condition) | Filters rows using the given condition. |
| DataFrame | freqItems(cols[, support]) | Finds all items which have a frequency greater than or equal to a fraction of the total number of rows. |
| DataFrame | hint(name, *parameters) | Specifies some hint on the current DataFrame. |
| DataFrame | inputFiles() | Returns a best-effort snapshot of the files that compose this DataFrame. |
| DataFrame | printSchema([level]) | Prints out the schema in tree format. |
| DataFrame | randomSplit(weights[, seed]) | Randomly splits this DataFrame into separate DataFrames with the given weights. |
| DataFrame | repartition(numPartitions, *cols) | Returns a new DataFrame partitioned by the given partitioning expressions. |
| DataFrame | sameSemantics(other) | Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results. |
| DataFrame | sample([withReplacement, …]) | Returns a sampled subset of this DataFrame. |
| DataFrame | sampleBy(col, fractions[, seed]) | Returns a stratified sample without replacement based on the fraction given on each stratum. |
| DataFrame | selectExpr(*expr) | Projects a set of SQL expressions and returns a new DataFrame. |
| DataFrame | semanticHash() | Returns a hash code of the logical query plan against this DataFrame. |
| DataFrame | sortWithinPartitions(*cols, …) | Returns a new DataFrame with each partition sorted by the specified columns. |
| DataFrame | subtract(other) | Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame. |
| DataFrame | summary(*statistics) | Computes specified statistics for numeric and string columns. |
| DataFrame | transform(func, *args, **kwargs) | Returns a new DataFrame by applying a chain of custom transformations. |
| DataFrame | withColumns(*colsMap) | Returns a new DataFrame by adding multiple columns or replacing existing columns that have the same names. |
| DataFrame | withMetadata(columnName, metadata) | Returns a new DataFrame by updating an existing column with metadata. |
| Column | dropFields(*fieldNames) | Returns a new Column with the specified nested fields dropped. |
| Column | ilike(other) | SQL ILIKE expression (case-insensitive LIKE). |
| Column | over(window) | Defines a windowing column. |
| Column | rlike(other) | SQL RLIKE expression (regex match). |
| Column | withField(fieldName, col) | Returns a new Column with a field added or replaced in a StructType. |
| SparkSession | addArtifact(*path) | Adds an artifact to the session. |
| SparkSession | addArtifacts(*path) | Adds artifacts to the session. |
| SparkSession | addTag(tag) | Adds a tag to be assigned to all operations started by this thread in this session. |
| SparkSession | clearTags() | Clears the current thread’s operation tags. |
| SparkSession | getTags() | Returns the operation tags that are currently set to be assigned to all operations started by this thread. |
| SparkSession | interruptAll() | Interrupts all operations of this session currently running on the connected server. |
| SparkSession | interruptOperation(op_id) | Interrupts an operation of this session with the given operation ID. |
| SparkSession | interruptTag(tag) | Interrupts all operations of this session with the given operation tag. |
| SparkSession | removeTag(tag) | Removes a tag previously added to be assigned to all operations started by this thread in this session. |
| GroupedData | apply(udf) | Maps each group of the current DataFrame using a pandas UDF. |
| GroupedData | avg(*cols) | Computes average values for each numeric column for each group. |
| GroupedData | sum(*cols) | Computes the sum for each numeric column for each group. |
| DataFrameReader | json(path[, schema, …]) | Loads JSON files and returns the result as a DataFrame. |
| DataFrameReader | load([path, format, schema, …]) | Loads data from a data source and returns it as a DataFrame. |
| DataFrameReader | parquet(*paths, **options) | Loads Parquet files, returning the result as a DataFrame. |
| DataFrameReader | jdbc(url, table[, column, …]) | Constructs a DataFrame representing the database table accessible via JDBC URL. |
| DataFrameWriter | csv(path, mode, …) | Saves the content of the DataFrame in CSV format at the specified path. |
| DataFrameWriter | json(path, mode, …) | Saves the content of the DataFrame in JSON format at the specified path. |
| DataFrameWriter | options(**options) | Adds output options for the underlying data source. |
| DataFrameWriter | parquet(path, mode, …) | Saves the content of the DataFrame in Parquet format at the specified path. |
| DataFrameWriterV2 | append() | Appends the contents of the DataFrame to the output table. |
| DataFrameWriterV2 | create() | Creates a new table from the contents of the DataFrame. |
| DataFrameWriterV2 | createOrReplace() | Creates a new table or replaces an existing table with the contents of the DataFrame. |
| DataFrameWriterV2 | option(key, value) | Adds a write option. |
| DataFrameWriterV2 | options(**options) | Adds write options. |
| DataFrameWriterV2 | partitionedBy(col, *cols) | Specifies a partitioning column. |
| DataFrameWriterV2 | tableProperty(property, value) | Adds a table property. |
| DataFrameWriterV2 | using(provider) | Specifies a provider for the underlying output data source. |

Note

  • DataFrame: explain: Query plan format differs from Spark.

  • DataFrame: repartition: Partition count might not be exact.

  • DataFrame: sample: Random sampling implementation differs.

  • DataFrame: createTempView: View lifecycle might differ.

  • Column: over: Window frame specifications might have subtle differences.

  • Column: rlike: Regex syntax follows Snowflake conventions.

  • SparkSession: Tags are mapped to Snowflake query tags. Interrupt operations use Snowflake query IDs instead of operation IDs.

  • DataFrameReader: File paths use Snowflake stages or cloud storage (S3, GCS, Azure). Schema inference might differ from native Spark. Some format-specific options might not be supported.

  • DataFrameWriter: Writes go to Snowflake stages or cloud storage. Partitioning behavior might differ.
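
In practice, the reader and writer notes mostly mean pointing paths at Snowflake stages or cloud storage instead of HDFS. A sketch, where @my_stage is a hypothetical stage and spark is a live session:

```python
# Load Parquet from a stage path; schema inference and some
# format-specific options might differ from native Spark.
df = spark.read.parquet("@my_stage/input/")

# Write CSV back to the stage; partitioning behavior might differ.
df.write.mode("overwrite").csv("@my_stage/output/")
```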

Unsupported APIs

The following APIs are not currently supported in Snowpark Connect for Spark.

| Group | Method | Description |
| --- | --- | --- |
| DataFrame | dropDuplicatesWithinWatermark([subset]) | Returns a new DataFrame with duplicate rows removed within watermark. Streaming only. |
| DataFrame | observe(observation, *exprs) | Defines (named) metrics to observe on the DataFrame. |
| DataFrame | pandas_api([index_col]) | Converts the existing DataFrame into a pandas-on-Spark DataFrame. |
| DataFrame | registerTempTable(name) | Registers this DataFrame as a temporary table. Deprecated since 2.0; use createOrReplaceTempView instead. |
| DataFrame | to_pandas_on_spark([index_col]) | Converts the existing DataFrame into a pandas-on-Spark DataFrame. Alias for pandas_api. |
| DataFrame | withWatermark(eventTime, delayThreshold) | Defines an event time watermark for this DataFrame. |
| SparkSession | copyFromLocalToFs(local_path, dest_path) | Copies a local file to a remote filesystem. |
| SparkSession | stop() | Stops the underlying SparkContext. |
| GroupedData | applyInPandasWithState(func, …) | Applies a function to each group of data using pandas with state. |
| GroupedData | cogroup(other) | Cogroups this group with another group. |
| DataFrameReader | orc(path, …) | Loads ORC files, returning the result as a DataFrame. |
| DataFrameWriter | bucketBy(numBuckets, col, *cols) | Buckets the output by the given columns. |
| DataFrameWriter | insertInto(tableName[, overwrite]) | Inserts the content of the DataFrame to the specified table. |
| DataFrameWriter | jdbc(url, table[, mode, properties]) | Saves the content of the DataFrame to an external database table via JDBC. |
| DataFrameWriter | orc(path, mode, …) | Saves the content of the DataFrame in ORC format at the specified path. |
| DataFrameWriter | sortBy(col, *cols) | Specifies sorting columns for each output partition. |
| Catalog | createExternalTable(tableName, …) | Creates a table based on the dataset in a data source. |
| Catalog | createTable(tableName, …) | Creates a table based on the dataset in a data source. |
| Catalog | functionExists(functionName[, dbName]) | Checks if the function with the specified name exists. |
| Catalog | getFunction(functionName) | Gets the function with the specified name. |
| Catalog | listFunctions([dbName]) | Returns a list of functions registered in the specified database. |
| Catalog | registerFunction(name, f, returnType) | Registers a Python function as a UDF. |
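
Several unsupported entries have supported counterparts elsewhere on this page. For example, registerTempTable (deprecated since Spark 2.0) maps directly onto createOrReplaceTempView, which is listed under partial compatibility. A sketch, assuming a DataFrame df with name and age columns:

```python
# Unsupported in Snowpark Connect for Spark:
# df.registerTempTable("people")

# Supported alternative (partial compatibility; view lifecycle
# might differ, per the earlier note):
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```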