Snowpark Connect for Spark properties¶
Snowpark Connect for Spark supports custom configuration in a way that’s similar to standard Spark. You can modify configuration properties only
through the session’s set method by using a key-value pair. Note that Snowpark Connect for Spark recognizes only a limited set of properties that
influence execution. Any unsupported properties are silently ignored without raising an exception.
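For example, a minimal sketch of setting a property as a key-value pair (this assumes an active Snowpark Connect session created with the standard PySpark builder):

```python
from pyspark.sql import SparkSession

# Create or reuse the session; with Snowpark Connect this is a
# Spark Connect session backed by Snowflake.
spark = SparkSession.builder.getOrCreate()

# Set a supported property as a key-value pair.
spark.conf.set("spark.sql.ansi.enabled", "true")

# Read a property back.
print(spark.conf.get("spark.sql.ansi.enabled"))

# Unsupported properties are ignored silently; no exception is raised.
spark.conf.set("some.unsupported.property", "value")
```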
Supported Spark properties¶
Snowpark Connect for Spark supports a subset of Spark properties.
| Property Name | Default | Meaning | Since |
|---|---|---|---|
| `spark.app.name` | (none) | Application name; set as the Snowflake query tag. | 1.0.0 |
| `fs.s3a.access.key` | (none) | AWS access key ID for S3 authentication when reading or writing to S3 locations. | 1.0.0 |
| `fs.s3a.assumed.role.arn` | (none) | AWS IAM role ARN with S3 access when using role-based authentication. | 1.0.0 |
| `fs.s3a.secret.key` | (none) | AWS secret access key for S3 authentication when reading or writing to S3 locations. | 1.0.0 |
| `fs.s3a.server-side-encryption.key` | (none) | AWS KMS key ID for server-side encryption. | 1.0.0 |
| `fs.s3a.session.token` | (none) | AWS session token for temporary S3 credentials when using STS. | 1.0.0 |
| `spark.sql.ansi.enabled` | `false` | Enables ANSI SQL mode for stricter type checking and error handling. | 1.0.0 |
| `spark.sql.caseSensitive` | `false` | Controls case sensitivity for identifiers. | 1.0.0 |
| `spark.sql.crossJoin.enabled` | `true` | Enables or disables implicit cross joins. | 1.0.0 |
| `spark.sql.globalTempDatabase` | `global_temp` | Schema name for global temporary views; created automatically if it does not exist. | 1.0.0 |
| `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` | `false` | Legacy behavior for dataset grouping key naming. | 1.6.0 |
| `spark.sql.mapKeyDedupPolicy` | `EXCEPTION` | Controls behavior when duplicate keys are found in map creation. Values: `EXCEPTION`, `LAST_WIN`. | 1.0.0 |
| `spark.sql.parquet.outputTimestampType` | `INT96` | Controls Parquet output timestamp type. Supports `INT96`, `TIMESTAMP_MICROS`, and `TIMESTAMP_MILLIS`. | 1.7.0 |
| `spark.sql.repl.eagerEval.maxNumRows` | `20` | Maximum number of rows to display in REPL eager evaluation mode. | 1.0.0 |
| `spark.sql.repl.eagerEval.truncate` | `20` | Maximum width for column values in REPL eager evaluation display before truncation. | 1.0.0 |
| `spark.sql.session.localRelationCacheThreshold` | `67108864` | Byte threshold for caching local relations. Relations larger than this are cached to improve performance. | 1.0.0 |
| `spark.sql.session.timeZone` | (local time zone) | Session timezone used for timestamp operations. Synced with the Snowflake session. | 1.0.0 |
| `spark.sql.sources.default` | `parquet` | Default data source format for read/write operations when a format is not explicitly specified. | 1.0.0 |
| `spark.sql.timestampType` | `TIMESTAMP_LTZ` | Default timestamp type for timestamp operations. Values: `TIMESTAMP_LTZ`, `TIMESTAMP_NTZ`. | 1.0.0 |
Supported Snowpark Connect for Spark properties¶
Custom configuration properties specific to Snowpark Connect for Spark.
| Property Name | Default | Meaning | Since |
|---|---|---|---|
| `fs.azure.sas.<container>.<account>.blob.core.windows.net` | (none) | Azure SAS token for Blob Storage authentication. Used when reading or writing to Azure Blob Storage locations. | 1.0.0 |
| `fs.azure.sas.fixed.token.<account>.dfs.core.windows.net` | (none) | Azure SAS token for ADLS Gen2 (Data Lake Storage) authentication. Used when reading or writing to Azure Data Lake Storage Gen2 locations. | 1.0.0 |
| `mapreduce.fileoutputcommitter.marksuccessfuljobs` | `false` | When `true`, generates a `_SUCCESS` file after successful write operations for compatibility with Hadoop or Spark workflows. | 1.0.0 |
| `parquet.enable.summary-metadata` | `false` | Alternative config for generating Parquet summary metadata files. Either this or `spark.sql.parquet.enable.summary-metadata` enables the feature. | 1.4.0 |
| `snowflake.repartition.for.writes` | `false` | When `true`, forces `DataFrame.repartition(n)` to split output into `n` files during writes. Matches Spark behavior but adds overhead. | 1.0.0 |
| `snowpark.connect.cte.optimization_enabled` | `false` | When `true`, enables Common Table Expression (CTE) optimization in the Snowpark session for query performance. | 1.0.0 |
| `snowpark.connect.describe_cache_ttl_seconds` | `300` | Time-to-live in seconds for query cache entries. Reduces repeated schema lookups. | 1.0.0 |
| `snowpark.connect.enable_snowflake_extension_behavior` | `false` | When `true`, enables Snowflake-specific extensions that can differ from Spark behavior. | 1.0.0 |
| `snowpark.connect.handleIntegralOverflow` | `false` | When `true`, aligns integral overflow behavior with the Spark approach. | 1.7.0 |
| `snowpark.connect.iceberg.external_volume` | (none) | Snowflake external volume name for Iceberg table operations. | 1.0.0 |
| `snowpark.connect.integralTypesEmulation` | `client_default` | Controls conversion of decimal to integral types. Values: `client_default`, `enabled`, `disabled`. | 1.7.0 |
| `snowpark.connect.scala.version` | `2.12` | Controls the Scala version used (supports `2.12` or `2.13`). | 1.7.0 |
| `snowpark.connect.sql.partition.external_table_location` | (none) | External table location path for partitioned writes. | 1.4.0 |
| `snowpark.connect.temporary.views.create_in_snowflake` | `false` | When `true`, creates temporary views directly in Snowflake instead of managing them locally. | 1.0.1 |
| `snowpark.connect.udf.imports` | (none) | Comma-separated list of files or modules to import for UDF execution. Triggers UDF recreation when changed. Deprecated since 1.7.0. | 1.0.0 |
| `snowpark.connect.udf.python.imports` | (none) | Comma-separated list of files or modules to import for Python UDF execution. Triggers UDF recreation when changed. | 1.7.0 |
| `snowpark.connect.udf.java.imports` | (none) | Comma-separated list of files or modules to import for Java UDF execution. Triggers UDF recreation when changed. | 1.7.0 |
| `snowpark.connect.udf.packages` | (none) | Comma-separated list of Python packages to include when registering UDFs. | 1.0.0 |
| `snowpark.connect.udtf.compatibility_mode` | `false` | When `true`, enables Spark-compatible UDTF behavior for improved compatibility with Spark UDTF semantics. | 1.0.0 |
| `snowpark.connect.version` | (current version) | Read-only. Returns the current Snowpark Connect for Spark version. | 1.0.0 |
| `snowpark.connect.views.duplicate_column_names_handling_mode` | `rename` | How to handle duplicate column names in views. Values: `rename`, `fail`, `drop`. | 1.0.0 |
| `snowpark.connect.sql.emulatePartitionOverwritesForSnowflakeTables` | `false` | When `true`, allows partition overwrites on Snowflake tables in Spark SQL (`INSERT OVERWRITE <table> PARTITION (<partition spec>)`). | 1.12.3 |
| `snowpark.connect.artifact_repository` | (none) | Name of a Snowflake artifact repository for UDF/UDTF package resolution. When set, packages are resolved from the specified repository instead of Anaconda. | 1.14.0 |
| `snowpark.connect.udf.resource_constraint.architecture` | (none) | When set to `x86`, UDFs, UDTFs, and applyInPandas operations are created with an x86 architecture constraint. | 1.13.0 |
fs.azure.sas.<container>.<account>.blob.core.windows.net¶
Specifies the Azure SAS token for Blob Storage authentication. Used when reading or writing to Azure Blob Storage locations.
Default: (none)
Since: 1.0.0
fs.azure.sas.fixed.token.<account>.dfs.core.windows.net¶
Specifies the Azure SAS token for ADLS Gen2 (Data Lake Storage) authentication. Used when reading or writing to Azure Data Lake Storage Gen2 locations.
Default: (none)
Since: 1.0.0
mapreduce.fileoutputcommitter.marksuccessfuljobs¶
Specify true to generate a _SUCCESS file after successful write operations for compatibility with Hadoop or Spark workflows.
Default: false
Since: 1.0.0
parquet.enable.summary-metadata¶
Specifies an alternative configuration for generating Parquet summary metadata files. The feature is enabled when either this property or spark.sql.parquet.enable.summary-metadata is set to true.
Default: false
Since: 1.4.0
snowflake.repartition.for.writes¶
Specify true to force DataFrame.repartition(n) to split output into n files during writes. Matches Spark behavior but adds overhead.
Default: false
Since: 1.0.0
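A minimal sketch of the effect (the stage path is illustrative, and an active session is assumed):

```python
# Force DataFrame.repartition(n) to produce n output files on write.
spark.conf.set("snowflake.repartition.for.writes", "true")

df = spark.range(1000)

# With the property enabled, this write produces 4 output files.
df.repartition(4).write.mode("overwrite").parquet("@my_stage/out/")
```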
snowpark.connect.cte.optimization_enabled ¶
Specify true to enable Common Table Expression (CTE) optimization in the Snowpark session for query performance.
Default: false
Since: 1.0.0
snowpark.connect.describe_cache_ttl_seconds ¶
Specifies the time to live, in seconds, for query cache entries. Reduces repeated schema lookups.
Default: 300
Since: 1.0.0
snowpark.connect.enable_snowflake_extension_behavior ¶
Specify true to enable Snowflake-specific extensions that can differ from Spark behavior (such as hash support on MAP types and the md5 return type).
Default: false
Since: 1.0.0
Comments¶
When set to true, changes the behavior of certain operations:
- bit_get/getbit: explicit use of the Snowflake getbit function
- hash: explicit use of the Snowflake hash function
- md5: explicit use of the Snowflake md5 function
- Renaming table columns: allows for altering table columns
snowpark.connect.handleIntegralOverflow¶
Specify true to align integral overflow behavior with the Spark approach.
Default: false
Since: 1.7.0
snowpark.connect.iceberg.external_volume ¶
Specifies the Snowflake external volume name for Iceberg table operations.
Default: (none)
Since: 1.0.0
snowpark.connect.integralTypesEmulation¶
Specifies how to convert decimal to integral types. Values: client_default, enabled, disabled
Default: client_default
Since: 1.7.0
Comments¶
By default, Snowpark Connect for Spark treats all integral types as Long types. This is caused by the way numbers are represented in Snowflake. Integral types emulation allows for an exact mapping between Snowpark and Spark types when reading from datasources.
The default option client_default activates the emulation only when the script is executed from the Scala client. Integral types are mapped based on the following precisions:
| Precision | Spark Type |
|---|---|
| 19 | LongType |
| 10 | IntegerType |
| 5 | ShortType |
| 3 | ByteType |
| Other | DecimalType |
When other precisions are found, the final type is mapped to the DecimalType.
snowpark.connect.scala.version¶
Specifies the Scala version to use (supports 2.12 or 2.13).
Default: 2.12
Since: 1.7.0
snowpark.connect.sql.partition.external_table_location ¶
Specifies the external table location path for partitioned writes.
Default: (none)
Since: 1.4.0
Comments¶
To read only an exact subset of partitioned files from the provided directory, additional configuration is required. This feature is only available for files stored on external stages. To prune the read files, Snowpark Connect for Spark uses external tables.
This feature is enabled when the configuration snowpark.connect.sql.partition.external_table_location is set. It should contain existing database and schema names where external tables will be created.
Reading Parquet files stored on external stages creates an external table; for files on internal stages, no external table is created. Providing the schema reduces execution time by eliminating the cost of inferring it from the sources.
For best performance, filter according to the Snowflake External Tables filtering limitations.
Example¶
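A minimal sketch of such an example (the database, schema, stage path, and column names are illustrative; an active session is assumed):

```python
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Existing database and schema in which external tables may be created.
spark.conf.set(
    "snowpark.connect.sql.partition.external_table_location",
    "MY_DB.MY_SCHEMA",  # hypothetical database.schema
)

# Providing the schema up front avoids the cost of inferring it.
schema = StructType([
    StructField("id", LongType()),
    StructField("country", StringType()),
])

# Reading Parquet from an external stage creates an external table;
# the partition filter prunes which files are actually read.
df = (
    spark.read.schema(schema)
    .parquet("@my_ext_stage/events/")  # hypothetical external stage path
    .filter("country = 'US'")
)
```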
snowpark.connect.temporary.views.create_in_snowflake ¶
Specify true to create temporary views directly in Snowflake instead of managing them locally.
Default: false
Since: 1.0.1
snowpark.connect.udf.imports [DEPRECATED 1.7.0]¶
Specifies a comma-separated list of files and modules to import for UDF execution. When this value is changed, it triggers UDF recreation.
Default: (none)
Since: 1.0.0
snowpark.connect.udf.python.imports¶
Specifies a comma-separated list of files and modules to import for Python UDF execution. When this value is changed, it triggers UDF recreation.
Default: (none)
Since: 1.7.0
snowpark.connect.udf.java.imports¶
Specifies a comma-separated list of files and modules to import for Java UDF execution. Triggers UDF recreation when changed.
Default: (none)
Since: 1.7.0
Comments¶
This configuration works very similarly to snowpark.connect.udf.python.imports. With it, you can specify external libraries and files for Java UDFs created using registerJavaFunction. The two configurations are mutually exclusive to prevent unnecessary dependency mixing.
To include external libraries and files, provide stage paths to the files as the value of the snowpark.connect.udf.java.imports configuration setting. The value is a comma-separated list of stage paths.
Example¶
Code in the following example includes two files in the UDF’s execution context. The UDF imports functions from these files and uses them in its logic.
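A minimal sketch of what such code can look like (the stage paths, class name, and table name are illustrative; an active session is assumed):

```python
# Two helper files staged in advance (illustrative stage paths).
spark.conf.set(
    "snowpark.connect.udf.java.imports",
    "@my_stage/helpers.jar,@my_stage/lookup-data.jar",
)

# Register a Java UDF whose class uses code from the imported files.
spark.udf.registerJavaFunction(
    "my_java_udf",        # name used in SQL
    "com.example.MyUdf",  # hypothetical UDF class on the imported classpath
)

spark.sql("SELECT my_java_udf(col) FROM my_table").show()
```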
You can use the snowpark.connect.udf.java.imports setting to include other kinds of files as well, such as those with data your code needs to read. Note that when you do this, your code should only read from the included files; any writes to such files will be lost after the function’s execution ends.
snowpark.connect.udf.packages¶
Specifies a comma-separated list of Python packages to include when registering UDFs.
Default: (none)
Since: 1.0.0
Comments¶
You can use this to define additional packages to be available in Python UDFs. The value is a comma-separated list of dependencies.
You can discover the list of supported packages by executing the following SQL in Snowflake:
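A sketch of that query, issued through the session (an active session is assumed):

```python
# List the Python packages available in the Snowflake Anaconda channel.
packages = spark.sql(
    "SELECT * FROM information_schema.packages WHERE language = 'python'"
)
packages.show()
```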
Example¶
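A minimal sketch of such an example (the package choice and UDF are illustrative; an active session is assumed):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Make numpy available inside Python UDFs (illustrative package list).
spark.conf.set("snowpark.connect.udf.packages", "numpy")

@udf(returnType=DoubleType())
def log1p_udf(x: float) -> float:
    import numpy as np  # resolved from the configured packages
    return float(np.log1p(x))

df = spark.range(5).select(log1p_udf("id").alias("log1p_id"))
df.show()
```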
For more information, see Python.
snowpark.connect.udtf.compatibility_mode ¶
Specify true to enable Spark-compatible UDTF behavior for improved compatibility with Spark UDTF semantics.
Default: false
Since: 1.0.0
Comments¶
This property determines whether UDTFs should use Spark-compatible behavior or the default Snowpark behavior. When set to true, it applies a compatibility wrapper that mimics Spark’s output type coercion and error handling patterns.
When enabled, UDTFs use a compatibility wrapper that applies Spark-style automatic type coercion (e.g., string “true” to boolean, boolean to integer) and error handling. The wrapper also converts table arguments to Row-like objects for both positional and named access, and properly handles SQL null values to match Spark’s behavior patterns.
snowpark.connect.version¶
Returns the current Snowpark Connect for Spark version. Read only.
Default: <current_version>
Since: 1.0.0
snowpark.connect.views.duplicate_column_names_handling_mode ¶
Specifies how to handle duplicate column names in views. Allowed values are rename (add a suffix), fail (raise an error), or drop (remove duplicates).
Default: rename
Since: 1.0.0
Comments¶
Snowflake does not support duplicate column names.
Example¶
The following code fails at the view creation step with the following SQL compilation error: “duplicate column name ‘foo’”.
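A minimal sketch that reproduces the error (the data is illustrative; an active session is assumed):

```python
left = spark.createDataFrame([(1, "a")], ["id", "foo"])
right = spark.createDataFrame([(1, "b")], ["id", "foo"])

# The joined frame has two columns named "foo".
joined = left.join(right, on="id")

# With duplicate handling set to "fail", this view creation raises
# a SQL compilation error: duplicate column name 'foo'.
joined.createOrReplaceTempView("my_view")
```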
To work around this, set the snowpark.connect.views.duplicate_column_names_handling_mode configuration option to one of the following values:
- rename: A suffix such as _dedup_1, _dedup_2, and so on will be appended to all of the duplicate column names after the first one.
- drop: All of the duplicate columns except one will be dropped. If the columns have different values, this might lead to incorrect results.
snowpark.connect.sql.emulatePartitionOverwritesForSnowflakeTables¶
When true, allows partition overwrites on Snowflake tables in Spark SQL (INSERT OVERWRITE <table> PARTITION(<partition spec>)).
Default: false
Since: 1.12.3
Comments¶
Snowflake tables do not support user-defined partitioning, and by default, partition overwrites will result in an error. Enabling this option allows using INSERT OVERWRITE <table> PARTITION(<partition spec>) syntax to perform overwrites.
The <partition spec> will accept any columns that exist in the target table.
Example¶
Code in the following example overwrites all rows in the students table that have a student_id of 222222.
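A minimal sketch of such code (the source query and column list are illustrative; an active session is assumed):

```python
# Allow INSERT OVERWRITE ... PARTITION on Snowflake tables.
spark.conf.set(
    "snowpark.connect.sql.emulatePartitionOverwritesForSnowflakeTables",
    "true",
)

# Replace only the rows whose student_id is 222222.
spark.sql("""
    INSERT OVERWRITE students PARTITION (student_id = 222222)
    SELECT name, grade FROM updated_students
""")
```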
snowpark.connect.artifact_repository ¶
Specifies the name of a Snowflake artifact repository to use for package resolution when registering UDFs, UDTFs, applyInPandas, mapInArrow, and cogroup operations. When set, packages specified via snowpark.connect.udf.packages are resolved from the specified artifact repository instead of Anaconda.
Default: (none)
Since: 1.14.0
Comments¶
By default, Snowpark Connect for Spark resolves Python packages from Snowflake’s curated Anaconda channel. Setting this configuration to an artifact repository name allows resolving packages from PyPI or other configured sources, enabling the use of packages that are not available in the Anaconda channel.
For information on how to create and configure an artifact repository in Snowflake, see Using third-party packages.
Changing this configuration invalidates cached UDFs and UDTFs, causing them to be recreated with the new repository on next invocation.
This configuration applies to the following operations:
- UDFs registered via the @udf decorator or spark.udf.register()
- UDTFs registered via the @udtf decorator or spark.udtf.register()
- applyInPandas via groupBy().applyInPandas()
- mapInArrow via DataFrame.mapInArrow()
- cogroup via groupBy().cogroup().applyInPandas()
Example¶
The following example configures the artifact repository, then defines a UDF that uses pykalman, a package available from the artifact repository, to apply Kalman filter smoothing.
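A minimal sketch of such an example (the repository name and data are illustrative; an active session is assumed):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# Resolve UDF packages from an artifact repository instead of Anaconda.
spark.conf.set("snowpark.connect.artifact_repository", "my_pypi_repo")  # hypothetical repository name
spark.conf.set("snowpark.connect.udf.packages", "pykalman")

@pandas_udf(DoubleType())
def kalman_smooth(values: pd.Series) -> pd.Series:
    from pykalman import KalmanFilter  # resolved from the repository
    kf = KalmanFilter(initial_state_mean=values.iloc[0], n_dim_obs=1)
    smoothed, _ = kf.smooth(values.to_numpy())
    return pd.Series(smoothed.ravel())

df = spark.createDataFrame([(float(v),) for v in [1, 2, 10, 3, 4]], ["x"])
df.select(kalman_smooth("x").alias("x_smooth")).show()
```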
For more information on artifact repositories and available packages, see Using third-party packages.
snowpark.connect.udf.resource_constraint.architecture ¶
When set to x86, UDFs, UDTFs, and applyInPandas operations are created with an x86 architecture constraint. This requires a warehouse configured with an x86 resource constraint for execution.
Default: (none)
Since: 1.13.0
Comments¶
Some third-party Python packages (such as TensorFlow, XGBoost, and certain scientific libraries) are built only for the x86 CPU architecture. Setting this configuration to x86 adds RESOURCE_CONSTRAINT=(architecture='x86') to the CREATE FUNCTION statement generated by Snowpark Connect for Spark, ensuring the UDF runs on x86-compatible infrastructure.
To use this configuration, you must execute your workload on a warehouse that has been created with an x86 resource constraint. The following resource constraint values support x86:
- MEMORY_1X_x86 (minimum warehouse size: XSMALL)
- MEMORY_16X_x86 (minimum warehouse size: MEDIUM)
- MEMORY_64X_x86 (minimum warehouse size: LARGE)
If the warehouse does not have an x86 resource constraint, UDF execution will fail.
This configuration applies to the following operations:
- UDFs registered via the @udf decorator or spark.udf.register()
- UDTFs registered via the @udtf decorator or spark.udtf.register()
- applyInPandas via groupBy().applyInPandas()
Example¶
The following example creates a warehouse with an x86 resource constraint, then configures Snowpark Connect for Spark to use x86 architecture for UDFs.
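A minimal sketch of such an example (the warehouse name and size are illustrative; an active session with sufficient privileges is assumed):

```python
# Create a Snowpark-optimized warehouse with an x86 resource constraint.
spark.sql("""
    CREATE WAREHOUSE IF NOT EXISTS x86_wh
      WAREHOUSE_SIZE = 'MEDIUM'
      WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
      RESOURCE_CONSTRAINT = 'MEMORY_16X_x86'
""")
spark.sql("USE WAREHOUSE x86_wh")

# UDFs, UDTFs, and applyInPandas operations are now created with an
# x86 architecture constraint and run on x86-compatible infrastructure.
spark.conf.set("snowpark.connect.udf.resource_constraint.architecture", "x86")
```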
For more information on warehouses and resource constraints, see Snowpark-optimized warehouses.
Comments¶
The snowpark.connect.cte.optimization_enabled configuration enables Snowflake Common Table Expressions (CTEs). It optimizes Snowflake queries that contain many repetitive code blocks, which improves both query compilation and execution performance.