File I/O with Snowpark Connect for Spark¶
With Snowpark Connect for Spark, you can read and write data in several file formats using the standard Spark DataFrame reader and writer APIs. You can use internal stages, external stages, and cloud storage locations as sources and destinations. For how to configure stages and cloud URIs, see External data sources with Snowpark Connect for Spark.
When you set Spark-style options (such as header, delimiter, or dateFormat), Snowpark Connect for Spark translates them into the
corresponding Snowflake file format options before executing the operation. The
tables in the Format-specific options section show exactly how each Spark option maps to its Snowflake counterpart.
Supported formats¶
| Format | Read | Write |
|---|---|---|
| CSV | Supported | Supported |
| JSON | Supported | Supported |
| Parquet | Supported | Supported |
| Text | Supported | Supported |
| XML | Supported | Not supported |
| Avro | Not supported | Not supported |
| ORC | Not supported | Not supported |
Reading data¶
Use the standard Spark read API with format short-hands or .format(). Paths can be Snowflake stage notation
(for example, @my_stage/path/), cloud URIs configured for your account, or local file paths on the machine
where Snowpark Connect for Spark is running. Chain .option() calls to control format behavior.
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
Reading files with SQL identifiers¶
In addition to the DataFrameReader API, you can read files directly from SQL using Spark’s format-prefix
identifier syntax with spark.sql():
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
Supported prefixes: csv.…, `json.`..., parquet.…, `text.`....
The path inside the backticks is treated the same as it would be by spark.read.<format>(path).
Snowflake stage paths (@stage/...), cloud URIs, and local paths all work. Default format options
apply (no header for CSV, automatic schema inference, and so on). If you need non-default options,
use spark.read instead because the SQL identifier form doesn’t accept per-call options.
You can combine these file reads with the rest of your SQL: joins, aggregations, CTEs, and subqueries all work on top of them.
Writing data¶
Use the DataFrame write API with the same format short-hands.
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
Reading from and writing to Snowflake tables¶
You can treat Snowflake tables as Spark DataFrame sources and sinks.
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
Save modes¶
Set mode on the writer to control how writes interact with existing data. Not all modes are available for every
format or destination type.
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
| Mode | Spark name | Behavior in Snowpark Connect for Spark |
|---|---|---|
| Error if exists | error / errorifexists | Default. Fails if data already exists at the target path or table. |
| Overwrite | overwrite | Removes existing files at the target path before writing. For table writes, replaces the table. |
| Append | append | Adds new files alongside existing ones. Uses a random filename prefix to avoid conflicts. |
| Ignore | ignore | Skips the write if data already exists. Supported for Parquet file writes and table writes only. CSV, JSON, and text file writes raise an error for this mode. |
Controlling output file count and size¶
By default, Snowflake decides how to split your output into files. You can control this with standard Spark APIs and Snowpark Connect for Spark-specific options.
coalesce(1) for a single output file¶
The most common Spark idiom for producing one output file is honored:
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
When coalesce(1) is used without partitionBy, Snowpark Connect for Spark routes the write into a single
Snowflake file.
repartition(n) for multiple output files¶
When you request n output files, Snowpark Connect for Spark produces n files with Spark-compatible names:
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
Rows are distributed across the n target files so each file gets roughly the same amount of
data. When you combine repartition(n) with partitionBy, the per-partition directory layout
takes precedence and Snowflake controls the file count inside each col=value/ directory.
single and snowflake_max_file_size write options¶
Snowpark Connect for Spark provides two additional write options for explicit control over output layout:
| Option | Description |
|---|---|
single | Set to true to force a single output file regardless of how many input partitions the DataFrame has. Equivalent to coalesce(1) for non-partitioned writes. Ignored when used with partitionBy. |
snowflake_max_file_size | Maximum size in bytes per output file. Larger values reduce the file count; smaller values produce more files. When single-file behavior is active and this option isn’t set, Snowpark Connect for Spark defaults the cap to 1 GB. |
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
Compression¶
Set compression through .option("compression", "<codec>"). For writes, the default is NONE for CSV, JSON,
and text, and SNAPPY for Parquet. For reads, compression is auto-detected if not specified.
For best performance, use splittable compression formats such as BZ2 or SNAPPY. Splittable formats allow
Snowflake to decompress and process file chunks in parallel, which is significantly faster for large files than
non-splittable formats like GZIP.
Snowpark Connect for Spark normalizes codec names: UNCOMPRESSED becomes NONE, and Spark’s BZIP2 becomes BZ2
for CSV, JSON, and text. Compression can also be inferred from file extensions (.gz, .bz2,
.snappy, .deflate).
| CSV / JSON / Text | Parquet |
|---|---|
| GZIP, BZ2, BROTLI, ZSTD, DEFLATE, RAW_DEFLATE, NONE | SNAPPY, LZO, NONE |
Note
Compression isn’t supported for XML reads.
Parallel file reads¶
Multiple files¶
When you read from a directory or pass a list of file paths, Snowpark Connect for Spark reads the files in parallel automatically. No additional configuration is required.
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
Single large file¶
Snowpark Connect for Spark can also split a single large file into chunks and read them in parallel, so one large file doesn’t become a bottleneck. This is supported for CSV, JSON, and XML.
CSV: The file must be uncompressed (compression set to none) and multiLine must be
false (the default):
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
JSON: The file must be uncompressed (compression set to none) or BZ2-compressed, and multiLine
must be false (the default):
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
XML: Parallel reads are enabled by default:
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
Automatic exclusion of metadata files¶
When you read from a directory, Snowpark Connect for Spark automatically skips files that Spark also skips, so you don’t have to filter them out manually:
| Excluded pattern | Reason |
|---|---|
_SUCCESS | Spark write-completion marker |
_metadata, _common_metadata | Parquet metadata sidecars |
*.crc | Hadoop checksum files |
.DS_Store, other .* files | macOS and hidden files |
Any file starting with _ or . | General Hadoop convention |
This exclusion applies to CSV, JSON, Parquet, and XML reads at any directory depth, including inside
partition subdirectories (for example, year=2025/_SUCCESS is skipped).
Snowpark Connect for Spark also anchors read paths so that files outside your requested prefix are never picked up.
For example, reading from @stage/sales/ won’t accidentally include files from @stage/sales_archive/.
If you set pathGlobFilter to a pattern that explicitly matches hidden or metadata files
(for example, _*), that pattern takes precedence and those files are included.
Format-specific options¶
Spark options set through .option() are translated into Snowflake
file format options before Snowpark Connect for Spark executes the read or write
operation. The tables below show how each Spark option maps to its Snowflake counterpart and note any behavioral
differences.
Options that don’t have a Snowflake equivalent are ignored. Snowpark Connect for Spark logs a warning for each unsupported option.
CSV options¶
| Spark option | Snowflake file format option | Notes |
|---|---|---|
sep / delimiter | FIELD_DELIMITER | Default ,. delimiter is an alias for sep. |
header | PARSE_HEADER | When true, the reader sets PARSE_HEADER and uses column names from the first row. When writing, controls whether a header row is written. |
quote | FIELD_OPTIONALLY_ENCLOSED_BY | Default " for reads. Semantics differ from Spark: Snowflake uses optional enclosure rather than mandatory quoting. An empty value removes enclosure. |
escape | ESCAPE | A single \ is normalized to \\. Dropped when the field delimiter is \ or when escape equals quote. |
nullValue | NULL_IF | Spark defaults to ""; Snowflake defaults to \N. Explicitly set this option if your files use a specific null representation. |
lineSep | RECORD_DELIMITER | Default \n. |
encoding / charset | ENCODING | Default UTF-8. charset is an alias for encoding. |
multiLine | MULTI_LINE | Default false. |
dateFormat | DATE_FORMAT | Java SimpleDateFormat patterns are automatically converted to Snowflake tokens (for example, dd/MM/yyyy becomes DD/MM/YYYY). See Date and timestamp format conversion. |
timestampFormat | TIMESTAMP_FORMAT | Default yyyy-MM-dd HH:mm:ss.SSSSSS. Same Java-to-Snowflake token conversion as dateFormat. |
compression | COMPRESSION | See Compression. |
inferSchema | INFER_SCHEMA | Default false (read only). |
rowsToInferSchema | INFER_SCHEMA_OPTIONS.MAX_RECORDS_PER_FILE | Read only. Default 20000. Sets the number of rows sampled per file when inferring the schema. Reduce this value to speed up reads when the data is uniform. Requires inferSchema = true. |
mode | ON_ERROR | Controls error handling on read. DROPMALFORMED maps to CONTINUE, FAILFAST maps to ABORT_STATEMENT. PERMISSIVE (the Spark default) is only partially supported. |
ignoreLeadingWhiteSpace / ignoreTrailingWhiteSpace | TRIM_SPACE | If either option is true, Snowflake trims whitespace from both sides. One-sided trimming isn’t available. |
mergeSchema | (internal) | When true, Snowpark Connect for Spark infers schemas per file group and produces a union schema. Requires header = true and inferSchema = true. New columns across files become nullable. Type widening (for example, INT to STRING) is supported. |
relaxTypesToInferSchema | (internal) | When true, widens narrow numeric types to match Spark’s inference behavior (for example, Short to Integer, small Decimal to Long). Enabled automatically when rowsToInferSchema is set. |
pathGlobFilter | PATTERN | Filters which files to read based on a glob pattern. |
In addition, Snowpark Connect for Spark always sets the following on CSV reads:
ESCAPE_UNENCLOSED_FIELD = NONEERROR_ON_COLUMN_COUNT_MISMATCH = FalseSKIP_BLANK_LINES = True
JSON options¶
| Spark option | Snowflake file format option | Notes |
|---|---|---|
multiLine | STRIP_OUTER_ARRAY + MULTI_LINE | Both Snowflake options are set to the same value. When true, the reader handles files that contain a JSON array at the top level or span multiple lines. |
dateFormat | DATE_FORMAT | Default auto. Unlike CSV, JSON date format strings are not converted from Java patterns. Pass Snowflake-compatible format tokens directly. |
timestampFormat | TIMESTAMP_FORMAT | Default auto. Same as dateFormat: pass Snowflake-compatible tokens. |
encoding | ENCODING | Default UTF-8. |
mode | ON_ERROR | Controls error handling on read. DROPMALFORMED maps to CONTINUE, FAILFAST maps to ABORT_STATEMENT. PERMISSIVE (the Spark default) is only partially supported. |
dropFieldIfAllNull | (schema inference) | When true, fields that are null in every sampled row are dropped from the inferred schema. |
rowsToInferSchema | INFER_SCHEMA_OPTIONS.MAX_RECORDS_PER_FILE | Read only. Sets the number of rows sampled per file when inferring the schema. Reduce this value to speed up reads when the data is uniform. |
relaxTypesToInferSchema | (internal) | When true, widens narrow numeric types to match Spark’s inference behavior (for example, Short to Integer, small Decimal to Long). Enabled automatically when rowsToInferSchema is set. |
pathGlobFilter | PATTERN | Filters which files to read. |
compression | COMPRESSION | See Compression. |
When writing JSON, Snowpark Connect for Spark converts the DataFrame into a single VARIANT column using OBJECT_CONSTRUCT
before unloading. On the write path, nullValue is mapped to NULL_IF and compression is mapped to
COMPRESSION. Other Spark JSON writer options (such as dateFormat) aren’t applied.
Parquet options¶
| Spark option | Snowflake file format option | Notes |
|---|---|---|
pathGlobFilter | PATTERN | Filters which files to read. |
compression | COMPRESSION | Default SNAPPY for writes, AUTO for reads. Allowed values: AUTO, LZO, SNAPPY, NONE. UNCOMPRESSED is also accepted and normalized to NONE. |
mergeSchema | (internal) | When true and reading from multiple paths, Snowpark Connect for Spark infers the schema from each file and produces a union schema. New columns are added; type widening across files (for example, INT to STRING) isn’t supported. |
rowsToInferSchema | (internal) | Read only. Default 20000. Controls the sample size used by Snowpark Connect for Spark when discovering complex types (STRUCT, MAP, ARRAY) inside Parquet VARIANT columns. |
nullValue | NULL_IF | Write only. Sets the string that represents null values in the output files. |
Snowpark Connect for Spark also sets these options automatically based on session configuration:
BINARY_AS_TEXT = False(always)USE_LOGICAL_TYPE: controlled bysnowpark.connect.parquet.useLogicalType
When writing Parquet, structured complex types (ARRAY, MAP, STRUCT) are cast to VARIANT before
unload so that Snowflake’s COPY INTO can produce valid Parquet files.
Text options¶
Text I/O uses a CSV file format internally with FIELD_DELIMITER = NONE.
| Spark option | Snowflake file format option | Notes |
|---|---|---|
lineSep / linesep | RECORD_DELIMITER | Default \n. Controls the line separator for both reads and writes. |
wholetext | RECORD_DELIMITER = NONE | When true, each file is read as a single row. |
compression | COMPRESSION | Default auto for reads, NONE for writes. |
Text writes always set ESCAPE_UNENCLOSED_FIELD = NONE and FILE_EXTENSION = txt. The DataFrame must
contain exactly one string column.
XML options (read only)¶
XML supports schema inference. If no schema is provided, Snowpark Connect for Spark infers the schema from the data by reading the files and merging field types across all input.
| Spark option | Internal mapping | Notes |
|---|---|---|
rowTag | rowTag | Specifies the XML element that maps to a DataFrame row. Defaults to ROW if not set. |
inferSchema | inferSchema | Default true. When a user schema is provided, Snowpark Connect for Spark forces inferSchema=false to keep the read single-pass. |
samplingRatio | samplingRatio | Default 1.0. Fraction of rows to sample when inferring the schema. |
mode | mode | Default PERMISSIVE. Controls error handling: PERMISSIVE, DROPMALFORMED, or FAILFAST. PERMISSIVE is partially supported: if input data doesn’t match the specified schema type and can’t be coerced, an error might be thrown instead of placing the row in the corrupt record column. |
columnNameOfCorruptRecord | columnNameOfCorruptRecord | Default _corrupt_record. Column name used to store malformed rows in PERMISSIVE mode. |
attributePrefix | attributePrefix | Default _. Prefix added to XML attribute names in the DataFrame schema. Can’t equal valueTag. |
valueTag | valueTag | Default _VALUE. Tag name for the text content of mixed-content elements. Can’t be empty. |
encoding | charset | Renamed internally. Default UTF-8. |
excludeAttribute | excludeAttributes | Renamed internally. Default false. |
ignoreSurroundingSpaces | ignoreSurroundingWhitespace | Renamed internally. Default false. |
ignoreNamespace | ignoreNamespace | Default false. When true, XML namespace prefixes are stripped from element and attribute names. |
nullValue | nullValue | Default "". String value to interpret as null. |
pathGlobFilter | PATTERN | Filters which files to read. |
rowValidationXSDPath | rowValidationXSDPath | Path to an XSD file on a stage (for example, @my_stage/schema.xsd) to validate each XML row against. Local XSD files are automatically uploaded to the stage. Rows that fail validation are handled according to the mode setting. |
Date and timestamp format conversion¶
For CSV reads and writes, Snowpark Connect for Spark automatically converts Java SimpleDateFormat patterns (used by Spark)
to Snowflake file format tokens. Common conversions:
| Java/Spark pattern | Snowflake token | Example |
|---|---|---|
yyyy | YYYY | Four-digit year |
yy | YY | Two-digit year |
MM | MM | Zero-padded month |
MMM | MON | Abbreviated month name (Jan, Feb) |
dd | DD | Zero-padded day of month |
HH | HH24 | 24-hour clock (00-23) |
hh | HH12 | 12-hour clock (01-12) |
mm | MI | Minutes |
ss | SS | Seconds |
SSS | FF3 | Fractional seconds (milliseconds) |
SSSSSS | FF6 | Fractional seconds (microseconds) |
a | AM | AM/PM marker |
Some patterns don’t have exact Snowflake equivalents:
- Unpadded values: Spark’s single-letter patterns (
dfor day,hfor hour) produce unpadded output (for example,9), but Snowflake always zero-pads (DDproduces09). - Full day name: Spark’s
EEEE(Thursday) maps to abbreviatedDY(Thu) in Snowflake. - Timezone offsets: Spark’s
Z(-0500) maps toTZHTZM(-0500) without a colon separator. The colon-separatedTZH:TZMformat (-05:00) applies toXXXandxxxpatterns.
Important
This conversion applies only to CSV. For JSON, date and timestamp format strings are passed directly to
Snowflake without conversion. Use Snowflake-compatible format tokens (such as YYYY-MM-DD) when setting
dateFormat or timestampFormat on JSON readers.
Partitioned data¶
Snowpark Connect for Spark supports Hive-style col=value/ directory layouts for CSV, JSON, and Parquet formats.
Writing partitioned data¶
Use the standard partitionBy(...) API. Snowpark Connect for Spark produces the same col=value/ directory tree that Spark
does:
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
Note
Unlike open-source Spark, Snowpark Connect for Spark includes the partition columns in the written data files themselves, not only in the directory structure. The output files contain all columns of the DataFrame, including the ones used for partitioning.
Reading partitioned data¶
Point the reader at the root directory and Snowpark Connect for Spark discovers partition columns from the directory names automatically:
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
Partition discovery behavior:
- Partition columns appear at the end of the schema, in the order they appear in the directory tree.
- Partition value types are inferred from the observed values: all integers become
IntegerType, all floating-point values becomeDoubleType, otherwiseStringType. You can override a partition column’s type by supplying a.schema(...)that includes it. nullvalues in a partition column are written as the directory segment__HIVE_DEFAULT_PARTITION__and read back asnull.- Mixed-depth or conflicting layouts (for example, the same key at two different depths) raise an error rather than producing incorrect values.
Important
Filters on partition columns don’t reduce the number of files read. Snowpark Connect for Spark reads all files in the directory subtree and applies filters after loading.
Dynamic partition overwrite¶
When writing partitioned data to a stage, you can overwrite only the partitions that appear in the current
DataFrame instead of deleting all existing data. Set the spark.sql.sources.partitionOverwriteMode session
configuration to dynamic:
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
This removes only the partition subdirectories that match the DataFrame. See Snowpark Connect for Spark properties for more on this configuration property.
Note
The .option("overwrite-mode", "dynamic") writer option is supported only for
Iceberg table writes. For stage-based file writes, use the session configuration shown above.
Known limitations¶
- Save mode support varies by format:
ignoremode is supported only for Parquet file writes and table writes. CSV, JSON, and text file writes raise an error forignoremode. - ORC and Avro: These formats aren’t supported for read or write.
- Bucketed writes:
bucketByandsortByaren’t supported for file or table writes. - One-sided whitespace trimming: Spark’s
ignoreLeadingWhiteSpaceandignoreTrailingWhiteSpaceboth map toTRIM_SPACE, which trims from both sides. Trimming only leading or only trailing whitespace isn’t possible. - CSV quote semantics: Spark’s
quoteoption and Snowflake’sFIELD_OPTIONALLY_ENCLOSED_BYhave different semantics. Snowflake treats enclosure as optional, while Spark applies mandatory quoting rules. - Text column constraint: Text writes require exactly one string column in the DataFrame.
- XML MapType:
MapTypeisn’t supported for XML reads. UseStructTypeto represent key-value structures. - XML via Spark SQL: Reading XML files with
spark.sql()(for example,SELECT * FROM xml.@stage/file.xml``) isn’t supported. Use theDataFrameReaderAPI instead.