Stronger UTF-8 validation for external files (Pending)
This behavior change will be enabled in a future release. For the most up-to-date details about behavior changes, see the Behavior Change Log.
In a future release, Snowflake will enforce stronger UTF-8 validation for external files.
- Currently
When you query external Avro, Parquet, ORC, CSV, JSON, or XML files that contain invalid UTF-8 data, the queries usually succeed.
- Pending
When you query external Avro, Parquet, ORC, CSV, JSON, or XML files that contain invalid UTF-8 data, the queries will fail, as illustrated by the example query below.
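For reference, the kind of query affected is a direct query over staged files. The following is a minimal sketch that assumes a hypothetical stage named my_stage and a hypothetical CSV file format named my_csv_format.

```sql
-- Hypothetical names: @my_stage and my_csv_format are assumptions for illustration.
-- Under the pending behavior, this query fails if any staged file contains invalid UTF-8 data.
SELECT $1, $2
FROM @my_stage (FILE_FORMAT => 'my_csv_format');
```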
If you use COPY INTO <table> or Snowpipe to load external files that contain invalid UTF-8 data, Snowflake will proceed according to the copy option ON_ERROR = CONTINUE: the record that contains invalid UTF-8 data will be counted as an error, and Snowflake will continue to load the file.
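For example, a load that relies on this behavior might look like the following sketch, which assumes a hypothetical target table my_table, stage my_stage, and file format my_csv_format.

```sql
-- Hypothetical names used for illustration only.
-- Records with invalid UTF-8 data are counted as errors; the rest of each file is still loaded.
COPY INTO my_table
  FROM @my_stage
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
  ON_ERROR = CONTINUE;
```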
To avoid UTF-8 validation errors, Snowflake recommends that you specify REPLACE_INVALID_CHARACTERS = TRUE for your file format so that any invalid UTF-8 characters are replaced with the Unicode replacement character (�).
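As a sketch, you could set this option when creating the file format; my_csv_format is a hypothetical name, and TYPE is shown as CSV only for illustration.

```sql
-- Hypothetical file format name; adjust TYPE to match your data.
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = CSV
  REPLACE_INVALID_CHARACTERS = TRUE;  -- invalid UTF-8 characters become U+FFFD (�)
```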
For Parquet files, you can also set BINARY_AS_TEXT = FALSE for your file format so that columns with no defined logical data type are interpreted as binary data instead of as UTF-8 text.
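Similarly, a sketch for a Parquet file format, assuming the hypothetical name my_parquet_format:

```sql
-- Hypothetical file format name.
CREATE OR REPLACE FILE FORMAT my_parquet_format
  TYPE = PARQUET
  BINARY_AS_TEXT = FALSE;  -- columns without a defined logical type are read as binary, not UTF-8 text
```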
Note that this behavior change does not apply to existing accounts that currently load invalid UTF-8 data; it only affects new accounts. If you encounter any issues, contact Snowflake Support.
Ref: 1013 1014