Collation Support¶
Collation allows you to specify alternative rules for comparing strings, which can be used to compare and sort data according to a particular language or other user-specified rules.
Overview¶
Text strings in Snowflake are stored using the UTF-8 character set and, by default, strings are compared according to the Unicode codes that represent the characters in the string.
However, comparing strings based on their UTF-8 character representations might not provide the desired/expected behavior. For example:
If special characters in a given language do not sort according to that language’s ordering standards, then sorting might return unexpected results.
You might want the strings to be ordered by other rules, such as ignoring whether the characters are uppercase or lowercase.
Collation allows you to explicitly specify the rules to use for comparing strings, based on:
Different locales (i.e. different character sets for different languages).
Case-sensitivity (i.e. whether to use case-sensitive or case-insensitive string comparisons without explicitly calling the UPPER or LOWER functions to convert the strings).
Accent-sensitivity (e.g. whether
Z
,Ź
, andŻ
are considered the same letter or different letters).Punctuation-sensitivity (i.e. whether comparisons use only letters or include all characters). For example, if a comparison is punctuation-insensitive, then
A-B-C
andABC
are treated as equivalent.Additional options, such as preferences for sorting based on the first letter in a string and trimming of leading and/or trailing blank spaces.
Uses for Collation¶
Collation can be used in a wide variety of operations, including (but not limited to):
Usage |
Example |
Link |
---|---|---|
Simple comparison |
|
|
Joins |
|
|
Sorting |
|
|
Top-K sorting |
|
|
Aggregation |
|
|
Window clauses |
|
|
Scalar functions |
|
|
Aggregate functions |
|
|
Data clustering |
|
Collation Control¶
Collation control is granular. You can explicitly specify the collation to use for:
An account, using the account-level parameter DEFAULT_DDL_COLLATION.
All columns in all tables added to a database, using the ALTER DATABASE command.
All columns in all tables added to a schema, using the ALTER SCHEMA command.
All columns added to a table, using the ALTER TABLE command.
Individual columns in a table, using the CREATE TABLE command.
A specific comparison within a SQL statement (e.g.
WHERE col1 = col2
). If multiple collations are applied to a statement, Snowflake determines the collation to use based on precedence. For more details about precedence, see Collation Precedence in Multi-string Operations (in this topic).
Collation SQL Constructs¶
COLLATE Clause (for Table Column Definitions)¶
Adding the optional COLLATE
clause to the definition of a table column indicates that the specified collation is used for comparisons and other related operations performed on the data in
the column:
CREATE TABLE <table_name> ( <col_name> <col_type> COLLATE '<collation_specification>'
[ , <col_name> <col_type> COLLATE '<collation_specification>' ... ]
[ , ... ]
)
If no COLLATE
clause is specified for a column, Snowflake uses the default, which compares strings based on their UTF-8 character representations.
Also, Snowflake supports specifying an empty string for the collation specification (e.g. COLLATE ''
), which is equivalent to specifying no collation for the column.
However, note that, due to precedence, specifying COLLATE ''
for a column does not have the same effect as explicitly specifying COLLATE 'utf8'
. For more details, see
Collation Precedence in Multi-string Operations (in this topic).
To see whether collation has been specified for the columns in a table, use DESCRIBE TABLE (or use the COLLATION function to view the collation, if any, for a specific column).
COLLATE Function¶
This function uses the specified collation on the input string expression:
COLLATE( <expression> , '[<collation_specification>]' )
This function can also be called using infix notation:
<expression> COLLATE '[<collation_specification>]'
This function is particularly useful for explicitly specifying a particular collation for a particular operation (e.g. sorting), but it can also be used to:
Allow collation in the SELECT clause of a subquery, making all operations on the specified column in the outer query use the collation.
Create a table using CTAS with a specified collation.
For example:
-- Evaluates using "English case-insensitive" collation: SELECT * FROM t1 WHERE COLLATE(col1 , 'en-ci') = 'Tango'; -- Sorts the results using German (Deutsch) collation. SELECT * FROM t1 ORDER BY COLLATE(col1 , 'de'); -- Creates a table with a column using French collation. CREATE TABLE t2 AS SELECT COLLATE(col1, 'fr') AS col1 FROM t1; -- Creates a table with a column using French collation. CREATE TABLE t2 AS SELECT col1 COLLATE 'fr' AS col1 FROM t1;
COLLATION Function¶
This function returns the collation specification used by an expression, including a table column:
COLLATION( <expression> )
If no collation has been specified for the expression, the function returns NULL
.
Typically, if you use this on a column name, you should use DISTINCT
to avoid getting one row of output for each row in the table. For example:
SELECT DISTINCT COLLATION(column1) FROM table1;
Note
This function only returns the collation specification, not its precedence level. For more details about precedence, see Collation Precedence in Multi-string Operations (in this topic).
Collation Specifications¶
When using a COLLATE clause (for a table column) or the COLLATE function (for an expression), you must include a collation specification, which determines the comparison logic used for the column/expression.
A collation specification consists of a string of one or more specifiers separated by a hyphen (-
), in the form of:
'<specifier>[-<specifier> ...]'
The following specifiers are supported (for details, see Supported Specifiers in this topic):
Locale.
Case-sensitivity.
Accent-sensitivity.
Punctuation-sensitivity.
First-letter preference.
Case-conversion.
Space-trimming.
Specifiers are case-insensitive and can be in any order, except for locale, which must always be first, if used.
Specification Examples¶
Some examples of collation specification strings include:
'de'
: German (Deutsch) locale.'de-ci-pi'
: German locale, with case-insensitive and punctuation-insensitive comparisons.'fr_CA-ai'
: Canadian French locale, with accent-insensitive comparisons.'en_US-trim'
: US English locale, with leading spaces and trailing spaces trimmed before the comparison.
You can also specify an empty string for a collation specification (e.g. COLLATE ''
or COLLATE(col1, '')
), which indicates to use no collation.
Supported Specifiers¶
- Locale
Specifies the language-specific and country-specific rules to apply.
Supports valid locale strings, consisting of a language code (required) and country code (optional) in the form of
language_country
. Some locale examples include:en
- English.en_US
- American English.fr
- French.fr_CA
- Canadian French.
In addition, the
utf8
pseudo-locale specifies to use Unicode ordering, which is the default. For more details, see Sorting Using UTF-8 vs Locale Collation (in this topic).The locale specifier is optional, but, if used, must be the first specifier in the string.
- Case-sensitivity
Determines if case should be considered when comparing values. Possible values:
cs
- Case-sensitive (default).ci
- Case-insensitive.
For example:
Collation Specification
Value
Result
'en-ci'
Abc = abc
True
'en-cs'
/en
Abc = abc
False
- Accent-sensitivity
Determines if accented characters should be considered equal to, or different from, their base characters. Possible values:
as
- Accent-sensitive (default).ai
- Accent-insensitive.
For example:
Collation Specification
Value
Result
Notes
'fr-ai'
E = É
True
'fr-as'
/'fr'
E = É
False
'en-ai'
a = ą
True
In English, these letters are treated as having only accent differences, so specifying account-insensitivity results in the values comparing as equal.
'pl-ai'
a = ą
False
In Polish, these letters are treated as separate base letters, so they always compare as unequal regardless of whether accent-insensitivity is specified.
'pl-as'
/'pl'
a = ą
False
Note that the rules for accent-sensitivity and collation vary between languages. For example, in some languages, collation is always accent-sensitive and you cannot turn it off even by specifying accent-insensitive collation.
- Punctuation-sensitivity
Determines if non-letter characters matter. Possible values:
ps
- Punctuation-sensitive.pi
- Punctuation-insensitive.
Note that the default is locale-specific (i.e. if punctuation-sensitivity is not specified, locale-specific rules are used). In most cases, the rules are equivalent to
ps
.For example:
Collation Specification
Value
Result
Notes
'en-pi'
A-B-C = ABC
True
'en-ps'
A-B-C = ABC
False
- First-letter Preference
Determines if, when sorting, uppercase or lowercase letters should be first. Possible values:
fl
- Lowercase letters sorted first.fu
- Uppercase letters sorted first.
Note that the default is locale-specific (i.e. if no value is specified, locale-specific ordering is used). In most cases, the ordering is equivalent to
fl
.Also, this specifier has no impact on equality comparisons.
- Case-conversion
Results in strings being converted to lowercase or uppercase before comparisons. In some situations, this is faster than full locale-specific collation. Possible values:
upper
- Convert the string to uppercase before comparisons.lower
- Convert the string to lowercase before comparisons.
Note that this specifier does not have a default (i.e. if no value is specified, neither of the conversions occurs).
- Space-trimming
Removes leading/trailing spaces from strings before comparisons. This functionality can be useful for performing comparisons equivalent (except in extremely rare corner cases) in semantics to the SQL
CHAR
data type.Possible values:
trim
- Remove both leading and trailing spaces before comparisons.ltrim
- Remove only leading spaces before comparisons.rtrim
- Remove only trailing spaces before comparisons.
Note that this specifier does not have a default (i.e. if no value is specified, trimming is not performed).
For example:
Collation Specification
Value
Result
Notes
'en-trim'
__ABC_ = ABC
True
For the purposes of these examples, underscore characters represent blank spaces.
'en-ltrim'
__ABC_ = ABC
False
'en-rtrim'
__ABC_ = ABC
False
'en'
__ABC_ = ABC
False
Collation Implementation Details¶
Case-Insensitive Comparison¶
Comparing Uppercase Strings vs. Comparing the Original Strings¶
In some languages, two lowercase characters have the same corresponding uppercase character. For example, some languages support both
dotted and undotted forms of lowercase I
(i
vs. ı
). Forcing the strings to uppercase affects comparisons.
The following illustrates the difference:
Create the table:
create or replace table test_table (col1 varchar, col2 varchar); insert into test_table values ('ı', 'i');Query the data:
select col1 = col2, COLLATE(col1, 'lower') = COLLATE(col2, 'lower'), COLLATE(col1, 'upper') = COLLATE(col2, 'upper') from test_table; +-------------+-------------------------------------------------+-------------------------------------------------+ | COL1 = COL2 | COLLATE(COL1, 'LOWER') = COLLATE(COL2, 'LOWER') | COLLATE(COL1, 'UPPER') = COLLATE(COL2, 'UPPER') | |-------------+-------------------------------------------------+-------------------------------------------------| | False | False | True | +-------------+-------------------------------------------------+-------------------------------------------------+
Character Weights¶
Snowflake supports the following collation specifications.
ICU (International Components for Unicode).
Snowflake-specific collation specifications (e.g.
upper
andlower
).
For case-insensitive comparison operations defined by the ICU, Snowflake follows the
Unicode Collation Algorithm (UCA) and considers only the
primary and secondary weights, not the tertiary weights, of Unicode characters. Characters that differ only in their tertiary
weights are treated as identical. For example, using the en-ci
collation specification, a space and a non-breaking space
are considered identical.
Sorting Using UTF-8 vs Locale Collation¶
Strings are always stored internally in Snowflake in UTF-8, and can represent any character in any language supported by UTF-8; therefore, the default collation is UTF-8 (i.e. 'utf8'
).
UTF-8 collation is based on the numeric representation of the character as opposed to the alphabetic order of the character.
This is analogous to sorting by the ordinal value of each ASCII character, which is important to note because uppercase letters have ordinal values lower than lowercase letters:
A = 65
B = 66
...
a = 97
b = 98
...
As a result:
If you sort in UTF-8 order, all uppercase letters are returned before all lowercase letters:
A
,B
, … ,Y
,Z
, … ,a
,b
, … ,y
,z
In contrast, the
'en'
collation specification sorts alphabetically (instead of using the UTF-8 internal representation), resulting in bothA
anda
returned before bothB
andb
:a
,A
,b
,B
, …
Additionally, the differences between the cs
and ci
case-sensitivity specifiers impact sorting:
cs
(case-sensitive) always returns the lowercase version of a letter before the uppercase version of the same letter. For example, using'en-cs'
:a
,A
,b
,B
, …Note that case-sensitive is the default and, therefore,
'en-cs'
and'en'
are equivalent.ci
(case-insensitive) returns uppercase and lowercase versions of letters randomly with respect to each other, but still before both uppercase and lowercase version of later letters. For example, using'en-ci'
:A
,a
,b
,B
, …
Some non-alphabetic characters can also be sorted differently depending upon the collation setting. The following code example shows that
the plus character (+
) and minus character (-
) are sorted differently for different collation settings:
Create the table:
create or replace table demo ( no_explicit_collation VARCHAR, en_ci VARCHAR COLLATE 'en-ci', en VARCHAR COLLATE 'en', utf_8 VARCHAR collate 'utf8'); insert into demo (no_explicit_collation) values ('-'), ('+'); update demo SET en_ci = no_explicit_collation, en = no_explicit_collation, utf_8 = no_explicit_collation;Query the data:
select max(no_explicit_collation), max(en_ci), max(en), max(utf_8) from demo; +----------------------------+------------+---------+------------+ | MAX(NO_EXPLICIT_COLLATION) | MAX(EN_CI) | MAX(EN) | MAX(UTF_8) | |----------------------------+------------+---------+------------| | - | + | + | - | +----------------------------+------------+---------+------------+
Collation Precedence in Multi-string Operations¶
When performing an operation on two (or more) strings, different collations might be specified for different strings. Determining the collation to apply depends on how collation was specified for each input and the precedence of each specifier.
There are 3 precedence levels (from highest to lowest):
- Function
Collation is specified using the COLLATE Function function in a SQL statement.
- Column
Collation was specified in the column definition.
- None
No collation is/was specified for a given expression/column, or collation with an empty specification is/was used (e.g.
COLLATE(col1, '')
orcol1 STRING COLLATE ''
).
When determining the collation to use, the collation specification with the highest precedence is used. If multiple collations are specified and they have the same precedence level, their values are compared, and if they are not equal, an error is returned.
For example, consider a table with the following column-level collation specifications:
CREATE OR REPLACE TABLE collation_precedence_example( col1 VARCHAR, -- equivalent to COLLATE '' col2_fr VARCHAR COLLATE 'fr', -- French locale col3_de VARCHAR COLLATE 'de' -- German locale );
If the table is used in a statement comparing two strings, collation is applied as follows:
-- Uses the 'fr' collation because the precedence for col2_fr is higher than -- the precendence for col1. ... WHERE col1 = col2_fr ... -- Uses the 'en' collation, because it is explicitly specified in the statement, -- which takes precedence over the collation for col2_fr. ... WHERE col1 COLLATE 'en' = col2_fr ... -- Returns an error because the expressions have different collations at the same -- precedence level. ... WHERE col2_fr = col3_de ... -- Uses the 'de' collation because collation for col2_fr has been removed. ... WHERE col2_fr COLLATE '' = col3_de ... -- Returns an error because the expressions have different collations at the same -- precedence level. ... WHERE col2_fr COLLATE 'en' = col3_de COLLATE 'de' ...
Note
Even though Snowflake’s default collation is 'utf8'
, specifying an empty string (or specifying no collation) is
different from explicitly specifying 'utf8'
because explicit collation has higher precedence than no collation.
The last two statements in the code examples below show the difference:
CREATE OR REPLACE TABLE collation_precedence_example2( s1 STRING COLLATE '', s2 STRING COLLATE 'utf8', s3 STRING COLLATE 'fr' ); -- Uses 'utf8' because s1 has no collation and 'utf8' is the default. SELECT * FROM collation_precedence_example2 WHERE s1 = 'a'; -- Uses 'utf8' because s1 has no collation and s2 has explicit 'utf8' collation. SELECT * FROM collation_precedence_example2 WHERE s1 = s2;
This example executes without error because s1 has no collation and s3 has explicit fr
collation, so the
explicit collation takes precedence:
SELECT * FROM collation_precedence_example2 WHERE s1 = s3; +----+----+----+ | S1 | S2 | S3 | |----+----+----| +----+----+----+
This example causes an error because s2 and s3 have different collations specified at the same precedence level:
SELECT * FROM collation_precedence_example2 WHERE s2 = s3;
Output:
002322 (42846): SQL compilation error: Incompatible collations: 'fr' and 'utf8'
Limited Support for Collation in Built-in Functions¶
Collation is supported in only a subset of string functions. Functions that could reasonably be expected to implement
collation, but do not yet support collation, return an error when used with collation. These error messages are
displayed not only when calling the COLLATE
function, but also when calling a
string function on a column that was defined as collated in the CREATE TABLE
or ALTER TABLE
statement
that created that column.
Currently, collation influences only simple comparison operations.
For example, POSITION('abc' in COLLATE('ABC', 'en-ci'))
does not find abc
in ABC
, even though case-insensitive collation is specified.
Functions that Support Collation¶
These functions support collation:
Some of these functions have limitations on their use with collation. For details, see the documentation of each specific function.
This list might expand over time.
Caution
Some SQL operators and predicates, such as ||
(concatenation) and LIKE
, are implemented as functions
(and are available as functions, e.g. LIKE()
and CONCAT()
). If a predicate or operator is implemented as
a function, and the function does not support collation, then the predicate or operator does not support collation.
LIKE
is an example of a predicate that does not support collation.
See also Collation Limitations.
Performance Implications of Using Collation¶
Using collation can affect the performance of various database operations:
Operations involving comparisons might be slower.
This can impact simple WHERE clauses, as well as joins, sorts,
GROUP BY
operations, etc.When used with some functions in WHERE predicates. micro-partition pruning might be less efficient.
Using collation in a WHERE predicate that is different from the collation specified for the column might result in reduced pruning efficiency or completely eliminating pruning.
Additional Considerations for Using Collation¶
Remember that, despite the similarity in their names, the following collation functions return different results:
A column with a collation specification can use characters that are not from the locale for the collation, which might impact sorting.
For example, if a column is created with a
COLLATE 'en'
clause, the data in the column can contain the non-English characterÉ
. In this situation, the characterÉ
is sorted close toE
.You can specify collation operations that are not necessarily meaningful.
For example, you could specify that Polish data is compared to French data using German collation:
SELECT ... WHERE COLLATE(French_column, 'de') = Polish_column;
However, Snowflake does not recommend using the feature this way because it might return unexpected or unintended results.
Once a table column is defined, you cannot change the collation for the column. In other words, after a column has been created with a particular collation using a CREATE TABLE statement, you cannot use ALTER TABLE to change the collation.
However, you can specify a different collation in a DML statement, such as a SELECT statement, that references the column.
Collation Limitations¶
Collation is Supported only for Strings up to 8MB¶
Although the Snowflake VARCHAR data type supports strings up to 16MB, Snowflake supports collation only when the resulting string is 8MB or less. (Some collation operations can lengthen a string.)
Limitations on Collation and UDFs¶
Snowflake does not support collation with UDFs (user-defined functions):
You cannot return a collated string value from a UDF; the server complains that the actual return type is incompatible with the declared return type.
If you pass a collated string value to a UDF, the collation information is not passed; the UDF sees the string as an uncollated string.
Collation Not Supported for Strings in VARIANT, ARRAY, or OBJECT¶
Strings stored inside a VARIANT, OBJECT, or ARRAY do not include a collation specification. Therefore:
Comparison of these values always uses the “utf8” collation.
When a VARCHAR value with a collation specification is used to construct an ARRAY, OBJECT, or VARIANT value, the collation specification is not preserved.
Users can still compare a value stored inside an ARRAY, OBJECT, or VARIANT by extracting the value, casting to VARCHAR, and adding a collation specification. For example:
COLLATE(VARIANT_COL:fld1::VARCHAR, 'en-ci') = VARIANT_COL:fld2::VARCHAR
Examples¶
The following statement creates a table that uses different collation for each column:
CREATE TABLE collation_demo ( uncollated_phrase VARCHAR, utf8_phrase VARCHAR COLLATE 'utf8', english_phrase VARCHAR COLLATE 'en', spanish_phrase VARCHAR COLLATE 'sp' ); INSERT INTO collation_demo (uncollated_phrase, utf8_phrase, english_phrase, spanish_phrase) VALUES ('pinata', 'pinata', 'pinata', 'piñata');
The following query on the table shows the expected values:
SELECT * FROM collation_demo; +-------------------+-------------+----------------+----------------+ | UNCOLLATED_PHRASE | UTF8_PHRASE | ENGLISH_PHRASE | SPANISH_PHRASE | |-------------------+-------------+----------------+----------------| | pinata | pinata | pinata | piñata | +-------------------+-------------+----------------+----------------+
The following query does not find a match because the character ñ
does not match n
:
SELECT * FROM collation_demo WHERE spanish_phrase = uncollated_phrase; +-------------------+-------------+----------------+----------------+ | UNCOLLATED_PHRASE | UTF8_PHRASE | ENGLISH_PHRASE | SPANISH_PHRASE | |-------------------+-------------+----------------+----------------| +-------------------+-------------+----------------+----------------+
Changing collation doesn’t force related, but unequal, characters (e.g. ñ
and n
) to be treated as equal:
SELECT * FROM collation_demo WHERE spanish_phrase = uncollated_phrase COLLATE 'sp'; +-------------------+-------------+----------------+----------------+ | UNCOLLATED_PHRASE | UTF8_PHRASE | ENGLISH_PHRASE | SPANISH_PHRASE | |-------------------+-------------+----------------+----------------| +-------------------+-------------+----------------+----------------+
The following examples demonstrate the effect of collation on sort order:
INSERT INTO collation_demo (spanish_phrase) VALUES ('piña colada'), ('Pinatubo (Mount)'), ('pint'), ('Pinta');SELECT spanish_phrase FROM collation_demo ORDER BY spanish_phrase; +------------------+ | SPANISH_PHRASE | |------------------| | piña colada | | piñata | | Pinatubo (Mount) | | pint | | Pinta | +------------------+
The following query reverses the order of ñ
and n
by changing the
collation to from ‘sp’ (Spanish) to ‘utf8’:
SELECT spanish_phrase FROM collation_demo ORDER BY COLLATE(spanish_phrase, 'utf8'); +------------------+ | SPANISH_PHRASE | |------------------| | Pinatubo (Mount) | | Pinta | | pint | | piña colada | | piñata | +------------------+
This example shows how to use the COLLATION function to view the collation for an expression, such as a column:
CREATE TABLE collation_demo2 (c1 VARCHAR COLLATE 'fr', c2 VARCHAR COLLATE ''); INSERT INTO collation_demo2 (c1, c2) VALUES ('a', 'a'), ('b', 'b');SELECT DISTINCT COLLATION(c1), COLLATION(c2) FROM collation_demo2; +---------------+---------------+ | COLLATION(C1) | COLLATION(C2) | |---------------+---------------| | fr | NULL | +---------------+---------------+
You can also use DESCRIBE TABLE to view collation information about the columns in a table:
DESC TABLE collation_demo2; +------+--------------------------------+--------+-------+---------+-------------+------------+-------+------------+---------+-------------+ | name | type | kind | null? | default | primary key | unique key | check | expression | comment | policy name | |------+--------------------------------+--------+-------+---------+-------------+------------+-------+------------+---------+-------------| | C1 | VARCHAR(16777216) COLLATE 'fr' | COLUMN | Y | NULL | N | N | NULL | NULL | NULL | NULL | | C2 | VARCHAR(16777216) | COLUMN | Y | NULL | N | N | NULL | NULL | NULL | NULL | +------+--------------------------------+--------+-------+---------+-------------+------------+-------+------------+---------+-------------+