Text strings in Snowflake are stored using the UTF-8 character set and, by default, strings are compared according to the Unicode codes that represent the characters in the string.
However, comparing strings based on their UTF-8 character representations might not provide the desired/expected behavior. For example:
If special characters in a given language do not sort according to that language’s ordering standards, then sorting might return unexpected results.
You might want the strings to be ordered by other rules, such as ignoring whether the characters are uppercase or lowercase.
Collation allows you to specify alternative rules for comparing strings, which can be used to compare and sort data according to a particular language or other user-specified rules.
In this Topic:
Collation allows you to explicitly specify the rules to use for comparing strings, based on:
Different locales (i.e. different character sets for different languages).
Accent-sensitivity (e.g. whether
Żare considered the same letter or different letters).
Punctuation-sensitivity (i.e. whether comparisons use only letters or include all characters). For example, if a comparison is punctuation-insensitive, then
ABCare treated as equivalent.
Additional options, such as preferences for sorting based on the first letter in a string and trimming of leading and/or trailing blank spaces.
Uses for Collation¶
Collation can be used in a wide variety of operations, including (but not limited to):
Collation control is granular. You can explicitly specify the collation to use for:
Individual columns in tables.
Any comparison within a SQL statement (e.g.
WHERE col1 = col2). If multiple collations are applied to a statement, Snowflake determines the collation to use based on precedence. For more details about precedence, see Collation Precedence in Multi-string Operations (in this topic).
Collation SQL Constructs¶
COLLATE Clause (for Table Column Definitions)¶
Adding the optional
COLLATE clause to the definition of a table column indicates that the specified collation is used for comparisons and other related operations performed on the data in
CREATE TABLE <table_name> ( <col_name> <col_type> COLLATE '<collation_specification>' [ , <col_name> <col_type> COLLATE '<collation_specification>' ... ] [ , ... ] )
COLLATE clause is specified for a column, Snowflake uses the default, which compares strings based on their UTF-8 character representations.
Also, Snowflake supports specifying an empty string for the collation specification (e.g.
COLLATE ''), which is equivalent to specifying no collation for the column.
However, note that, due to precedence, specifying
COLLATE '' for a column does not have the same effect as explicitly specifying
COLLATE 'utf8'. For more details, see
Collation Precedence in Multi-string Operations (in this topic).
This function uses the specified collation on the input string expression:
COLLATE( <expression> , '[<collation_specification>]' )
This function can also be called using infix notation:
<expression> COLLATE '[<collation_specification>]'
This function is particularly useful for explicitly specifying a particular collation for a particular operation (e.g. sorting), but it can also be used to:
Allow collation in the SELECT clause of a subquery, making all operations on the specified column in the outer query use the collation.
Create a table using CTAS with a specified collation.
-- Evaluates using "English case-insensitive" collation: SELECT * FROM table t1 WHERE COLLATE(col1 , 'en-ci') = 'Tango'; -- Sorts the results using German (Deutsch) collation. SELECT * FROM table t1 ORDER BY COLLATE(col1 , 'de'); -- Creates a table with a column using French collation. CREATE TABLE t2 AS SELECT COLLATE(col1, 'fr') AS col1 FROM t1; -- Creates a table with a column using French collation. CREATE TABLE t2 AS SELECT col1 COLLATE 'fr' AS col1 FROM t1;
This function returns the collation specification used by an expression, including a table column:
COLLATION( <expression> )
If no collation has been specified for the expression, the function returns
Typically, if you use this on a column name, you will want to use
DISTINCT to avoid getting one row of output for each row in the table. For example:
SELECT DISTINCT COLLATION(column1) FROM table1;
This function only returns the collation specification, not its precedence level. For more details about precedence, see Collation Precedence in Multi-string Operations (in this topic).
When using a COLLATE clause (for a table column) or the COLLATE function (for an expression), you must include a collation specification, which determines the comparison logic used for the column/expression.
A collation specification consists of a string of one or more specifiers separated by a hyphen (
-), in the form of:
The following specifiers are supported (for details, see Supported Specifiers in this topic):
Specifiers are case-insensitive and can be in any order, except for locale, which must always be first, if used.
Some examples of collation specification strings include:
'de': German (Deutsch) locale.
'de-ci-pi': German locale, with case-insensitive and punctuation-insensitive comparisons.
'fr_CA-ai': Canadian French locale, with accent-insensitive comparisons.
'en_US-trim': US English locale, with leading spaces and trailing spaces trimmed before the comparison.
You can also specify an empty string for a collation specification (e.g.
COLLATE '' or
COLLATE(col1, '')), which indicates to use no collation.
Specifies the language-specific and country-specific rules to apply.
Supports valid locale strings, consisting of a language code (required) and country code (optional) in the form of
language_country. Some locale examples include:
en_US- American English.
fr_CA- Canadian French.
In addition, the
utf8pseudo-locale specifies to use Unicode ordering, which is the default. For more details, see Sorting Using UTF-8 vs Locale Collation (in this topic).
The locale specifier is optional, but, if used, must be the first specifier in the string.
Determines if case should be considered when comparing values. Possible values:
cs- Case-sensitive (default).
Abc = abc
Abc = abc
Determines if accented characters should be considered equal to, or different from, their base characters. Possible values:
as- Accent-sensitive (default).
E = É
E = É
a = ą
In English, these letters are treated as having only accent differences, so specifying account-insensitivity results in the values comparing as equal.
a = ą
In Polish, these letters are treated as separate base letters, so they always compare as unequal regardless of whether accent-insensitivity is specified.
a = ą
Note that the rules for accent-sensitivity and collation vary between languages. For example, in some languages, collation is always accent-sensitive and you cannot turn it off even by specifying accent-insensitive collation.
Determines if non-letter characters matter. Possible values:
Note that the default is locale-specific (i.e. if punctuation-sensitivity is not specified, locale-specific rules are used). In most cases, the rules are equivalent to
A-B-C = ABC
A-B-C = ABC
- First-letter Preference
Determines if, when sorting, uppercase or lowercase letters should be first. Possible values:
fl- Lowercase letters sorted first.
fu- Uppercase letters sorted first.
Note that the default is locale-specific (i.e. if no value is specified, locale-specific ordering is used). In most cases, the ordering is equivalent to
Also, this specifier has no impact on equality comparisons.
Results in strings being converted to lowercase or uppercase before comparisons. In some situations, this is faster than full locale-specific collation. Possible values:
upper- Convert the string to uppercase before comparisons.
lower- Convert the string to lowercase before comparisons.
Note that this specifier does not have a default (i.e. if no value is specified, neither of the conversions occurs).
Removes leading/trailing spaces from strings before comparisons. This functionality can be useful for performing comparisons equivalent (except in extremely rare corner cases) in semantics to the SQL
trim- Remove both leading and trailing spaces before comparisons.
ltrim- Remove only leading spaces before comparisons.
rtrim- Remove only trailing spaces before comparisons.
Note that this specifier does not have a default (i.e. if no value is specified, trimming is not performed).
__ABC_ = ABC
For the purposes of these examples, underscore characters represent blank spaces.
__ABC_ = ABC
__ABC_ = ABC
__ABC_ = ABC
Collation Implementation Details¶
Sorting Using UTF-8 vs Locale Collation¶
Strings are always stored internally in Snowflake in UTF-8, and can represent any character in any language supported by UTF-8; therefore, the default collation is UTF-8 (i.e.
UTF-8 collation is based on the numeric representation of the character as opposed to the alphabetic order of the character.
This is analogous to sorting by the ordinal value of each ASCII character, which is important to note because uppercase letters have ordinal values lower than lowercase letters:
A = 64
B = 66
a = 96
b = 97
As a result:
If you sort in UTF-8 order, all uppercase letters are returned before all lowercase letters:
B, … ,
Z, … ,
b, … ,
In contrast, the
'en'collation specification sorts alphabetically (instead of using the UTF-8 internal representation), resulting in both
areturned before both
Additionally, the differences between the
ci case-sensitivity specifiers impact sorting:
cs(case-sensitive) always returns the lowercase version of a letter before the uppercase version of the same letter. For example, using
Note that case-sensitive is the default and, therefore,
ci(case-insensitive) returns uppercase and lowercase versions of letters randomly with respect to each other, but still before both uppercase and lowercase version of later laters. For example, using
Collation Precedence in Multi-string Operations¶
When performing an operation on two (or more) strings, different collations might be specified for different strings. Determining the collation to apply depends on how collation was specified for each input and the precedence of each specificier.
There are 3 precedence levels (from highest to lowest):
Collation is specified using the COLLATE Function function in a SQL statement.
Collation was specified in the column definition.
No collation is/was specified for a given expression/column, or collation with an empty specification is/was used (e.g.
col1 STRING COLLATE '').
When determining the collation to use, the collation specification with the highest precedence is used. If multiple collations are specified and they have the same precedence level, their values are compared, and if they are not equal, an error is returned.
For example, consider a table with the following column-level collation specifications:
CREATE OR REPLACE TABLE collation_precedence_example( col1 VARCHAR, -- equivalent to COLLATE '' col2_fr VARCHAR COLLATE 'fr', -- French locale col3_de VARCHAR COLLATE 'de' -- German locale );
If the table is used in a statement comparing two strings, collation is applied as follows:
-- Uses the 'fr' collation because the precedence for col2_fr is higher than -- the precendence for col1. ... WHERE col1 = col2_fr ... -- Uses the 'en' collation, because it is explicitly specified in the statement, -- which takes precedence over the collation for col2_fr. ... WHERE col1 COLLATE 'en' = col2_fr ... -- Returns an error because the expressions have different collations at the same -- precedence level. ... WHERE col2_fr = col3_de ... -- Uses the 'de' collation because collation for col2_fr has been removed. ... WHERE col2_fr COLLATE '' = col3_de ... -- Returns an error because the expressions have different collations at the same -- precedence level. ... WHERE col2_fr COLLATE 'en' = col3_de COLLATE 'de' ...
Even though Snowflake’s default collation is
'utf8', specifying an empty string (or specifying no collation) is
different from explicitly specifying
'utf8' because explicit collation has higher precedence than no collation.
The last two statements in the code examples below show the difference:
CREATE TABLE collation1 (v VARCHAR COLLATE 'sp'); INSERT INTO collation1 (v) VALUES ('ñ');
This example works:
SELECT * FROM collation_precedence_example2 WHERE s1 = s3; +----+----+----+ | S1 | S2 | S3 | |----+----+----| +----+----+----+
This example causes an error:
SELECT * FROM collation_precedence_example2 WHERE s2 = s3;
SELECT * FROM collation_precedence_example2 WHERE s2 = s3;
Limited Support for Collation in Built-in Functions¶
Collation is supported in only a subset of string functions. Functions that could reasonably be expected to implement
collation, but do not yet support collation, return an error when used with collation. These error messages are
displayed not only when calling the
COLLATE, but also when calling a
string function on a column that was defined as collated in the
CREATE TABLE or
ALTER TABLE statement
that created that column.
Currently, collation influences only simple comparison operations.
POSITION('abc' in COLLATE('ABC', 'en-ci')) does not find
ABC, even though case-insensitive collation is specified.
Functions that Support Collation¶
These functions support collation:
Some of these functions have limitations on their use with collation. For details, see the documentation of each specific function.
This list might expand over time.
Performance Implications of Using Collation¶
Using collation can affect the performance of various database operations:
Operations involving comparisons might be slower.
This can impact simple WHERE clauses, as well as joins, sorts,
GROUP BYoperations, etc.
When used with some functions in WHERE predicates. micro-partition pruning might be less efficient.
Using collation in a WHERE predicate that is different from the collation specified for the column might result in reduced pruning efficiency or completely eliminating pruning.
Additional Considerations for Using Collation¶
Remember that, despite the similarity in their names, the following collation functions return different results:
A column with a collation specification can use characters that are not from the locale for the collation, which might impact sorting.
For example, if a column is created with a
COLLATE 'en'clause, the data in the column can contain the non-English character
É. In this situation, the character
Éis sorted close to
You can specify collation operations that are not necessarily meaningful.
For example, you could specify that Polish data is compared to French data using German collation:
SELECT ... WHERE COLLATE(French_column, 'de') = Polish_column;
However, Snowflake does not recommend using the feature this way because it might return unexpected or unintended results.
Once a table column is defined, you cannot change the collation for the column. In other words, after a column has been created with a particular collation using a CREATE TABLE statement, you cannot use ALTER TABLE to change the collation.
However, you can specify a different collation in a DML statement, such as a SELECT statement, that references the column.
Limitations on Collation and UDFs¶
Collation and UDFs (user-defined functions) do not always work together. For example, you cannot return a collated string value from a UDF; the server complains that the actual return type is incompatible with the declared return type. You also cannot pass a collated string value to a UDF.
Some of these restrictions might be removed in the future.
Collation Not Supported for Strings in VARIANT, ARRAY, or OBJECT¶
Strings stored inside a VARIANT, OBJECT, or ARRAY do not include a collation specification. Therefore:
Comparison of these values always uses the “utf8” collation.
When a VARCHAR value with a collation specification is used to construct an ARRAY, OBJECT, or VARIANT value, the collation specification is not preserved.
Users can still compare a value stored inside an ARRAY, OBJECT, or VARIANT by extracting the value, casting to VARCHAR, and adding a collation specification, e.g.:
COLLATE(VARIANT_COL:fld1::VARCHAR, 'en-ci') = VARIANT_COL:fld2::VARCHAR
The following statement creates a table that uses different collation for each column:
CREATE TABLE collation_demo ( uncollated_phrase VARCHAR, utf8_phrase VARCHAR COLLATE 'utf8', english_phrase VARCHAR COLLATE 'en', spanish_phrase VARCHAR COLLATE 'sp' ); INSERT INTO collation_demo (uncollated_phrase, utf8_phrase, english_phrase, spanish_phrase) VALUES ('pinata', 'pinata', 'pinata', 'piñata');
The following query on the table shows the expected values:
SELECT * FROM collation_demo; +-------------------+-------------+----------------+----------------+ | UNCOLLATED_PHRASE | UTF8_PHRASE | ENGLISH_PHRASE | SPANISH_PHRASE | |-------------------+-------------+----------------+----------------| | pinata | pinata | pinata | piñata | +-------------------+-------------+----------------+----------------+
The following query does not find a match because the character
ñ does not match
SELECT * FROM collation_demo WHERE spanish_phrase = uncollated_phrase; +-------------------+-------------+----------------+----------------+ | UNCOLLATED_PHRASE | UTF8_PHRASE | ENGLISH_PHRASE | SPANISH_PHRASE | |-------------------+-------------+----------------+----------------| +-------------------+-------------+----------------+----------------+
Changing collation doesn’t force related, but unequal, characters (e.g.
n) to be treated as equal:
SELECT * FROM collation_demo WHERE spanish_phrase = uncollated_phrase COLLATE 'sp'; +-------------------+-------------+----------------+----------------+ | UNCOLLATED_PHRASE | UTF8_PHRASE | ENGLISH_PHRASE | SPANISH_PHRASE | |-------------------+-------------+----------------+----------------| +-------------------+-------------+----------------+----------------+
The following examples demonstrate the effect of collation on sort order:
INSERT INTO collation_demo (spanish_phrase) VALUES ('piña colada'), ('Pinatubo (Mount)'), ('pint'), ('Pinta');SELECT spanish_phrase FROM collation_demo ORDER BY spanish_phrase; +------------------+ | SPANISH_PHRASE | |------------------| | piña colada | | piñata | | Pinatubo (Mount) | | pint | | Pinta | +------------------+
The following query reverses the order of
n by changing the
collation to from ‘sp’ (Spanish) to ‘utf8’:
SELECT spanish_phrase FROM collation_demo ORDER BY COLLATE(spanish_phrase, 'utf8'); +------------------+ | SPANISH_PHRASE | |------------------| | Pinatubo (Mount) | | Pinta | | pint | | piña colada | | piñata | +------------------+
This example shows how to use the COLLATION function to view the collation for an expression, such as a column:
CREATE TABLE collation_demo2 (c1 VARCHAR COLLATE 'fr', c2 VARCHAR COLLATE ''); INSERT INTO collation_demo2 (c1, c2) VALUES ('a', 'a'), ('b', 'b');SELECT DISTINCT COLLATION(c1), COLLATION(c2) FROM collation_demo2; +---------------+---------------+ | COLLATION(C1) | COLLATION(C2) | |---------------+---------------| | fr | NULL | +---------------+---------------+
You can also use DESCRIBE TABLE to view collation information about the columns in a table:
DESC TABLE collation_demo2; +------+--------------------------------+--------+-------+---------+-------------+------------+-------+------------+---------+ | name | type | kind | null? | default | primary key | unique key | check | expression | comment | |------+--------------------------------+--------+-------+---------+-------------+------------+-------+------------+---------| | C1 | VARCHAR(16777216) COLLATE 'fr' | COLUMN | Y | NULL | N | N | NULL | NULL | NULL | | C2 | VARCHAR(16777216) | COLUMN | Y | NULL | N | N | NULL | NULL | NULL | +------+--------------------------------+--------+-------+---------+-------------+------------+-------+------------+---------+