Collation Support

Text strings in Snowflake are stored using the UTF-8 character set and, by default, strings are compared according to the Unicode codes that represent the characters in the string.

However, comparing strings based on their UTF-8 character representations might not provide the desired/expected behavior. For example:

  • If special characters in a given language do not sort according to that language’s ordering standards, then sorting might return unexpected results.

  • You might want the strings to be ordered by other rules, such as ignoring whether the characters are uppercase or lowercase.

Collation allows you to specify alternative rules for comparing strings, which can be used to compare and sort data according to a particular language or other user-specified rules.

In this Topic:

Overview

Collation allows you to explicitly specify the rules to use for comparing strings, based on:

  • Different locales (i.e. different character sets for different languages).

  • Case-sensitivity (i.e. whether to use case-sensitive or case-insensitive string comparisons without explicitly calling the UPPER or LOWER functions to convert the strings).

  • Accent-sensitivity (e.g. whether Z, Ź, and Ż are considered the same letter or different letters).

  • Punctuation-sensitivity (i.e. whether comparisons use only letters or include all characters). For example, if a comparison is punctuation-insensitive, then A-B-C and ABC are treated as equivalent.

  • Additional options, such as preferences for sorting based on the first letter in a string and trimming of leading and/or trailing blank spaces.

Uses for Collation

Collation can be used in a wide variety of operations, including (but not limited to):

Usage

Example

Link

Simple comparison

... WHERE column1 = column2 ...

WHERE

Joins

... ON table1.column1 = table2.column2 ...

JOIN

Sorting

... ORDER BY column1 ...

ORDER BY

Top-K sorting

... ORDER BY column1 LIMIT N ...

LIMIT / FETCH

Aggregation

... GROUP BY ...

GROUP BY

Window clauses

... PARTITION BY ... ORDER BY ...

Window Functions

Scalar functions

... LEAST(column1, column2, column3) ...

Scalar Functions

Aggregate functions

... MIN(column1), MAX(column1) ...

Aggregate Functions

Data clustering

... CLUSTER BY (column1) ...

Clustering Keys & Clustered Tables

Collation Control

Collation control is granular. You can explicitly specify the collation to use for:

  • Individual columns in tables.

  • Any comparison within a SQL statement (e.g. WHERE col1 = col2). If multiple collations are applied to a statement, Snowflake determines the collation to use based on precedence. For more details about precedence, see Collation Precedence in Multi-string Operations (in this topic).

Collation SQL Constructs

COLLATE Clause (for Table Column Definitions)

Adding the optional COLLATE clause to the definition of a table column indicates that the specified collation is used for comparisons and other related operations performed on the data in the column:

CREATE TABLE <table_name> ( <col_name> <col_type> COLLATE '<collation_specification>'
                           [ , <col_name> <col_type> COLLATE '<collation_specification>' ... ]
                           [ , ... ] )

If no COLLATE clause is specified for a column, Snowflake uses the default, which compares strings based on their UTF-8 character representations.

Also, Snowflake supports specifying an empty string for the collation specification (e.g. COLLATE ''), which is equivalent to specifying no collation for the column.

However, note that, due to precedence, specifying COLLATE '' for a column does not have the same effect as explicitly specifying COLLATE 'utf8'. For more details, see Collation Precedence in Multi-string Operations (in this topic).

To see whether collation has been specified for the columns in a table, use DESCRIBE TABLE (or use the COLLATION function to view the collation, if any, for a specific column).

COLLATE Function

This function uses the specified collation on the input string expression:

COLLATE( <expression> , '[<collation_specification>]' )

This function can also be called using infix notation:

<expression> COLLATE '[<collation_specification>]'

This function is particularly useful for explicitly specifying a particular collation for a particular operation (e.g. sorting), but it can also be used to:

  • Allow collation in the SELECT clause of a subquery, making all operations on the specified column in the outer query use the collation.

  • Create a table using CTAS with a specified collation.

For example:

-- Evaluates using "English case-insensitive" collation:
SELECT * FROM table t1 WHERE COLLATE(col1 , 'en-ci') = 'Tango';

-- Sorts the results using German (Deutsch) collation.
SELECT * FROM table t1 ORDER BY COLLATE(col1 , 'de');

-- Creates a table with a column using French collation.
CREATE TABLE t2 AS SELECT COLLATE(col1, 'fr') AS col1 FROM t1;

-- Creates a table with a column using French collation.
CREATE TABLE t2 AS SELECT col1 COLLATE 'fr' AS col1 FROM t1;

COLLATION Function

This function returns the collation specification used by an expression, including a table column:

COLLATION( <expression> )

If no collation has been specified for the expression, the function returns NULL.

Typically, if you use this on a column name, you will want to use DISTINCT to avoid getting one row of output for each row in the table. For example:

SELECT DISTINCT COLLATION(column1) FROM table1;

Note

This function only returns the collation specification, not its precedence level. For more details about precedence, see Collation Precedence in Multi-string Operations (in this topic).

Collation Specifications

When using a COLLATE clause (for a table column) or the COLLATE function (for an expression), you must include a collation specification, which determines the comparison logic used for the column/expression.

A collation specification consists of a string of one or more specifiers separated by a hyphen (-), in the form of:

'<specifier>[-<specifier> ...]'

The following specifiers are supported (for details, see Supported Specifiers in this topic):

  • Locale.

  • Case-sensitivity.

  • Accent-sensitivity.

  • Punctuation-sensitivity.

  • First-letter preference.

  • Case-conversion.

  • Space-trimming.

Specifiers are case-insensitive and can be in any order, except for locale, which must always be first, if used.

Specification Examples

Some examples of collation specification strings include:

  • 'de': German (Deutsch) locale.

  • 'de-ci-pi': German locale, with case-insensitive and punctuation-insensitive comparisons.

  • 'fr_CA-ai': Canadian French locale, with accent-insensitive comparisons.

  • 'en_US-trim': US English locale, with leading spaces and trailing spaces trimmed before the comparison.

You can also specify an empty string for a collation specification (e.g. COLLATE '' or COLLATE(col1, '')), which indicates to use no collation.

Supported Specifiers

Locale

Specifies the language-specific and country-specific rules to apply.

Supports valid locale strings, consisting of a language code (required) and country code (optional) in the form of language_country. Some locale examples include:

  • en - English.

  • en_US - American English.

  • fr - French.

  • fr_CA - Canadian French.

In addition, the utf8 pseudo-locale specifies to use Unicode ordering, which is the default. For more details, see Sorting Using UTF-8 vs Locale Collation (in this topic).

The locale specifier is optional, but, if used, must be the first specifier in the string.

Case-sensitivity

Determines if case should be considered when comparing values. Possible values:

  • cs - Case-sensitive (default).

  • ci - Case-insensitive.

For example:

Collation Specification

Value

Result

'en-ci'

Abc = abc

True

'en-cs' / en

Abc = abc

False

Accent-sensitivity

Determines if accented characters should be considered equal to, or different from, their base characters. Possible values:

  • as - Accent-sensitive (default).

  • ai - Accent-insensitive.

For example:

Collation Specification

Value

Result

Notes

'fr-ai'

E = É

True

'fr-as' / 'fr'

E = É

False

'en-ai'

a = ą

True

In English, these letters are treated as having only accent differences, so specifying account-insensitivity results in the values comparing as equal.

'pl-ai'

a = ą

False

In Polish, these letters are treated as separate base letters, so they always compare as unequal regardless of whether accent-insensitivity is specified.

'pl-as' / 'pl'

a = ą

False

Note that the rules for accent-sensitivity and collation vary between languages. For example, in some languages, collation is always accent-sensitive and you cannot turn it off even by specifying accent-insensitive collation.

Punctuation-sensitivity

Determines if non-letter characters matter. Possible values:

  • ps - Punctuation-sensitive.

  • pi - Punctuation-insensitive.

Note that the default is locale-specific (i.e. if punctuation-sensitivity is not specified, locale-specific rules are used). In most cases, the rules are equivalent to ps.

For example:

Collation Specification

Value

Result

Notes

'en-pi'

A-B-C = ABC

True

'en-ps'

A-B-C = ABC

False

First-letter Preference

Determines if, when sorting, uppercase or lowercase letters should be first. Possible values:

  • fl - Lowercase letters sorted first.

  • fu - Uppercase letters sorted first.

Note that the default is locale-specific (i.e. if no value is specified, locale-specific ordering is used). In most cases, the ordering is equivalent to fl.

Also, this specifier has no impact on equality comparisons.

Case-conversion

Results in strings being converted to lowercase or uppercase before comparisons. In some situations, this is faster than full locale-specific collation. Possible values:

  • upper - Convert the string to uppercase before comparisons.

  • lower - Convert the string to lowercase before comparisons.

Note that this specifier does not have a default (i.e. if no value is specified, neither of the conversions occurs).

Space-trimming

Removes leading/trailing spaces from strings before comparisons. This functionality can be useful for performing comparisons equivalent (except in extremely rare corner cases) in semantics to the SQL CHAR data type.

Possible values:

  • trim - Remove both leading and trailing spaces before comparisons.

  • ltrim - Remove only leading spaces before comparisons.

  • rtrim - Remove only trailing spaces before comparisons.

Note that this specifier does not have a default (i.e. if no value is specified, trimming is not performed).

For example:

Collation Specification

Value

Result

Notes

'en-trim'

__ABC_ = ABC

True

For the purposes of these examples, underscore characters represent blank spaces.

'en-ltrim'

__ABC_ = ABC

False

'en-rtrim'

__ABC_ = ABC

False

'en'

__ABC_ = ABC

False

Collation Implementation Details

Sorting Using UTF-8 vs Locale Collation

Strings are always stored internally in Snowflake in UTF-8, and can represent any character in any language supported by UTF-8; therefore, the default collation is UTF-8 (i.e. 'utf8').

UTF-8 collation is based on the numeric representation of the character as opposed to the alphabetic order of the character.

This is analogous to sorting by the ordinal value of each ASCII character, which is important to note because uppercase letters have ordinal values lower than lowercase letters:

A = 64
B = 66
...
a = 96
b = 97
...

As a result:

  • If you sort in UTF-8 order, all uppercase letters are returned before all lowercase letters:

    A , B , … , Y , Z , … , a , b , … , y , z

  • In contrast, the 'en' collation specification sorts alphabetically (instead of using the UTF-8 internal representation), resulting in both A and a returned before both B and b:

    a , A , b , B , …

Additionally, the differences between the cs and ci case-sensitivity specifiers impact sorting:

  • cs (case-sensitive) always returns the lowercase version of a letter before the uppercase version of the same letter. For example, using 'en-cs':

    a , A , b , B , …

    Note that case-sensitive is the default and, therefore, 'en-cs' and 'en' are equivalent.

  • ci (case-insensitive) returns uppercase and lowercase versions of letters randomly with respect to each other, but still before both uppercase and lowercase version of later laters. For example, using 'en-ci':

    A , a , b , B , …

Collation Precedence in Multi-string Operations

When performing an operation on two (or more) strings, different collations might be specified for different strings. Determining the collation to apply depends on how collation was specified for each input and the precedence of each specificier.

There are 3 precedence levels (from highest to lowest):

Function

Collation is specified using the COLLATE Function function in a SQL statement.

Column

Collation was specified in the column definition.

None

No collation is/was specified for a given expression/column, or collation with an empty specification is/was used (e.g. COLLATE(col1, '') or col1 STRING COLLATE '').

When determining the collation to use, the collation specification with the highest precedence is used. If multiple collations are specified and they have the same precedence level, their values are compared, and if they are not equal, an error is returned.

For example, consider a table with the following column-level collation specifications:

CREATE OR REPLACE TABLE collation_precedence_example(
  col1    VARCHAR,               -- equivalent to COLLATE ''
  col2_fr VARCHAR COLLATE 'fr',  -- French locale
  col3_de VARCHAR COLLATE 'de'   -- German locale
);

If the table is used in a statement comparing two strings, collation is applied as follows:

-- Uses the 'fr' collation because the precedence for col2_fr is higher than
-- the precendence for col1.
... WHERE col1 = col2_fr ...

-- Uses the 'en' collation, because it is explicitly specified in the statement,
-- which takes precedence over the collation for col2_fr.
... WHERE col1 COLLATE 'en' = col2_fr ...

-- Returns an error because the expressions have different collations at the same
-- precedence level.
... WHERE col2_fr = col3_de ...

-- Uses the 'de' collation because collation for col2_fr has been removed.
... WHERE col2_fr COLLATE '' = col3_de ...

-- Returns an error because the expressions have different collations at the same
-- precedence level.
... WHERE col2_fr COLLATE 'en' = col3_de COLLATE 'de' ...

Note

Even though Snowflake’s default collation is 'utf8', specifying an empty string (or specifying no collation) is different from explicitly specifying 'utf8' because explicit collation has higher precedence than no collation. The last two statements in the code examples below show the difference:

CREATE TABLE collation1 (v VARCHAR COLLATE 'sp');
INSERT INTO collation1 (v) VALUES ('ñ');

This example works:

SELECT * FROM collation_precedence_example2 WHERE s1 = s3;
+----+----+----+
| S1 | S2 | S3 |
|----+----+----|
+----+----+----+

This example causes an error:

SELECT * FROM collation_precedence_example2 WHERE s2 = s3;

Output:

SELECT * FROM collation_precedence_example2 WHERE s2 = s3;

Limited Support for Collation in Built-in Functions

Collation is supported in only a subset of string functions. Functions that could reasonably be expected to implement collation, but do not yet support collation, return an error when used with collation. These error messages are displayed not only when calling the COLLATE, but also when calling a string function on a column that was defined as collated in the CREATE TABLE or ALTER TABLE statement that created that column.

Currently, collation influences only simple comparison operations.

For example, POSITION('abc' in COLLATE('ABC', 'en-ci')) does not find abc in ABC, even though case-insensitive collation is specified.

Functions that Support Collation

These functions support collation:

Some of these functions have limitations on their use with collation. For details, see the documentation of each specific function.

This list might expand over time.

Performance Implications of Using Collation

Using collation can affect the performance of various database operations:

  • Operations involving comparisons might be slower.

    This can impact simple WHERE clauses, as well as joins, sorts, GROUP BY operations, etc.

  • When used with some functions in WHERE predicates. micro-partition pruning might be less efficient.

  • Using collation in a WHERE predicate that is different from the collation specified for the column might result in reduced pruning efficiency or completely eliminating pruning.

Additional Considerations for Using Collation

  • Remember that, despite the similarity in their names, the following collation functions return different results:

    • COLLATE explicitly specifies which collation to use.

    • COLLATION shows which collation is used if none is specified explicitly.

  • A column with a collation specification can use characters that are not from the locale for the collation, which might impact sorting.

    For example, if a column is created with a COLLATE 'en' clause, the data in the column can contain the non-English character É. In this situation, the character É is sorted close to E.

  • You can specify collation operations that are not necessarily meaningful.

    For example, you could specify that Polish data is compared to French data using German collation:

    SELECT ... WHERE COLLATE(French_column, 'de') = Polish_column;
    

    However, Snowflake does not recommend using the feature this way because it might return unexpected or unintended results.

  • Once a table column is defined, you cannot change the collation for the column. In other words, after a column has been created with a particular collation using a CREATE TABLE statement, you cannot use ALTER TABLE to change the collation.

    However, you can specify a different collation in a DML statement, such as a SELECT statement, that references the column.

Collation Limitations

Limitations on Collation and UDFs

Collation and UDFs (user-defined functions) do not always work together. For example, you cannot return a collated string value from a UDF; the server complains that the actual return type is incompatible with the declared return type. You also cannot pass a collated string value to a UDF.

Some of these restrictions might be removed in the future.

Collation Not Supported for Strings in VARIANT, ARRAY, or OBJECT

Strings stored inside a VARIANT, OBJECT, or ARRAY do not include a collation specification. Therefore:

  • Comparison of these values always uses the “utf8” collation.

  • When a VARCHAR value with a collation specification is used to construct an ARRAY, OBJECT, or VARIANT value, the collation specification is not preserved.

  • Users can still compare a value stored inside an ARRAY, OBJECT, or VARIANT by extracting the value, casting to VARCHAR, and adding a collation specification, e.g.:

    COLLATE(VARIANT_COL:fld1::VARCHAR, 'en-ci') = VARIANT_COL:fld2::VARCHAR
    

Examples

The following statement creates a table that uses different collation for each column:

CREATE TABLE collation_demo (
  uncollated_phrase VARCHAR, 
  utf8_phrase VARCHAR COLLATE 'utf8',
  english_phrase VARCHAR COLLATE 'en',
  spanish_phrase VARCHAR COLLATE 'sp'
  );

INSERT INTO collation_demo (uncollated_phrase, utf8_phrase, english_phrase, spanish_phrase) 
   VALUES ('pinata', 'pinata', 'pinata', 'piñata');

The following query on the table shows the expected values:

SELECT * FROM collation_demo;
+-------------------+-------------+----------------+----------------+
| UNCOLLATED_PHRASE | UTF8_PHRASE | ENGLISH_PHRASE | SPANISH_PHRASE |
|-------------------+-------------+----------------+----------------|
| pinata            | pinata      | pinata         | piñata         |
+-------------------+-------------+----------------+----------------+

The following query does not find a match because the character ñ does not match n:

SELECT * FROM collation_demo WHERE spanish_phrase = uncollated_phrase;
+-------------------+-------------+----------------+----------------+
| UNCOLLATED_PHRASE | UTF8_PHRASE | ENGLISH_PHRASE | SPANISH_PHRASE |
|-------------------+-------------+----------------+----------------|
+-------------------+-------------+----------------+----------------+

Changing collation doesn’t force related, but unequal, characters (e.g. ñ and n) to be treated as equal:

SELECT * FROM collation_demo 
    WHERE spanish_phrase = uncollated_phrase COLLATE 'sp';
+-------------------+-------------+----------------+----------------+
| UNCOLLATED_PHRASE | UTF8_PHRASE | ENGLISH_PHRASE | SPANISH_PHRASE |
|-------------------+-------------+----------------+----------------|
+-------------------+-------------+----------------+----------------+

The following examples demonstrate the effect of collation on sort order:

INSERT INTO collation_demo (spanish_phrase) VALUES
   ('piña colada'),
   ('Pinatubo (Mount)'),
   ('pint'),
   ('Pinta');
SELECT spanish_phrase FROM collation_demo 
  ORDER BY spanish_phrase;
+------------------+
| SPANISH_PHRASE   |
|------------------|
| piña colada      |
| piñata           |
| Pinatubo (Mount) |
| pint             |
| Pinta            |
+------------------+

The following query reverses the order of ñ and n by changing the collation to from ‘sp’ (Spanish) to ‘utf8’:

SELECT spanish_phrase FROM collation_demo 
  ORDER BY COLLATE(spanish_phrase, 'utf8');
+------------------+
| SPANISH_PHRASE   |
|------------------|
| Pinatubo (Mount) |
| Pinta            |
| pint             |
| piña colada      |
| piñata           |
+------------------+

This example shows how to use the COLLATION function to view the collation for an expression, such as a column:

CREATE TABLE collation_demo2 (c1 VARCHAR COLLATE 'fr', c2 VARCHAR COLLATE '');
INSERT INTO collation_demo2 (c1, c2) VALUES
    ('a', 'a'),
    ('b', 'b');
SELECT DISTINCT COLLATION(c1), COLLATION(c2) FROM collation_demo2;
+---------------+---------------+
| COLLATION(C1) | COLLATION(C2) |
|---------------+---------------|
| fr            | NULL          |
+---------------+---------------+

You can also use DESCRIBE TABLE to view collation information about the columns in a table:

DESC TABLE collation_demo2;
+------+--------------------------------+--------+-------+---------+-------------+------------+-------+------------+---------+
| name | type                           | kind   | null? | default | primary key | unique key | check | expression | comment |
|------+--------------------------------+--------+-------+---------+-------------+------------+-------+------------+---------|
| C1   | VARCHAR(16777216) COLLATE 'fr' | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| C2   | VARCHAR(16777216)              | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
+------+--------------------------------+--------+-------+---------+-------------+------------+-------+------------+---------+