Clustering Keys & Clustered Tables

In general, Snowflake produces well-clustered data in tables; however, over time, particularly as DML occurs on very large tables (as defined by the amount of data in the table, not the number of rows), the data in some table rows may no longer cluster on desired dimensions.

To improve the clustering of the underlying micro-partitions, you could manually sort rows on key columns and re-insert them into the table; however, these tasks likely would be cumbersome and expensive.

Instead, Snowflake supports automating these tasks by designating one or more columns/expressions in a table as clustering keys. A table with clustering keys defined is considered to be clustered.

Important

Clustering keys are not intended for all tables. The size of a table, as well as the query performance for the table, should dictate whether to define clustering keys for the table. In particular, to see performance improvements from clustering keys, a table has to be large enough to reside on many micro-partitions, and the clustering keys have to provide sufficient filtering to select a subset of these micro-partitions.

In general, tables in the multi-terabyte (TB) size range will experience the most benefit from clustering keys, particularly if you perform regular/continual DML on these tables.

Sections in this topic:

What are Clustering Keys?

Clustering keys are a subset of columns in a table (or expressions on a table) that are explicitly designated to co-locate the table data in the same micro-partitions. This is useful for very large tables where the ordering was not ideal (at the time the data was inserted/loaded) or extensive DML has caused the table’s clustering to degrade.

Some general indicators that can help determine whether to define clustering keys for a table include:

  • Queries on the table are running slower than expected or have noticeably degraded over time.
  • The clustering depth for the table is large.

Clustering keys can be defined when creating a table (using the CREATE TABLE command) or afterward (using the ALTER TABLE command). Clustering keys can also be altered or dropped at any time.

Benefits of Defining Clustering Keys (for Very Large Tables)

Using clustering keys to co-locate similar rows in the same micro-partitions enables several benefits for very large tables, including:

  • Improved scan efficiency in queries by skipping data that does not match filtering predicates.
  • Better column compression than in tables with no clustering keys. This is especially true when other columns are strongly correlated with the clustering keys.
  • After the keys have been defined, no additional administration is required, unless you chose to drop the keys. All future maintenance on the rows in the table (to ensure optimal clustering) is performed automatically by Snowflake.

Note

After you define clustering keys for a table, the rows are not necessarily updated immediately. Snowflake only performs automated maintenance if the table will benefit from the operation. For more details, see Reclustering (in this topic) and Automatic Clustering.

Strategies for Selecting Clustering Keys

Selecting the right columns/expressions as clustering keys can dramatically impact query performance. Analysis of your workload will usually yield some ideal clustering key candidates. For example:

  • If you typically run queries on one column (e.g. a date column), you can use the column as the clustering key for the table.
  • If you typically query a table by two dimensions (e.g. application_id and user_id columns), clustering on those columns can help to improve the query performance for the table.

The number of distinct values (i.e. cardinality) in a column/expression is a critical aspect of selecting it as a clustering key. It is important to choose a clustering key that has:

  • A large enough number of distinct values to enable effective pruning on the table.
  • A small enough number of distinct values to allow Snowflake to effectively group rows in the same micro-partitions.

A column with very low cardinality (e.g. a column containing gender values) would only yield minimal pruning. In contrast, a column with very high cardinality (e.g. a column containing timestamp or UUID values) typically is not a good candidate to use as a clustering key directly.

Tip

  • If you are defining multiple clustering keys for a table, the order in which the keys are specified in the CLUSTER BY clause is important. As a general rule, Snowflake recommends ordering the keys from lowest cardinality to highest cardinality.

  • If you want to use a column with very high cardinality as a clustering key, we recommend defining the key as an expression on the column, rather than on the column directly, to reduce the number of distinct values.

    For example, a fact table has a TIMESTAMP column c_timestamp containing many discrete values (many more than the number of micro-partitions in the table). A clustering key could be defined on the column by casting the values to dates instead of timestamps (e.g. to_date(c_timestamp)). This would reduce the cardinality to the total number of days, which produces much better pruning results.

Reclustering

As DML operations (INSERT, UPDATE, DELETE, MERGE, COPY) are performed on a clustered table, the data in the table may become less clustered. Periodic/regular reclustering of the table is required to maintain optimal clustering.

During reclustering, Snowflake uses the clustering keys for a clustered table to reorganize the column data, so that related records are relocated to the same micro-partition. This DML operation deletes the affected records and re-inserts them, grouped according to the clustering keys.

Note

Reclustering in Snowflake is automatic; no maintenance is needed. For more details, see Automatic Clustering.

However, for certain accounts, manual reclustering has been deprecated, but is still allowed. For more details see Manual Reclustering.

Credit and Storage Impact of Reclustering

Similar to all DML operations in Snowflake, reclustering consumes credits. The number of credits consumed depends on the size of the table and the amount of data that needs to be reclustered.

Reclustering also results in storage costs. Each time data is reclustered, the rows are physically grouped based on the clustering keys for the table, which results in Snowflake generating new micro-partitions for the table. Adding even a small number of rows to a table can cause all micro-partitions that contain those values to be recreated.

This process can create significant data turnover because the original micro-partitions are marked as deleted, but retained in the system to enable Time Travel and Fail-safe. The original micro-partitions are purged only after both the Time Travel retention period and the subsequent Fail-safe period have passed (i.e. minimum of 8 days and up to 97 days for extended Time Travel, if you are using Snowflake Enterprise Edition or higher). This typically results in increased storage costs. For more information, see Snowflake Time Travel & Fail-safe.

Important

Before defining clustering keys for a table, you should consider the associated credit and storage costs.

Reclustering Example

Building on the clustering diagram from the previous topic, this diagram illustrates how reclustering a table can help reduce scanning of micro-partitions to improve query performance:

Logical table structures after reclustering
  • To start, table t1 is naturally clustered by date across micro-partitions 1-4.
  • The query (in the diagram) requires scanning micro-partitions 1, 2, and 3.
  • date and id are defined as clustering keys. When the table is reclustered, new micro-partitions (5-8) are created.
  • After reclustering, the same query only scans micro-partitions 5 and 6.

In addition, after reclustering:

  • Micro-partition 5 has reached a constant state (i.e. it cannot be improved by reclustering) and is therefore excluded when computing depth and overlap for future maintenance. In a well-clustered large table, most micro-partitions will fall into this category.
  • The original micro-partitions (1-4) are marked as deleted, but are not purged from the system; they are retained for Time Travel and Fail-safe.

Note

This example illustrates the impact of reclustering on an extremely small scale. Extrapolated to a very large table (i.e. consisting of millions of micro-partitions or more), reclustering can have a significant impact on scanning and, therefore, query performance.

Defining Clustered Tables

Calculating the Clustering Information for a Table

Use the system function, SYSTEM$CLUSTERING_INFORMATION, to calculate clustering details, including clustering depth, for a given table. This function can be run on any columns on any table, regardless of whether the table has explicit clustering keys:

  • If a table has explicit clustering keys, the function doesn’t require any input arguments other than the name of the table.
  • If a table doesn’t have explicit clustering keys (or a table has clustering keys, but you want to calculate the ratio on other columns in the table), the function takes the desired column(s) as an additional input argument.

Defining Clustering Keys for a Table

Clustering keys can be defined when a table is created by appending a CLUSTER BY clause to CREATE TABLE:

CREATE TABLE <name> ... CLUSTER BY ( <expr1> [ , <expr2> ... ] )

Where each clustering key exprN is an expression on the table, which can be of any data type, except VARIANT, OBJECT, or ARRAY. Clustering keys can be:

  • Base columns.
  • Expressions on base columns.
  • Expressions on paths in variant columns.

For example:

-- cluster by base columns
CREATE OR REPLACE TABLE t1 (c1 DATE, c2 STRING, c3 NUMBER) CLUSTER BY (c1, c2);

SHOW TABLES LIKE 't1';

+---------------------------------+------+---------------+-------------+-------+---------+------------+------+-------+--------+----------------+
| created_on                      | name | database_name | schema_name | kind  | comment | cluster_by | rows | bytes | owner  | retention_time |
|---------------------------------+------+---------------+-------------+-------+---------+------------+------+-------+--------+----------------|
| Mon, 08 Aug 2016 15:22:07 -0700 | T1   | TESTDB        | TESTSCHEMA  | TABLE |         | (c1,c2)    |    0 |     0 | PUBLIC | 1              |
+---------------------------------+------+---------------+-------------+-------+---------+------------+------+-------+--------+----------------+

-- cluster by expressions
CREATE OR REPLACE TABLE t2 (c1 timestamp, c2 STRING, c3 NUMBER) CLUSTER BY (TO_DATE(C1), substring(c2, 0, 10));

SHOW TABLES LIKE 't2';

+---------------------------------+------+---------------+-------------+-------+---------+-------------------------------------------+------+-------+----------+----------------+
|           created_on            | name | database_name | schema_name | kind  | comment |                cluster_by                 | rows | bytes |  owner   | retention_time |
|---------------------------------+------+---------------+-------------+-------+---------+-------------------------------------------+------+-------+----------+----------------+
| Fri, 07 Apr 2017 09:46:18 -0700 | T2   | TESTDB        | TESTSCHEMA  | TABLE |         | LINEAR(TO_DATE(C1), SUBSTRING(C2, 0, 10)) | 0    | 0     | PUBLIC   | 1              |
+---------------------------------+------+---------------+-------------+-------+---------+-------------------------------------------+------+-------+----------+----------------+

-- cluster by paths in variant columns
CREATE OR REPLACE TABLE T3 (t timestamp, v variant) cluster by (v:"Data":id::number);

SHOW TABLES LIKE 'T3';

+---------------------------------+------+---------------+-------------+-------+---------+-------------------------------------------+------+-------+--------+----------------+
|          created_on             | name | database_name | schema_name | kind  | comment |                cluster_by                 | rows | bytes | owner  | retention_time |
|---------------------------------+------+---------------+-------------+-------+---------+-------------------------------------------+------+-------+--------+----------------+
| Fri, 23 Jun 1967 00:00:00 -0700 | T3   | TESTDB        | TESTSCHEMA  | TABLE |         | LINEAR(TO_NUMBER(GET_PATH(V, 'Data.id'))) | 0    | 0     | PUBLIC | 1              |
|---------------------------------+------+---------------+-------------+-------+---------+-------------------------------------------+------+-------+--------+----------------+

Important Usage Notes

  • If you define two or more clustering keys for a table, the order of the keys has an impact on how the data is clustered in micro-partitions.

    For more details, see Strategies for Selecting Clustering Keys (in this topic).

  • Clustering keys are copied when a table is created using CREATE TABLE … CLONE.

  • Clustering keys are not propagated when a table is created using CREATE TABLE … LIKE.

  • Clustering keys are not supported when a table is created using CREATE TABLE … AS SELECT; however, you can define clustering keys after the table is created.

  • Clustering keys directly on top of VARIANT columns are not supported; however, you can specify clustering keys on VARIANT columns if you provide a path together with the target type.

Changing the Clustering Keys for a Table

At any time, you can add clustering keys to an existing table or change the existing clustering keys for a table using ALTER TABLE:

ALTER TABLE <name> CLUSTER BY ( <expr1> [ , <expr2> ... ] )

For example:

-- cluster by base columns
ALTER TABLE t1 CLUSTER BY (c1, c3);

SHOW TABLES LIKE 't1';

+---------------------------------+------+---------------+-------------+-------+---------+------------+------+-------+--------+----------------+
| created_on                      | name | database_name | schema_name | kind  | comment | cluster_by | rows | bytes | owner  | retention_time |
|---------------------------------+------+---------------+-------------+-------+---------+------------+------+-------+--------+----------------|
| Mon, 08 Aug 2016 15:22:07 -0700 | T1   | TESTDB        | TESTSCHEMA  | TABLE |         | (c1,c3)    |    0 |     0 | PUBLIC | 1              |
+---------------------------------+------+---------------+-------------+-------+---------+------------+------+-------+--------+----------------+

-- cluster by expressions
ALTER TABLE T2 CLUSTER BY (SUBSTRING(C2, 5, 15), TO_DATE(C1));

SHOW TABLES LIKE 't2';
+---------------------------------+------+---------------+-------------+-------+---------+-------------------------------------------+------+-------+--------+----------------+
|           created_on            | name | database_name | schema_name | kind  | comment |                cluster_by                 | rows | bytes | owner  | retention_time |
|---------------------------------+------+---------------+-------------+-------+---------+-------------------------------------------+------+-------+----------+--------------+
| Fri, 07 Apr 2017 09:46:18 -0700 | T2   | TESTDB        | TESTSCHEMA  | TABLE |         | LINEAR(SUBSTRING(C2, 5, 15), TO_DATE(C1)) | 1    | 1024  | PUBLIC | 1              |
+---------------------------------+------+---------------+-------------+-------+---------+-------------------------------------------+------+-------+--------+----------------+

-- cluster by paths in variant columns
ALTER TABLE T3 CLUSTER BY (v:"Data":name::string, v:"Data":id::number);

SHOW TABLES LIKE 'T3';
+---------------------------------+------+---------------+-------------+-------+---------+------------------------------------------------------------------------------+------+-------+--------+----------------+
|           created_on            | name | database_name | schema_name | kind  | comment |                                  cluster_by                                  | rows | bytes | owner  | retention_time |
|---------------------------------+------+---------------+-------------+-------+---------+------------------------------------------------------------------------------+------+-------+--------+----------------+
| Fri, 09 Jun 2017 11:29:49 -0700 | T3   | TESTDB        | TESTSCHEMA  | TABLE |         | LINEAR(TO_CHAR(GET_PATH(V, 'Data.name')), TO_NUMBER(GET_PATH(V, 'Data.id'))) | 0    | 0     | PUBLIC | 1              |
|---------------------------------+------+---------------+-------------+-------+---------+------------------------------------------------------------------------------+------+-------+--------+----------------+

Note

  • When adding clustering keys to a table already populated with data, not all expressions are allowed to be specified as clustering keys. You can check whether a specific function is supported using:

    show functions like 'function_name';

    The SHOW FUNCTIONS output includes a column, valid_for_clustering, at the end of the output. This column displays whether the function can be used in a clustering key for a populated table.

  • Changing the clustering keys for a table does not affect existing records. To reorganize existing records using the new keys, the table needs to be reclustered using ALTER TABLE … RECLUSTER.

Dropping the Clustering Keys for a Table

At any time, you can drop the clustering keys for a table using ALTER TABLE:

ALTER TABLE <name> DROP CLUSTERING KEY

For example:

ALTER TABLE t1 DROP CLUSTERING KEY;

SHOW TABLES LIKE 't1';

+---------------------------------+------+---------------+-------------+-------+---------+------------+------+-------+--------+----------------+
| created_on                      | name | database_name | schema_name | kind  | comment | cluster_by | rows | bytes | owner  | retention_time |
|---------------------------------+------+---------------+-------------+-------+---------+------------+------+-------+--------+----------------|
| Mon, 08 Aug 2016 15:22:07 -0700 | T1   | TESTDB        | TESTSCHEMA  | TABLE |         |            |    0 |     0 | PUBLIC | 1              |
+---------------------------------+------+---------------+-------------+-------+---------+------------+------+-------+--------+----------------+