Table Design Considerations

This topic provides best practices, general guidelines, and important considerations when designing and managing tables.

In this Topic:

  • Date/Time Data Types for Columns
  • Referential Integrity Constraints
  • When to Set a Clustering Key
  • When to Specify Column Lengths
  • Storing Semi-structured Data in a VARIANT Column vs. Flattening the Nested Structure
  • Converting a Permanent Table to a Transient Table

Date/Time Data Types for Columns

When defining columns to contain dates or timestamps, we recommend choosing a date or timestamp data type rather than a character data type. Snowflake stores DATE and TIMESTAMP data more efficiently than VARCHAR, resulting in better query performance. Choose an appropriate date or timestamp data type, depending on the level of granularity required.
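
For example, a minimal sketch (the table and column names are hypothetical) that stores event times in a native timestamp column rather than a character column:

create or replace table events (
  event_id   int,
  event_name varchar,
  -- a native timestamp type, not varchar, so values can be compared,
  -- truncated, and pruned efficiently
  event_ts   timestamp_ntz
);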

Referential Integrity Constraints

Referential integrity constraints in Snowflake are informational and, with the exception of NOT NULL, not enforced. Constraints other than NOT NULL are created as disabled.

However, constraints provide valuable metadata. The primary keys and foreign keys enable members of your project team to orient themselves to the schema design and familiarize themselves with how the tables relate to one another.

Additionally, most business intelligence (BI) and visualization tools import the foreign key definitions with the tables and build the proper join conditions. This approach saves you time and is potentially less error prone than having someone guess later how to join the tables and then manually configure the tool. Basing joins on the primary and foreign keys also helps ensure the integrity of the design, since the joins aren’t left to individual developers to interpret. Some BI and visualization tools also take advantage of constraint information to rewrite queries into more efficient forms, e.g. join elimination.

Specify a constraint when creating or modifying a table using the CREATE | ALTER TABLE … CONSTRAINT commands.

In the following example, the CREATE TABLE statement for the second table (salesorders) defines an out-of-line foreign key constraint that references a column in the first table (salespeople):

create or replace table salespeople (
  sp_id int not null unique,
  name varchar default null,
  region varchar,
  constraint pk_sp_id primary key (sp_id)
);
create or replace table salesorders (
  order_id int not null unique,
  quantity int default null,
  description varchar,
  sp_id int not null unique,
  constraint pk_order_id primary key (order_id),
  constraint fk_sp_id foreign key (sp_id)
  references salespeople(sp_id)
);

Query the GET_DDL function to retrieve a DDL statement that could be executed to recreate the specified table. The statement includes the constraints currently set on the table.

For example:

select get_ddl('table', 'mydatabase.public.salesorders');

+-----------------------------------------------------------------------------------------------------+
| GET_DDL('TABLE', 'MYDATABASE.PUBLIC.SALESORDERS')                                                   |
|-----------------------------------------------------------------------------------------------------|
| create or replace TABLE SALESORDERS (                                                               |
|   ORDER_ID NUMBER(38,0) NOT NULL,                                                                   |
|   QUANTITY NUMBER(38,0),                                                                            |
|   DESCRIPTION VARCHAR(16777216),                                                                    |
|   SP_ID NUMBER(38,0) NOT NULL,                                                                      |
|   unique (SP_ID),                                                                                   |
|   constraint PK_ORDER_ID primary key (ORDER_ID),                                                    |
|   constraint FK_SP_ID foreign key (SP_ID) references MYDATABASE.PUBLIC.SALESPEOPLE(SP_ID)          |
| );                                                                                                  |
+-----------------------------------------------------------------------------------------------------+

Alternatively, retrieve a list of all table constraints by schema (or across all schemas in a database) by querying the TABLE_CONSTRAINTS view in the Information Schema.

For example:

select table_name, constraint_type, constraint_name
  from mydatabase.information_schema.table_constraints
  where constraint_schema = 'PUBLIC'
  order by table_name;

+-------------+-----------------+-----------------------------------------------------+
| TABLE_NAME  | CONSTRAINT_TYPE | CONSTRAINT_NAME                                     |
|-------------+-----------------+-----------------------------------------------------|
| SALESORDERS | UNIQUE          | SYS_CONSTRAINT_fce2257e-c343-4e66-9bea-fc1c041b00a6 |
| SALESORDERS | FOREIGN KEY     | FK_SP_ID                                            |
| SALESORDERS | PRIMARY KEY     | PK_ORDER_ID                                         |
| SALESORDERS | UNIQUE          | SYS_CONSTRAINT_bf90e2b3-fd4a-4764-9576-88fb487fe989 |
| SALESPEOPLE | PRIMARY KEY     | PK_SP_ID                                            |
+-------------+-----------------+-----------------------------------------------------+

When to Set a Clustering Key

Specifying a clustering key is not necessary for most tables. Snowflake performs automatic tuning via the optimization engine and micro-partitioning. In many cases, data is loaded and organized into micro-partitions by date or timestamp, and is queried along the same dimension.

When should you specify a clustering key for a table? First, note that reclustering a small table typically doesn’t improve query performance significantly.

For larger data sets, you might consider specifying a clustering key and reclustering the table when:

  • The order in which the data is loaded does not match the dimension by which it is most commonly queried, e.g. the data is loaded by date but reports query the data by ID. If your existing scripts or reports query the data by both date and ID (and potentially a third or fourth column), you may see some performance improvement by creating a multi-column clustering key (see the example after this list).
  • Query Profile indicates that a significant percentage of the total duration time for typical queries against the table is spent scanning. This applies to queries that filter on one or more specific columns.
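
For example, a hypothetical table that is loaded by date but queried most often by both date and ID could be given a multi-column clustering key, either at creation time or afterward (the table and column names are illustrative):

-- define the clustering key when the table is created
create or replace table sales (
  order_date  date,
  customer_id int,
  amount      number(10,2)
) cluster by (order_date, customer_id);

-- or add / change the clustering key on an existing table
alter table sales cluster by (order_date, customer_id);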

Note that reclustering a table requires manually running a DML statement. Reclustering rewrites the existing data in a different order, and the previous ordering is retained for 7 days to provide Fail-safe protection. Reclustering a table incurs compute costs that correlate with the size of the data that is reordered.
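
For example, assuming the manual reclustering described above, the statement takes this form (the table name is illustrative):

-- manually recluster the table along its clustering key
alter table sales recluster;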

For more information, see Understanding Snowflake Table Structures.

When to Specify Column Lengths

Snowflake compresses column data effectively; therefore, creating columns larger than necessary has minimal impact on the size of data tables. Likewise, there is no query performance difference between a column with a maximum length declaration, e.g. VARCHAR(16777216), and one with a smaller length.

We do recommend defining an appropriate column length when the size of your column data is predictable, for the following reasons:

  • Data loading operations are more likely to catch issues such as columns loaded out of order, e.g. a 50-character string loaded erroneously into a VARCHAR(10) column. Such issues produce errors (see the sketch after this list).
  • When the column length is unspecified, some third-party tools may assume they must accommodate values of the maximum size, which can translate into increased client-side memory usage or unusual behavior.
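
For example, a minimal sketch (the table, column names, and sizes are hypothetical) that declares lengths matching predictable source data rather than defaulting to the maximum:

create or replace table customers (
  -- two-letter ISO country code; a longer value loaded here raises an error
  country_code varchar(2),
  -- postal codes in the source data never exceed 10 characters
  postal_code  varchar(10)
);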

Storing Semi-structured Data in a VARIANT Column vs. Flattening the Nested Structure

If you aren’t sure yet what types of operations you’ll perform on your semi-structured data, we recommend storing it in a VARIANT column for now. For data that is mostly regular and uses only native types (strings and integers), the storage requirements and query performance for operations on relational data and on data in a VARIANT column are very similar.

For better pruning and less storage consumption, we recommend flattening your object and key data into separate relational columns if your semi-structured data includes:

  • Dates and timestamps, especially non-ISO 8601 dates and timestamps, as string values
  • Numbers within strings
  • Arrays

Non-native values such as dates and timestamps are stored as strings when loaded into a VARIANT column, so operations on these values could be slower and also consume more space than when stored in a relational column with the corresponding data type.

If you know your use cases for the data, perform tests on a typical data set. Load the data set into a VARIANT column in a table. Use the FLATTEN function to extract the objects and keys you plan to query into a separate table. Run a typical set of queries against both tables to see which structure provides the best performance.
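
For example, a minimal sketch of such a test, assuming hypothetical JSON records with id, ts, and tags fields:

-- load the semi-structured data into a VARIANT column
create or replace table raw_events (v variant);

-- extract the objects and keys you plan to query into typed columns
create or replace table events_flat as
  select
    v:id::number        as id,
    v:ts::timestamp_ntz as ts,   -- string timestamp cast to a native type
    t.value::varchar    as tag   -- one row per element of the tags array
  from raw_events,
       lateral flatten(input => v:tags) t;

-- run a typical set of queries against raw_events and events_flat
-- and compare the query profiles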

Converting a Permanent Table to a Transient Table

Currently, it isn’t possible to change a permanent table to a transient table using the ALTER TABLE command. The TRANSIENT property is set at table creation and cannot be modified.

To convert an existing table, use a CREATE TABLE AS SELECT (CTAS) statement to create a new transient table that copies the data from the permanent table. You can drop the permanent table after you have confirmed the transient table meets your needs.

For example, create a transient table named MY_TRANSIENT_TABLE that copies all rows from the permanent MYTABLE table:

CREATE TRANSIENT TABLE my_transient_table AS SELECT * FROM mytable;
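
After you have verified the copy, e.g. by comparing row counts between the two tables, drop the original permanent table:

DROP TABLE mytable;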