Preparing to Load Data

This topic provides an overview of supported data file formats and data compression. Depending on your data’s structure, you might need to prepare the data before loading it.

In this Topic:

Supported Data Types

See Data Types for descriptions of the data types supported by Snowflake.

Data File Compression

We recommend that you compress your data files when you are loading large data sets.

The following compression algorithms are supported:

  • gzip
  • bzip2
  • Brotli
  • Zstandard v0.8 and higher
  • deflate (with zlib header, RFC1950)
  • raw deflate (without header, RFC1951)

If you are loading data from a local file system, Snowflake automatically gzip-compresses your data files by default. The AUTO_COMPRESS option for the PUT command can be modified when you stage your files.

When loading compressed data, specify the compression method for your data files. The COMPRESSION file format option describes how your data files are already compressed in the stage. Set the COMPRESSION option in one of the following ways:

  • As a file format option specified directly in the COPY INTO <table> statement.
  • As a file format option specified for a named file format or stage object. The named file format/stage object can then be referenced in the COPY INTO <table> statement.

Supported File Formats

The following file formats are supported:

Structured/Semi-structured Type Notes
Structured Delimited (CSV, TSV, etc.) Any valid single-byte delimiter is supported; default is comma (i.e. CSV).
Semi-structured JSON  
  Avro Includes automatic detection and processing of Snappy-compressed Avro files.
  ORC Includes automatic detection and processing of Snappy- or zlib-compressed ORC files.
  Parquet Includes automatic detection and processing of Snappy-compressed Parquet files.
  XML Supported as a preview feature.

File format options specify the type of data contained in a file, as well as other related characteristics about the format of the data. The file format options you can specify are different depending on the type of data you plan to load. Snowflake provides a full set of file format option defaults.

Leading Spaces

If your external database software encloses fields in quotes but inserts a leading space, Snowflake reads the leading space rather than the opening quotation character as the beginning of the field. The quotation characters are interpreted as string data.

Use the TRIM_SPACE file format option to remove undesirable spaces during the data load.

For example, fields in the following CSV file include a leading space:

"value1", "value2", "value3"

The following COPY command trims the leading space and removes the quotation marks enclosing each field:

COPY INTO mytable
FROM @%mytable
FILE_FORMAT = (TYPE = CSV TRIM_SPACE=true FIELD_OPTIONALLY_ENCLOSED_BY = '0x22');

SELECT * FROM mytable;

+--------+--------+--------+
| col1   | col2   | col3   |
+--------+--------+--------+
| value1 | value2 | value3 |
+--------+--------+--------+

Semi-structured File Formats

Snowflake natively supports semi-structured data, which means semi-structured data can be loaded into relational tables without requiring the definition of a schema in advance. Snowflake supports loading semi-structured data directly into columns of type VARIANT (see Semi-structured Data Types for more details).

Currently supported semi-structured data formats include JSON, Avro, ORC, Parquet, or XML:

  • For JSON, Avro, ORC, and Parquet data, each top-level, complete object is loaded as a separate row in the table. Each object can contain new line characters and spaces as long as the object is valid.
  • For XML data, each top-level element is loaded as a separate row in the table. An element is identified by a start and close tag of the same name.

Typically, tables used to store semi-structured data consist of a single VARIANT column. Once the data is loaded, you can query the data similar to structured data. You can also perform other tasks, such as extracting values and objects from arrays. For more information, see the FLATTEN table function.

Note

Semi-structured data can be loaded into tables with multiple columns, but the semi-structured data must be stored as a field in a structured file (e.g. CSV file). Then, the data can be loaded into a specified column in the table.

Named File Formats

Snowflake supports creating named file formats, which are database objects that encapsulate all of the required format information. Named file formats can then be used as input in all the same places where you can specify individual file format options, thereby helping to streamline the data loading process for similarly-formatted data.

Named file formats are optional, but are recommended when you plan to regularly load similarly-formatted data.

Creating a Named File Format

You can create a file format using either the web interface or SQL:

Web Interface:Click on Databases Databases tab » <db_name> » File Formats
SQL:CREATE FILE FORMAT

For descriptions of all file format options and the default values, see CREATE FILE FORMAT.

Overriding Default File Format Options

You can define the file format settings for your staged data (i.e. override the default settings) in any of the following locations:

In the table definition:
 Explicitly set the options using the FILE_FORMAT parameter. For more information, see CREATE TABLE.
In the named stage definition:
 Explicitly set the options using the FILE_FORMAT parameter. The stage is then referenced in the COPY INTO TABLE statement. For more information, see CREATE STAGE.
Directly in the COPY INTO TABLE statement when loading data:
 Explicitly set the options separately. For more information, see COPY INTO <table>.

If file format options are specified in multiple locations, the load operation applies the options in the following order of precedence:

  1. COPY INTO TABLE statement.
  2. Stage definition.
  3. Table definition.

Note

File format options set in multiple locations are not cumulative. Any options set in one place override all options (whether the same or different options) set lower in the order of precedence.

Supported Copy Options

Copy options determine the behavior of a data load with regard to error handling, maximum data size, and so on.

For descriptions of all copy options and the default values, see COPY INTO <table>.

Overriding Default Copy Options

You can specify the desired the load behavior (i.e. override the default settings) in any of the following locations:

In the table definition:
 Explicitly set the options using the STAGE_COPY_OPTIONS parameter. For more information, see CREATE TABLE.
In the named stage definition:
 Explicitly set the options using the COPY_OPTIONS parameter. The stage is then referenced in the COPY INTO TABLE statement. For more information, see CREATE STAGE.
Directly in the COPY INTO TABLE statement when loading data:
 Explicitly set the options separately. For more information, see COPY INTO <table>.

If copy options are specified in multiple locations, the load operation applies the options in the following order of precedence:

  1. COPY INTO TABLE statement.
  2. Stage definition.
  3. Table definition.

Note

Copy options set in multiple locations are cumulative. Individual options set in one place override the same option set lower in the order of precedence.