Preparing to Load Data

This topic provides an overview of supported data file formats and data compression. Depending on your data’s structure, you might need to prepare the data before loading it.

In this Topic:

Supported Data Types

See Data Types for descriptions of the data types supported by Snowflake.

Data File Compression

We recommend that you compress your data files when you are loading large data sets.

The following compression algorithms are supported:

  • gzip
  • bzip2
  • Brotli
  • Zstandard v0.8 and higher
  • deflate (with zlib header, RFC1950)
  • raw deflate (without header, RFC1951)

If you are loading data from a local file system, Snowflake automatically gzip-compresses your data files by default. The AUTO_COMPRESS option for the PUT command can be modified when you stage your files.

When loading compressed data, specify the compression method for your data files. The COMPRESSION file format option describes how your data files are already compressed in the stage. Set the COMPRESSION option in one of the following ways:

  • As a file format option specified directly in the COPY INTO table command.
  • As a file format option specified for a named file format or stage object. The named file format/stage object can then be referenced in the COPY INTO table command.

Supported File Formats

The following file formats are supported:

Structured/Semi-structured Type Notes
Structured Delimited (CSV, TSV, etc.) Any single-character delimiter is supported; default is comma (i.e. CSV).
Semi-structured JSON  
  Avro Includes automatic detection and processing of Snappy-compressed Avro files.
  ORC  
  Parquet  
  XML Supported as a preview feature.

File format options specify the type of data contained in a file, as well as other related characteristics about the format of the data. The file format options you can specify are different depending on the type of data you plan to load. Snowflake provides a full set of file format option defaults.

Semi-structured Data Loading

Snowflake natively supports semi-structured data, which means semi-structured data can be loaded into relational tables without requiring the definition of a schema in advance. Snowflake supports loading semi-structured data directly into columns of type VARIANT (see Semi-structured Data Types for more details).

Currently supported semi-structured data formats include JSON, Avro, ORC, Parquet, or XML:

  • For JSON, Avro, ORC, and Parquet data, each top-level, complete object is loaded as a separate row in the table. Each object can contain new line characters and spaces as long as the object is valid.
  • For XML data, each top-level element is loaded as a separate row in the table. An element is identified by a start and close tag of the same name.

Typically, tables used to store semi-structured data consist of a single VARIANT column. Once the data is loaded, you can query the data similar to structured data. You can also perform other tasks, such as extracting values and objects from arrays. For more information, see the FLATTEN table function.

Note

Semi-structured data can be loaded into tables with multiple columns, but the semi-structured data must be stored as a field in a structured file (e.g. CSV file). Then, the data can be loaded into a specified column in the table.

Specifying File Format Options

Individual file format options can be specified in any of the following places:

  • In the definition of a table.
  • In the definition of a named stage. For more information, see CREATE STAGE.
  • Directly in a COPY INTO table command when loading data.

In addition, to simplify data loading, Snowflake supports creating named file formats, which are database objects that encapsulate all of the required format information. Named file formats can then be used as input in all the same places where you can specify individual file format options, thereby helping to streamline the data loading process for similarly-formatted data.

Named file formats are optional, but are recommended when you plan to regularly load similarly-formatted data.

Creating a Named File Format

You can create a file format using either the web interface or SQL:

Web Interface:Click on Databases > db_name > File Formats
SQL Command:CREATE FILE FORMAT

For detailed descriptions of all the file format options, see CREATE FILE FORMAT.

Examples

The following example creates a named CSV file format with a specified field delimiter. When this file format is referenced, the COPY command skips the first line in the data files:

CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
  SKIP_HEADER = 1;

The following example creates a named JSON file format that instructs the JSON parser to remove outer brackets [ ]:

CREATE OR REPLACE FILE FORMAT my_json_format
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE;