Preparing to Load Data¶
This topic provides an overview of supported data file formats and data compression. Depending on your data’s structure, you might need to prepare the data before loading it.
In this Topic:
Data File Compression¶
We recommend that you compress your data files when you are loading large data sets.
The following compression algorithms are supported:
- Zstandard v0.8 and higher
- deflate (with zlib header, RFC1950)
- raw deflate (without header, RFC1951)
If you are loading data from a local file system, Snowflake automatically gzip-compresses your data files by default. The AUTO_COMPRESS option for the PUT command can be modified when you stage your files.
When loading compressed data, specify the compression method for your data files. The COMPRESSION file format option describes how your data files are already compressed in the stage. Set the COMPRESSION option in one of the following ways:
- As a file format option specified directly in the COPY INTO table statement.
- As a file format option specified for a named file format or stage object. The named file format/stage object can then be referenced in the COPY INTO table statement.
Supported File Formats¶
The following file formats are supported:
Structured/Semi-structured Type Notes Structured Delimited (CSV, TSV, etc.) Any single-character delimiter is supported; default is comma (i.e. CSV). Semi-structured JSON Avro Includes automatic detection and processing of Snappy-compressed Avro files. ORC Parquet XML Supported as a preview feature.
File format options specify the type of data contained in a file, as well as other related characteristics about the format of the data. The file format options you can specify are different depending on the type of data you plan to load. Snowflake provides a full set of file format option defaults.
Semi-structured Data Loading¶
Snowflake natively supports semi-structured data, which means semi-structured data can be loaded into relational tables without requiring the definition of a schema in advance. Snowflake supports loading semi-structured data directly into columns of type VARIANT (see Semi-structured Data Types for more details).
Currently supported semi-structured data formats include JSON, Avro, ORC, Parquet, or XML:
- For JSON, Avro, ORC, and Parquet data, each top-level, complete object is loaded as a separate row in the table. Each object can contain new line characters and spaces as long as the object is valid.
- For XML data, each top-level element is loaded as a separate row in the table. An element is identified by a start and close tag of the same name.
Typically, tables used to store semi-structured data consist of a single VARIANT column. Once the data is loaded, you can query the data similar to structured data. You can also perform other tasks, such as extracting values and objects from arrays. For more information, see the FLATTEN table function.
Semi-structured data can be loaded into tables with multiple columns, but the semi-structured data must be stored as a field in a structured file (e.g. CSV file). Then, the data can be loaded into a specified column in the table.
Specifying File Format Options¶
Individual file format options can be specified in any of the following places:
- In the definition of a table.
- In the definition of a named stage. For more information, see CREATE STAGE.
- Directly in a COPY INTO table command when loading data.
In addition, to simplify data loading, Snowflake supports creating named file formats, which are database objects that encapsulate all of the required format information. Named file formats can then be used as input in all the same places where you can specify individual file format options, thereby helping to streamline the data loading process for similarly-formatted data.
Named file formats are optional, but are recommended when you plan to regularly load similarly-formatted data.
Creating a Named File Format¶
You can create a file format using either the web interface or SQL:
Web Interface: Click on Databases » db_name » File Formats SQL Command: CREATE FILE FORMAT
For detailed descriptions of all the file format options, see CREATE FILE FORMAT.
The following example creates a named CSV file format with a specified field delimiter. When this file format is referenced, the COPY command skips the first line in the data files:
CREATE OR REPLACE FILE FORMAT my_csv_format TYPE = 'CSV' FIELD_DELIMITER = '|' SKIP_HEADER = 1;
The following example creates a named JSON file format that instructs the JSON parser to remove outer brackets
CREATE OR REPLACE FILE FORMAT my_json_format TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE;