Data Loading Considerations

This topic provides best practices, general guidelines, and important considerations for data loading. It is intended to help simplify and optimize bulk importing of data from external files into Snowflake tables using the PUT and COPY INTO table commands.

Preparing Your Data Files

Preparing Delimited Text Files

Consider the following guidelines when preparing your delimited text (CSV) files for loading:

  • Files must contain ASCII data encoded as UTF-8.
  • Files should contain records delimited by a single character and fields delimited by a different single character. Snowflake supports ASCII characters (including high-order characters) as delimiters. Common field delimiters include the pipe (|), comma (,), caret (^), and tilde (~). The line feed (\n) character is a common row delimiter choice.
  • Fields that contain delimiter characters should be enclosed in quotes (single or double). If the data contains single or double quotes, then those quotes must be escaped.
  • Carriage returns should not appear in the data. These are commonly introduced on Windows systems in conjunction with a line feed character to mark the end of a line (\r \n).
  • The number of columns in each row should be consistent.

Numeric Data Guidelines

  • Avoid embedded characters, such as commas (e.g., 123,456).
  • If a number includes a fractional component, it should be separated from the whole number portion by a decimal point (e.g., 123456.789).
  • Oracle only. The Oracle NUMBER or NUMERIC types allow arbitrary scale, meaning they accept values with decimal components even if the data type was not defined with a precision or scale. In Snowflake, by contrast, columns designed for values with decimal components must be defined with a scale to preserve the decimal portion.

Date and Timestamp Data Guidelines

  • Date, time, and timestamp data should be formatted based on the following components:

    Format                  Description
    YYYY                    Four-digit year.
    YY                      Two-digit year, controlled by the TWO_DIGIT_CENTURY_START session parameter, e.g. when the parameter is set to 1980, values of 79 and 80 are parsed as 2079 and 1980, respectively.
    MM                      Two-digit month (01 = January, etc.).
    MON                     Full or abbreviated month name.
    DD                      Two-digit day of month (01 through 31).
    DY                      Abbreviated day of week.
    HH24                    Two digits for hour (00 through 23); am/pm NOT allowed.
    HH12                    Two digits for hour (01 through 12); am/pm allowed.
    AM, PM                  Ante meridiem (am) / post meridiem (pm); for use with HH12.
    MI                      Two digits for minute (00 through 59).
    SS                      Two digits for second (00 through 59).
    FF                      Fractional seconds with precision 0 (seconds) to 9 (nanoseconds), e.g. FF, FF0, FF3, FF9. Specifying FF is equivalent to FF6 (microseconds).
    TZH:TZM, TZHTZM, TZH    Time zone hour and minute (can be prefixed by + or - for sign; Z denotes UTC).
  • Oracle only. The Oracle DATE data type can contain date or timestamp information. If your Oracle database includes DATE columns that also store time-related information, map these columns to a TIMESTAMP data type in Snowflake rather than DATE.

Note

Snowflake checks temporal data values at load time. Invalid date, time, and timestamp values (e.g., 0000-00-00) produce an error.
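
For example, a minimal sketch of a COPY statement that declares the date and timestamp formats used in a staged CSV file, built from the components above (the table name mytable and file events.csv are illustrative assumptions):

COPY INTO mytable
FROM @%mytable/events.csv
FILE_FORMAT = (TYPE = 'CSV'
               DATE_FORMAT = 'YYYY-MM-DD'                      -- e.g. 2016-07-01
               TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI:SS.FF3'  -- e.g. 2016-07-01 14:30:00.123
              );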

Splitting Large CSV Data Files Before Loading

The number of COPY INTO table operations that run in parallel cannot exceed the number of data files to be loaded. To increase the number of parallel operations, we recommend splitting large data sets into files roughly 10-100 MB in size (compressed). Splitting large files into a greater number of smaller files distributes the load among the servers in an active warehouse, thereby increasing performance. The number of data files that are processed in parallel is determined by the number and capacity of servers in a warehouse.

If your source database does not allow you to export data files in smaller chunks, you can use a third-party utility to split large CSV files.

Linux/Unix or macOS

The split utility enables you to split a CSV file into multiple smaller files.

Syntax:

split [-a suffix_length] [-b byte_count[k|m]] [-l line_count] [-p pattern] [file [name]]

For more information, type man split in a terminal window.

Example:

split -b 10m pagecounts-20151201.csv pages

This example splits a file named pagecounts-20151201.csv into files 10 MB in size. The split files are named pages<suffix>.

Windows

Windows does not include a native file split utility; however, Windows supports many third-party tools and scripts that can split large data files.

JSON Data Size Limitations

Limit the size of JSON objects to less than 16 MB (ideally, 8 MB). The VARIANT data type imposes a size limit that depends on the depth and complexity of the JSON object.

In general, JSON data sets are a simple concatenation of multiple JSON documents. There is no need to separate the documents with line breaks or commas, though both are supported.

The JSON output from some software is composed of a single huge array containing multiple records. We recommend enabling the STRIP_OUTER_ARRAY file format option for the COPY INTO table command to remove the outer array structure and load the records into separate table rows:

COPY INTO <table>
FROM @~/<file>.json
FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = true);

Parquet Data Size Limitations

Currently, data loads of large Parquet files (e.g. greater than 3 GB) could time out. Split large files into files 1 GB in size or smaller for loading.

Planning a Data Load

Using COPY or INSERT Command

Use the COPY command for bulk data loads instead of a series of INSERT statements. The COPY command is optimized for loading large numbers of rows and can load files in parallel (see Splitting Large CSV Data Files Before Loading in this topic), so it loads large amounts of data much more efficiently than INSERT statements.

Dedicating Separate Warehouses to Load and Query Operations

Loading large data sets can affect query performance. Dedicate separate warehouses for loading and querying operations to optimize performance for each.

Data loading requires minimal resources. For cost effectiveness, a standard virtual warehouse is generally sufficient to load data. Choose the size of that standard warehouse based on how fast you want the loading task to complete. As mentioned in Splitting Large CSV Data Files Before Loading, splitting larger data files allows the load to scale linearly.
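
For example, a minimal sketch of creating a warehouse dedicated to loading (the name load_wh and the settings shown are illustrative assumptions; choose a size based on how quickly the load needs to complete):

CREATE WAREHOUSE load_wh WITH
  WAREHOUSE_SIZE = 'MEDIUM'  -- sized for the loading workload only
  AUTO_SUSPEND = 300         -- suspend after 5 minutes of inactivity
  AUTO_RESUME = TRUE;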

Choosing a Staging Location

Consider the best location for staging specific data files. Each option includes benefits and potential drawbacks.

User Staging Location

Each user has an internal Snowflake staging location allocated to them by default for storing files. This location is a convenient option if your files do not need to be accessible to multiple users, but need to be copied into multiple tables.
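
For example, a minimal sketch of staging a local file in your user staging location and copying it into two tables (the file and table names are assumptions):

-- Stage the file under a path in the user staging location
PUT file:///data/mydata.csv @~/staged/;

-- Copy the staged data into multiple tables
COPY INTO table1 FROM @~/staged/ FILE_FORMAT = (TYPE = 'CSV');
COPY INTO table2 FROM @~/staged/ FILE_FORMAT = (TYPE = 'CSV');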

This option is not appropriate if:

  • Multiple users require access to the files.
  • The user does not have INSERT privileges on the tables the data will be loaded into.

Table Staging Location

Each table has an internal Snowflake staging location allocated to it by default for storing files. This default table staging location is a convenient option if your files need to be accessible to multiple users and only need to be copied into a single table.

Note that a table staging location is not a separate database object; rather, it is an implicit stage tied to the table itself. A table staging location has no grantable privileges of its own. To stage files to a table staging location, you must have OWNERSHIP of the table itself.

This option is not appropriate if you wish to copy the data in the files into multiple tables.

Internal Named Stage Object

Internal stages provide the greatest amount of flexibility. Each internal stage is automatically allocated its own staging location. Users with the appropriate privileges can load data from an internal stage into any table. Ownership of the stage can be transferred to another role, and the privileges granted on the stage can be modified to add or remove roles. When you create a stage, you must explicitly grant privileges on the stage to one or more roles before those roles can use the stage. If you plan to stage data files that will be loaded only by you, or loaded only into a single table, then you may prefer to simply use either the user or table staging location.
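
For example, a minimal sketch of creating an internal named stage, granting access on it to another role, and loading from it (the stage, role, and table names are assumptions, and the privileges you grant may differ):

CREATE STAGE my_internal_stage;

-- Allow another role to read from and write to the stage
GRANT READ, WRITE ON STAGE my_internal_stage TO ROLE loader_role;

-- Stage a local file, then load it into a table
PUT file:///data/mydata.csv @my_internal_stage;
COPY INTO mytable FROM @my_internal_stage FILE_FORMAT = (TYPE = 'CSV');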

External Named Stage Object

Similar to internal stages, external stages provide a high degree of flexibility and are an extremely convenient option if you already have existing AWS S3 buckets that you use to store data files. By creating an external stage that references one of your S3 buckets, any user with privileges on the stage can use it to copy data from files in the bucket directly into any table.

Staging Data

Organizing Data by Path

The staging location in the COPY INTO table command can include a path (known as a prefix in AWS terminology). When staging regular data sets to the same location, we recommend partitioning the data into logical paths that include identifying details such as geographical location or other source identifiers, along with the date when the data was written. By staging your data by path, you can then copy any fraction of the partitioned data into Snowflake with a single command.

For example, if you were storing data for a North American company by location, you might include identifiers such as continent, country, and city in paths along with data write dates:

  • Canada/Ontario/Toronto/2016/07/10/05/
  • United_States/California/Los_Angeles/2016/06/01/11/
  • United_States/New_York/New_York/2016/12/21/03/
  • United_States/California/San_Francisco/2016/08/03/17/

When you create a named stage, you can specify any part of a path. For example, to create an external stage using one of the example paths:

CREATE STAGE my_stage URL='s3://mybucket/United_States/California/Los_Angeles/' CREDENTIALS=(AWS_KEY_ID='1a2b3c' AWS_SECRET_KEY='4x5y6z');

You can also add a path when you stage files in an internal staging location without a named stage. For example, stage mydata.csv in a specific path in the t1 table staging location:

PUT file:///data/mydata.csv @%t1/United_States/California/Los_Angeles/2016/06/01/11/

When loading your staged data, narrow the path to the most granular level that includes your data for improved data load performance. You can add the FILES (see Lists of Files in this topic) or PATTERN (see Pattern Matching in this topic) options to further confine the list of files to load:

COPY INTO t1 from @%t1/United_States/California/Los_Angeles/2016/06/01/11/
  FILES=('mydata1.csv', 'mydata2.csv');

COPY INTO t1 from @%t1/United_States/California/Los_Angeles/2016/06/01/11/
  PATTERN='.*mydata[^[0-9]{1,3}$$].csv';

Loading Data

Options for Selecting Staged Data Files

Snowflake has several different options for loading data files from a staging location:

  • By path (internal staging locations) / prefix (Amazon S3 bucket). See Organizing Data by Path in this topic.
  • Specifying a list of specific files to load.
  • Using pattern matching to identify specific files by pattern.

Lists of Files

The COPY INTO table command includes a FILES parameter to load files by specific name.

Tip

Of the three options for identifying/specifying data files to load from a staging location, providing a discrete list of files is generally the fastest; however, the FILES parameter supports a maximum of 1,000 files per COPY command.

For example:

COPY INTO load1 FROM @%load1/data1/ FILES=('test1.csv', 'test2.csv', 'test3.csv')

File lists can be combined with paths for further control over data loading.

Pattern Matching

The COPY INTO table command includes a PATTERN parameter to load files using a regular expression.

For example:

COPY INTO people_data FROM @%people_data/data1/
   PATTERN='.*person_data[^[0-9]{1,3}$$].csv';

Pattern matching using a regular expression is generally the slowest of the three options for identifying/specifying data files to load from a staging location; however, this option works well if you exported your files in named order from your external application and want to batch load the files in the same order.

Pattern matching can be combined with paths for further control over data loading.

Using File Formats

Non-default CSV Delimiters

If your delimited text (CSV) data files use a field delimiter other than a comma (,), you must provide either a named file format or the specific format options in the COPY INTO table command (see the example after the following list). In addition:

  • Verify that the non-default delimiter character is escaped in data strings.
  • Indicate the escape character in the file format or directly in the COPY command.
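
For example, a minimal sketch of loading pipe-delimited files that use a backslash as the escape character (the stage and table names are assumptions):

COPY INTO mytable
FROM @my_stage
FILE_FORMAT = (TYPE = 'CSV'
               FIELD_DELIMITER = '|'  -- non-default delimiter
               ESCAPE = '\\');        -- escape character used in the data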

Leading Spaces

If your external database software encloses fields in quotes but inserts a leading space, Snowflake reads the leading space rather than the opening quotation character as the beginning of the field. The quotation characters are interpreted as string data.

Use the TRIM_SPACE file format option to remove undesirable spaces during the data load.

For example, fields in the following CSV file include a leading space:

"value1", "value2", "value3"

The following COPY command trims the leading space and removes the quotation marks enclosing each field:

COPY INTO mytable
FROM @%mytable
FILE_FORMAT = (TYPE = CSV TRIM_SPACE=true FIELD_OPTIONALLY_ENCLOSED_BY = '0x22');

SELECT * FROM mytable;

+--------+--------+--------+
| col1   | col2   | col3   |
+--------+--------+--------+
| value1 | value2 | value3 |
+--------+--------+--------+

Removing JSON “null” Values

When converted to a string, JSON “null” values become a string containing the word “null,” not the SQL NULL value. This behavior allows distinguishing “null” values from values that are absent, which produce a SQL NULL.

If the “null” values in your JSON documents indicate missing values and have no other special meaning, we recommend setting the file format option STRIP_NULL_VALUES to TRUE for the COPY INTO table command when loading the JSON files. Retaining the “null” values often wastes storage and slows query processing.
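
For example, a minimal sketch that drops JSON "null" values during the load (the stage, file, and table names are assumptions):

COPY INTO mytable
FROM @my_stage/mydata.json
FILE_FORMAT = (TYPE = 'JSON' STRIP_NULL_VALUES = TRUE);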

Limiting Concurrent DML for a Table

When a DML statement such as the COPY INTO table command is issued on a table, a lock is acquired on the target table. When multiple DML statements are issued against the same table, only one of the statements executes at a time, while the others are queued.

The maximum number of queued statements for a single table is 8. If you exceed this limit, the following error occurs:

Exceeded limit of the number of statements that can be queued on a target resource.

If you load data using a script, the script should take this maximum into account and enforce limits on COPY commands that target a single table.

Activities Following a Data Load

Deleting Data Files After Loading

When the data in a staged file has been loaded successfully, Snowflake maintains metadata about the status of the file to prevent loading the file again in subsequent loads. However, to prevent confusion or errors, we suggest deleting staged files after they’ve loaded successfully.

You can delete loaded files in two ways (see the examples after this list):

  • As part of the load, by specifying the PURGE parameter in the COPY INTO table command.
  • After the COPY command completes, by using the REMOVE command.
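
For example, assuming files staged in the mytable table staging location, either of the following sketches removes the files:

-- Delete the staged files as part of the load
COPY INTO mytable FROM @%mytable PURGE = TRUE;

-- Or load first, then remove the staged files after the load succeeds
COPY INTO mytable FROM @%mytable;
REMOVE @%mytable;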

Removing files ensures they aren’t inadvertently loaded again. It also improves load performance, because it reduces the number of files that COPY statements must scan to verify whether existing files in a staging location were loaded already.

Changing the Sort Key

Snowflake loads data into tables in the order presented in the data files. In general, queries leverage the natural order of data by timestamp to build optimal execution plans. In some situations, ordering the data by a specific column may result in better performance than the natural ordering of the data in the source file.

For example, consider a set of data collected from a stock trading system that logs individual trades throughout the day. The system database records trades in the order they were generated. However, the analysts always query the data by ticker rather than timestamp. Sorting the data in tables by ticker and then timestamp, rather than by timestamp alone, would improve performance for these queries.

To sort the data during the loading process:

  1. Create a transient staging table, if one does not exist. The schema should match the schema of your production table.

  2. Stage your data files and copy them into the staging table (a sketch of steps 1 and 2 follows these steps).

  3. Move the data to the production table. For example:

    INSERT INTO <production_table> SELECT * FROM <staging_table> ORDER BY <ticker>, <timestamp>;
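
For example, a minimal sketch of steps 1 and 2 (step 3 is shown above), assuming a production table named trades and a named stage trades_stage (both names are assumptions):

-- Step 1: transient staging table with the same schema as the production table
CREATE TRANSIENT TABLE trades_staging LIKE trades;

-- Step 2: copy the staged data files into the staging table
COPY INTO trades_staging FROM @trades_stage FILE_FORMAT = (TYPE = 'CSV');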
    

Managing Regular Data Loads

When planning regular data loads such as ETL (Extract, Transform, Load) processes or regular imports of machine-generated data, it’s important to partition the data in your Snowflake staging location or S3 bucket using logical, granular paths. Create a partitioning structure that includes identifying details such as application or location, along with the date when the data was written. You can then copy any fraction of the partitioned data into Snowflake with a single command. You can copy data into Snowflake by the hour, day, month, or even year when you initially populate tables.

S3 bucket examples:

  • s3://bucket_name/application_one/2016/07/01/11/
  • s3://bucket_name/application_two/location_one/2016/07/01/14/

You can also add a path when you stage files in an internal staging location. For example:

PUT file:///tmp/file_20160701.11*.csv @my_stage/application_one/location_one/2016/07/01/11/;

Where:

application_one, application_two, location_one, etc.

Identifying details for the source of all data in the path. The data can be organized by the date when it was written. An optional 24-hour directory reduces the amount of data in each directory.

Note

S3 transmits a directory list with each COPY statement used by Snowflake, so reducing the number of files in each directory improves the performance of your COPY statements. You may even consider creating subfolders of 10-15 minute increments within the folders for each hour.

Loading Staged Data

Load organized data files into Snowflake tables by specifying the precise path to the staged files. For more information, see Organizing Data by Path in this topic.
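
For example, a sketch that loads a single hour of data from one of the example paths above (the table name is an assumption):

COPY INTO application_one_events
FROM @my_stage/application_one/location_one/2016/07/01/11/
FILE_FORMAT = (TYPE = 'CSV');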

Removing Loaded Data Files

When data from staged files is loaded successfully, consider removing the staged files to ensure the data isn’t inadvertently loaded again (duplicated). For more information, see Deleting Data Files After Loading.