Loading Data

This topic provides best practices, general guidelines, and important considerations for loading staged data.

In this Topic:

Options for Selecting Staged Data Files

The COPY command supports several options for loading data files from a stage:

  • By path (internal stages) / prefix (Amazon S3 bucket). See Organizing Data by Path for information.
  • Specifying a list of specific files to load.
  • Using pattern matching to identify specific files by pattern.

These options enable you to copy a fraction of the staged data into Snowflake with a single command. This allows you to execute concurrent COPY statements that match a subset of files, taking advantage of parallel operations.

Lists of Files

The COPY INTO <table> command includes a FILES parameter to load files by specific name.

Tip

Of the three options for identifying/specifying data files to load from a stage, providing a discrete list of files is generally the fastest; however, the FILES parameter supports a maximum of 1,000 files, meaning a COPY command executed with the FILES parameter can only load up to 1,000 files.

For example:

COPY INTO load1 FROM @%load1/data1/ FILES=('test1.csv', 'test2.csv', 'test3.csv')

File lists can be combined with paths for further control over data loading.

Pattern Matching

The COPY INTO <table> command includes a PATTERN parameter to load files using a regular expression.

For example:

COPY INTO people_data FROM @%people_data/data1/
   PATTERN='.*person_data[^[0-9]{1,3}$$].csv';

In general, pattern matching using a regular expression is generally the slowest of the three options for identifying/specifying data files to load from a stage; however, this option works well if you exported your files in named order from your external application and want to batch load the files in the same order.

Pattern matching can be combined with paths for further control over data loading.

Executing Parallel COPY Statements That Reference the Same Data Files

When a COPY statement is executed, Snowflake sets a load status in the table metadata for the data files referenced in the statement. This prevents parallel COPY statements from loading the same files into the table, avoiding data duplication.

When processing of the COPY statement is completed, Snowflake adjusts the load status for the data files as appropriate. If one or more data files fail to load, Snowflake sets the load status for those files as load failed. These files are available for a subsequent COPY statement to load.

Loading Older Files

This section describes how the COPY INTO <table> command prevents data duplication differently based on whether the load status for a file is known or unknown. If you partition your data in stages using logical, granular paths by date (as recommended in Organizing Data by Path) and load data within a short period of time after staging it, this section largely does not apply to you. However, if the COPY command skips older files (i.e. historical data files) in a data load, this section describes how to bypass the default behavior.

Load Metadata

Snowflake maintains detailed metadata for each table into which data is loaded, including:

  • Name of each file from which data was loaded
  • File size
  • ETag for the file
  • Number of rows parsed in the file
  • Timestamp of the last load for the file
  • Information about any errors encountered in the file during loading

This load metadata expires after 64 days. If the LAST_MODIFIED date for a staged data file is less than or equal to 64 days, the COPY command can determine its load status for a given table and prevent reloading (and data duplication). The LAST_MODIFIED date is the timestamp when the file was initially staged or when it was last modified, whichever is later.

If the LAST_MODIFIED date is older than 64 days, the load status is still known if either of the following events occurred less than or equal to 64 days prior to the current date:

  • The file was loaded successfully.
  • The initial set of data for the table (i.e. the first batch after the table was created) was loaded.

However, the COPY command cannot definitively determine whether a file has been loaded already if the LAST_MODIFIED date is older than 64 days and the initial set of data was loaded into the table more than 64 days earlier (and if the file was loaded into the table, that also occurred more than 64 days earlier). In this case, to prevent accidental reload, the command skips the file by default.

Workarounds

To load files whose metadata has expired, set the LOAD_UNCERTAIN_FILES copy option to true. The copy option references load metadata, if available, to avoid data duplication, but also attempts to load files with expired load metadata.

Alternatively, set the FORCE option to load all files, ignoring load metadata if it exists. Note that this option reloads files, potentially duplicating data in a table.

Examples

../_images/data-load-status1.png

In this example:

  • A table is created on January 1, and the initial table load occurs on the same day.
  • 64 days pass. On March 7, the load metadata expires.
  • A file is staged and loaded into the table on July 27 and 28, respectively. Because the file was staged one day prior to being loaded, the LAST_MODIFIED date was within 64 days. The load status was known. There are no data or formatting issues with the file, and the COPY command loads it successfully.
  • 64 days pass. On September 28, the LAST_MODIFIED date for the staged file exceeds 64 days. On September 29, the load metadata for the successful file load expires.
  • An attempt is made to reload the file into the same table on November 1. Because the COPY command cannot determine whether the file has been loaded already, the file is skipped. The LOAD_UNCERTAIN_FILES copy option (or the FORCE copy option) is required to load the file.
../_images/data-load-status2.png

In this example:

  • A file is staged on January 1.
  • 64 days pass. On March 7, the LAST_MODIFIED date for the staged file exceeds 64 days.
  • A new table is created on September 29, and the staged file is loaded into the table. Because the initial table load occurred less than 64 days prior, the COPY command can determine that the file had not been loaded already. There are no data or formatting issues with the file, and the COPY command loads it successfully.

Removing JSON “null” Values

In a VARIANT column, NULL values are stored as a string containing the word “null,” not the SQL NULL value. If the “null” values in your JSON documents indicate missing values and have no other special meaning, we recommend setting the file format option STRIP_NULL_VALUES to TRUE for the COPY INTO <table> command when loading the JSON files. Retaining the “null” values often wastes storage and slows query processing.