Transforming Data During a Load

Snowflake supports transforming data while loading it into a table using the COPY INTO <table> command, dramatically simplifying your ETL pipeline for basic transformations. This feature helps you avoid the use of temporary tables to store pre-transformed data when reordering columns during a data load.

The COPY command supports:

  • Column reordering, column omission, and casts using a SELECT statement. There is no requirement for your data files to have the same number and ordering of columns as your target table.
  • The ENFORCE_LENGTH | TRUNCATECOLUMNS option, which can truncate text strings that exceed the target column length (see the sketch following this list).
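
For example, the following minimal sketch (assuming a hypothetical table mytable whose second column is VARCHAR(10), and a hypothetical stage mystage) truncates any staged value that exceeds the column length instead of failing the load:

-- Hypothetical table and stage; TRUNCATECOLUMNS = TRUE is equivalent to ENFORCE_LENGTH = FALSE
copy into mytable(id, notes)
  from (select t.$1, t.$2 from @mystage/data.csv.gz t)
  file_format = (type = csv)
  truncatecolumns = true;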

For general information about querying staged data files, see Querying Staged Data.

Usage Notes

This section provides usage information for transforming staged data files during a load.

Supported File Formats

To parse a staged data file, you must describe its file format:

CSV:

The default format is character-delimited UTF-8 text. The default field delimiter is a comma character (,). The default record delimiter is the new line character. If the source data is in another format, specify the file format type and options.

When querying staged data files, the ERROR_ON_COLUMN_COUNT_MISMATCH option is ignored. There is no requirement for your data files to have the same number and ordering of columns as your target table.

JSON:

To transform JSON data during a load operation, you must structure the data files in NDJSON (“Newline delimited JSON”) standard format; otherwise, you might encounter the following error:

Error parsing JSON: more than one document in the input
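
In NDJSON, each line is a complete, standalone JSON document, with no enclosing array and no trailing commas. An illustrative file layout (contents hypothetical):

{"id": 1, "product": "snowflake"}
{"id": 2, "product": "warehouse"}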

All file format types:

Specify the format type and options that match your data files.

To explicitly specify file format options, set them in one of the following ways (see the sketch following this list):

Querying staged data files using a SELECT statement:

  • As file format options specified for a named file format or stage object. The named file format/stage object can then be referenced in the SELECT statement.

Loading columns from staged data files using a COPY INTO <table> statement:

  • As file format options specified directly in the COPY INTO <table> statement.
  • As file format options specified for a named file format or stage object. The named file format/stage object can then be referenced in the COPY INTO <table> statement.
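
For example, the following two hypothetical statements load the same pipe-delimited file; the first specifies the options inline, while the second references a named file format object (my_csv_format is assumed to already exist with matching options):

-- Specify file format options directly in the COPY statement
copy into mytable(col1, col2)
  from (select t.$1, t.$2 from @mystage/data.csv.gz t)
  file_format = (type = csv field_delimiter = '|' skip_header = 1);

-- Reference a named file format object instead
copy into mytable(col1, col2)
  from (select t.$1, t.$2 from @mystage/data.csv.gz t)
  file_format = (format_name = my_csv_format);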

VALIDATION_MODE Parameter

The VALIDATION_MODE parameter does not support COPY statements that transform data during a load.
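
For example, in the following sketch (assuming mytable and @mystage exist), the first statement validates staged files without loading them, while the second combines VALIDATION_MODE with a transformation and therefore returns an error:

-- Supported: validate staged files using a COPY statement with no transformation
copy into mytable
  from @mystage/data.csv.gz
  validation_mode = return_errors;

-- Not supported: VALIDATION_MODE combined with a transformation query; this statement fails
copy into mytable(col1)
  from (select t.$1 from @mystage/data.csv.gz t)
  validation_mode = return_errors;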

CURRENT_TIME, CURRENT_TIMESTAMP Default Column Values

When loading data into a table that captures the load time in a column with a default value of either CURRENT_TIME() or CURRENT_TIMESTAMP(), all rows loaded using a specific COPY statement have the same timestamp value. The value records the time that the COPY statement started.

For example:

create or replace table mytable (
  c1 timestamp DEFAULT current_timestamp(),
  c2 number
);

copy into mytable(c2)
  from (select t.$1 from @mystage/myfile.csv.gz t);

+-------------------------------+----------+
| C1                            | C2       |
|-------------------------------+----------|
| 2018-09-05 08:58:28.718 -0700 | 1        |
..
| 2018-09-05 08:58:28.718 -0700 | 500      |
+-------------------------------+----------+

Transforming CSV Data

Load a Subset of Table Data

Load a subset of data into a table. For any missing columns, Snowflake inserts the default values. The following example loads data from columns 1, 2, 6, and 7 of a staged CSV file:

copy into home_sales(city, zip, sale_date, price)
   from (select t.$1, t.$2, t.$6, t.$7 from @mystage/sales.csv.gz t)
   FILE_FORMAT = (FORMAT_NAME = mycsvformat);

Reorder CSV Columns During a Load

The following example reorders the column data from a staged CSV file before loading it into a table. Additionally, the COPY statement uses the SUBSTR / SUBSTRING function to remove the first few characters of a string before inserting it:

copy into home_sales(city, zip, sale_date, price)
   from (select SUBSTR(t.$2,4), t.$1, t.$5, t.$4 from @mystage t)
   FILE_FORMAT = (FORMAT_NAME = mycsvformat);

Convert Data Types During a Load

Convert staged data into other data types during a data load. All conversion functions are supported.

For example, convert strings to binary values, decimals, or timestamps using the TO_BINARY, TO_DECIMAL / TO_NUMBER / TO_NUMERIC, and TO_TIMESTAMP / TO_TIMESTAMP_* functions, respectively.

Sample CSV file:

snowflake,2.8,2016-10-5
warehouse,-12.3,2017-01-23

SQL statements:

-- Stage a data file in the internal user stage
PUT file:///tmp/datafile.csv @~;

-- Query the staged data file
select t.$1,t.$2,t.$3 from @~/datafile.csv.gz t;

-- Create the target table
create or replace table casttb (
  col1 binary,
  col2 decimal,
  col3 timestamp_ntz
  );

-- Convert the staged CSV column data to the specified data types before loading it into the destination table
copy into casttb(col1, col2, col3)
from (
  select to_binary(t.$1, 'utf-8'),to_decimal(t.$2, '99.9', 9, 5),to_timestamp_ntz(t.$3)
  from @~/datafile.csv.gz t
)
file_format = (type = csv);

-- Query the target table
select * from casttb;

+--------------------+------+-------------------------+
| COL1               | COL2 | COL3                    |
|--------------------+------+-------------------------|
| 736E6F77666C616B65 |    3 | 2016-10-05 00:00:00.000 |
| 77617265686F757365 |  -12 | 2017-01-23 00:00:00.000 |
+--------------------+------+-------------------------+

Include Sequence Columns in Loaded Data

Create a sequence object using CREATE SEQUENCE. When loading data into a table using the COPY command, access the object using a nextval expression to sequence the data in a target number column. For more information about using sequences in queries, see Using Sequences.

-- Create a sequence
create sequence seq;

-- Create the target table
create or replace table mytable (
  col1 number default seq.nextval,
  col2 varchar,
  col3 varchar
  );

-- Stage a data file in the internal user stage
PUT file:///tmp/myfile.csv @~;

-- Query the staged data file
select $1, $2 from @~/myfile.csv.gz t;

+-----+-----+
| $1  | $2  |
|-----+-----|
| abc | def |
| ghi | jkl |
| mno | pqr |
| stu | vwx |
+-----+-----+

-- Include the sequence nextval expression in the COPY statement
copy into mytable (col1, col2, col3)
from (
  select seq.nextval, $1, $2
  from @~/myfile.csv.gz t
)
;

select * from mytable;

+------+------+------+
| COL1 | COL2 | COL3 |
|------+------+------|
|    1 | abc  | def  |
|    2 | ghi  | jkl  |
|    3 | mno  | pqr  |
|    4 | stu  | vwx  |
+------+------+------+

Include AUTOINCREMENT / IDENTITY Columns in Loaded Data

Set the AUTOINCREMENT or IDENTITY default value for a number column. When loading data into a table using the COPY command, omit the column in the SELECT statement. The statement automatically populates the column.

-- Create the target table
create or replace table mytable (
  col1 number autoincrement start 1 increment 1,
  col2 varchar,
  col3 varchar
  );

-- Stage a data file in the internal user stage
PUT file:///tmp/myfile.csv @~;

-- Query the staged data file
select $1, $2 from @~/myfile.csv.gz t;

+-----+-----+
| $1  | $2  |
|-----+-----|
| abc | def |
| ghi | jkl |
| mno | pqr |
| stu | vwx |
+-----+-----+

-- Omit the autoincrement column in the COPY statement
copy into mytable (col2, col3)
from (
  select $1, $2
  from @~/myfile.csv.gz t
)
;

select * from mytable;

+------+------+------+
| COL1 | COL2 | COL3 |
|------+------+------|
|    1 | abc  | def  |
|    2 | ghi  | jkl  |
|    3 | mno  | pqr  |
|    4 | stu  | vwx  |
+------+------+------+

Transforming Semi-structured Data

Load JSON Data into Separate Columns

The following example loads repeating elements from a staged JSON file into separate table columns with different data types.

In this example, we are loading the following JSON data into separate columns in a relational table, with the location object values loaded into a VARIANT column and the remaining values loaded into relational columns:

-- Sample data:
{"location": {"city": "Lexington","zip": "40503"},"dimensions": {"sq_ft": "1000"},"type": "Residential","sale_date": "4-25-16","price": "75836"},
{"location": {"city": "Belmont","zip": "02478"},"dimensions": {"sq_ft": "1103"},"type": "Residential","sale_date": "6-18-16","price": "92567"},
{"location": {"city": "Winchester","zip": "01890"},"dimensions": {"sq_ft": "1122"},"type": "Condo","sale_date": "1-31-16","price": "89921"}

The example loads the file sales.json from the internal stage mystage:

-- Create an internal stage with the field delimiter set as none and the record delimiter set as the new line character
create or replace stage mystage
  file_format = (type = 'csv' field_delimiter = none record_delimiter = '\\n');

-- Stage a JSON data file in the internal stage with the default values
put file:///tmp/sales.json @mystage;

-- Query the staged data. The data file comprises three objects in NDJSON format.
select t.$1 from @mystage/sales.json.gz t;

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| $1                                                                                                                                                |
|---------------------------------------------------------------------------------------------------------------------------------------------------|
| {"location": {"city": "Lexington","zip": "40503"},"dimensions": {"sq_ft": "1000"},"type": "Residential","sale_date": "4-25-16","price": "75836"}, |
| {"location": {"city": "Belmont","zip": "02478"},"dimensions": {"sq_ft": "1103"},"type": "Residential","sale_date": "6-18-16","price": "92567"},   |
| {"location": {"city": "Winchester","zip": "01890"},"dimensions": {"sq_ft": "1122"},"type": "Condo","sale_date": "1-31-16","price": "89921"}       |
+---------------------------------------------------------------------------------------------------------------------------------------------------+

-- Create a target table for the JSON data
create or replace table home_sales (
  location variant,
  sq_ft number,
  type string default 'Residential',
  sale_date string,
  price string
);

-- Copy elements from the staged JSON file into the target table
-- Note that all JSON data is stored in a single column ($1)
copy into home_sales(location, sq_ft, sale_date, price)
   from (select parse_json($1):location, parse_json($1):dimensions.sq_ft, parse_json($1):sale_date, parse_json($1):price
   from @mystage/sales.json.gz t);

-- Query the target table
SELECT * from home_sales;

+-------------------------+-------+-------------+-----------+-------+
| LOCATION                | SQ_FT | TYPE        | SALE_DATE | PRICE |
|-------------------------+-------+-------------+-----------+-------|
| {                       |  1000 | Residential | 4-25-16   | 75836 |
|   "city": "Lexington",  |       |             |           |       |
|   "zip": "40503"        |       |             |           |       |
| }                       |       |             |           |       |
| {                       |  1103 | Residential | 6-18-16   | 92567 |
|   "city": "Belmont",    |       |             |           |       |
|   "zip": "02478"        |       |             |           |       |
| }                       |       |             |           |       |
| {                       |  1122 | Residential | 1-31-16   | 89921 |
|   "city": "Winchester", |       |             |           |       |
|   "zip": "01890"        |       |             |           |       |
| }                       |       |             |           |       |
+-------------------------+-------+-------------+-----------+-------+

Flatten JSON Data

FLATTEN is a table function that produces a lateral view of a VARIANT, OBJECT, or ARRAY column. Using the sample data from Load JSON Data into Separate Columns, create a table with a separate row for each element in the JSON objects.

-- Create an internal stage with the field delimiter set as none and the record delimiter set as the new line character
create or replace stage mystage
  file_format = (type = 'csv' field_delimiter = none record_delimiter = '\\n');

-- Stage a JSON data file in the internal stage with the default values
put file:///tmp/sales.json @mystage;

-- Create a table composed of the output from the FLATTEN function
create or replace table flattened_source
(seq string, key string, path string, index string, value variant, element variant)
as
  select
    seq::string
  , key::string
  , path::string
  , index::string
  , value::variant
  , this::variant
  from @mystage/sales.json.gz
    , table(flatten(input => parse_json($1)));

select * from flattened_source;

+-----+-----------+-----------+-------+-------------------------+-----------------------------+
| SEQ | KEY       | PATH      | INDEX | VALUE                   | ELEMENT                     |
|-----+-----------+-----------+-------+-------------------------+-----------------------------|
| 1   | location  | location  | NULL  | {                       | {                           |
|     |           |           |       |   "city": "Lexington",  |   "dimensions": {           |
|     |           |           |       |   "zip": "40503"        |     "sq_ft": "1000"         |
|     |           |           |       | }                       |   },                        |
|     |           |           |       |                         |   "location": {             |
|     |           |           |       |                         |     "city": "Lexington",    |
|     |           |           |       |                         |     "zip": "40503"          |
|     |           |           |       |                         |   },                        |
|     |           |           |       |                         |   "price": "75836",         |
|     |           |           |       |                         |   "sale_date": "4-25-16",   |
|     |           |           |       |                         |   "type": "Residential"     |
|     |           |           |       |                         | }                           |
..
| 3   | type      | type      | NULL  | "Condo"                 | {                           |
|     |           |           |       |                         |   "dimensions": {           |
|     |           |           |       |                         |     "sq_ft": "1122"         |
|     |           |           |       |                         |   },                        |
|     |           |           |       |                         |   "location": {             |
|     |           |           |       |                         |     "city": "Winchester",   |
|     |           |           |       |                         |     "zip": "01890"          |
|     |           |           |       |                         |   },                        |
|     |           |           |       |                         |   "price": "89921",         |
|     |           |           |       |                         |   "sale_date": "1-31-16",   |
|     |           |           |       |                         |   "type": "Condo"           |
|     |           |           |       |                         | }                           |
+-----+-----------+-----------+-------+-------------------------+-----------------------------+

Split JSON Elements and Load as VARIANT Values into Separate Columns

Following the instructions in Load JSON Data into Separate Columns, you can load individual JSON elements into different columns in your target table. Additionally, using the SPLIT function, you can split element values that contain a separator and load them as an array.

For example, split IP addresses on the dot separator in repeating JSON elements. Load the IP addresses as arrays in separate columns:

-- Create an internal stage with the field delimiter set as none and the record delimiter set as the new line character
create or replace stage mystage
  file_format = (type = 'csv' field_delimiter = none record_delimiter = '\\n');

-- Stage a JSON data file in the internal stage
put file:///tmp/ipaddress.json @mystage auto_compress=true;

-- Query the staged data
select t.$1 from @mystage/ipaddress.json.gz t;

+----------------------------------------------------------------------+
| $1                                                                   |
|----------------------------------------------------------------------|
| {"ip_address": {"router1": "192.168.1.1","router2": "192.168.0.1"}}, |
| {"ip_address": {"router1": "192.168.2.1","router2": "192.168.3.1"}}  |
+----------------------------------------------------------------------+

-- Create a target table for the JSON data
create or replace table splitjson (
  col1 array,
  col2 array
  );

-- Split the JSON elements into individual arrays using the SPLIT function and load them into separate columns
-- Note that all JSON data is stored in a single column ($1)
copy into splitjson(col1, col2)
from (
  select split(parse_json($1):ip_address.router1, '.'),split(parse_json($1):ip_address.router2, '.')
  from @mystage/ipaddress.json.gz t
);

-- Query the target table
select * from splitjson;

+----------+----------+
| COL1     | COL2     |
|----------+----------|
| [        | [        |
|   "192", |   "192", |
|   "168", |   "168", |
|   "1",   |   "0",   |
|   "1"    |   "1"    |
| ]        | ]        |
| [        | [        |
|   "192", |   "192", |
|   "168", |   "168", |
|   "2",   |   "3",   |
|   "1"    |   "1"    |
| ]        | ]        |
+----------+----------+

Load Parquet Data into Separate Columns

The following example loads elements from a Parquet file into separate columns in a relational table, casting the values to different data types:

-- Create a file format object that sets the file format type. Accept the default options.
create or replace file format my_parquet_format
  type = 'parquet';

-- Create an internal stage and specify the new file format
create or replace temporary stage mystage
  file_format = my_parquet_format;

-- Create a target table for the data.
create or replace table parquet_col (
  custKey number default NULL,
  orderDate date default NULL,
  orderStatus varchar(100) default NULL,
  price varchar(255)
);

-- Stage a data file in the internal stage
put file:///tmp/mydata.parquet @mystage;

-- Copy data from elements in the staged Parquet file into separate columns
-- in the target table.
-- Note that all Parquet data is stored in a single column ($1)
-- SELECT list items correspond to element names in the Parquet file
-- Cast element values to the target column data type
copy into parquet_col
  from (select
  $1:o_custkey::number,
  $1:o_orderdate::date,
  $1:o_orderstatus::varchar,
  $1:o_totalprice::varchar
  from @mystage/mydata.parquet);

-- Query the target table
SELECT * from parquet_col;

+---------+------------+-------------+-----------+
| CUSTKEY | ORDERDATE  | ORDERSTATUS | PRICE     |
|---------+------------+-------------+-----------|
|   27676 | 1996-09-04 | O           | 83243.94  |
|  140252 | 1994-01-09 | F           | 198402.97 |
..
+---------+------------+-------------+-----------+