
Transforming Data During a Load

Snowflake supports transforming data while loading it into a table using the COPY INTO <table> command, dramatically simplifying your ETL pipeline for basic transformations. This feature helps you avoid the use of temporary tables to store pre-transformed data when reordering columns during a data load.

The COPY command supports:

  • Column reordering, column omission, and casts using a SELECT statement. There is no requirement for your CSV files to have the same number and ordering of columns as your destination table.
  • The ENFORCE_LENGTH or TRUNCATECOLUMNS option, which can truncate text strings that exceed the target column length (see the example after this list).
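
For example, to allow strings that are longer than the target column to be truncated during the load instead of producing an error, you could add TRUNCATECOLUMNS = TRUE (or, equivalently, ENFORCE_LENGTH = FALSE) to the COPY statement. A minimal sketch, reusing the home_sales table, mycsvformat file format, and mystage stage from the examples later in this topic:

copy into home_sales(city, zip, sale_date, price)
   from (select t.$1, t.$2, t.$6, t.$7 from @mystage/sales.csv.gz t)
   file_format = (format_name = mycsvformat)
   truncatecolumns = true;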

Important

  • The COPY command supports both CSV and semi-structured file types for data transformations; however, even when loading semi-structured data such as JSON, you should set CSV as the file format type (the default value). You could use the JSON file format, but any error in the transformation stops the COPY operation, even if the ON_ERROR option is set to continue or skip the file.
  • To take advantage of Snowflake’s error management policies, structure your JSON data as newline delimited JSON (NDJSON) when loading JSON data into separate columns; see the illustration after this note.
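
For reference, newline delimited JSON simply places each complete JSON object on its own line, with no enclosing array and no comma between objects. A minimal illustration (the field names are for demonstration only):

{"city": "Lexington", "zip": "40503"}
{"city": "Belmont", "zip": "02478"}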


Supported Functions

Snowflake supports only a subset of SQL functions when using a query as the source for the COPY command. The functions used in the examples in this topic, including SUBSTR, TO_BINARY, TO_DECIMAL, TO_TIMESTAMP_NTZ, PARSE_JSON, and SPLIT, are part of that subset.

Loading a Subset of Table Data

Load a subset of data into a table. For any missing columns, Snowflake inserts the default values. The following example loads data from columns 1, 2, 6, and 7 of a staged CSV file:

copy into home_sales(city, zip, sale_date, price)
   from (select t.$1, t.$2, t.$6, t.$7 from @mystage/sales.csv.gz t)
   file_format = (format_name = mycsvformat)
   on_error = 'continue';

Reordering CSV Columns During a Load

The following example reorders the column data from a staged CSV file before loading it into a table. Additionally, the COPY statement uses the SUBSTR / SUBSTRING function to remove the first few characters of a string before inserting it:

copy into home_sales(city, zip, sale_date, price)
   from (select SUBSTR(t.$2,4), t.$1, t.$5, t.$4 from @mystage t)
   file_format = (format_name = mycsvformat)
   on_error = 'continue';

Converting Data Types During a Load

Convert staged data into other data types during a data load. All conversion functions are supported.

For example, convert strings to binary values, decimals, or timestamps using the TO_BINARY, TO_DECIMAL / TO_NUMBER / TO_NUMERIC, and TO_TIMESTAMP / TO_TIMESTAMP_* functions, respectively.

Sample CSV file:

snowflake,2.8,2016-10-5
warehouse,-12.3,2017-01-23

SQL statements:

-- Stage a data file in the internal user stage
PUT file:///tmp/datafile.csv @~;

-- Query the staged data file
select t.$1,t.$2,t.$3 from @~/datafile.csv.gz t;

-- Create the destination table
create or replace table casttb (
  col1 binary,
  col2 decimal,
  col3 timestamp_ntz
  );

-- Convert the staged CSV column data to the specified data types before loading it into the destination table
copy into casttb(col1, col2, col3)
from (
  select to_binary(t.$1, 'utf-8'),to_decimal(t.$2, '99.9', 9, 5),to_timestamp_ntz(t.$3)
  from @~/datafile.csv.gz t
)
file_format = (type = csv);

-- Query the destination table
select * from casttb;

+--------------------+------+-------------------------+
| COL1               | COL2 | COL3                    |
|--------------------+------+-------------------------|
| 736E6F77666C616B65 |    3 | 2016-10-05 00:00:00.000 |
| 77617265686F757365 |  -12 | 2017-01-23 00:00:00.000 |
+--------------------+------+-------------------------+

Including Sequence Columns in Loaded Data

There are multiple ways to include sequence columns in loaded data.

Including a Sequence Object in the SELECT Statement

Create a sequence object using CREATE SEQUENCE. When loading data into a table using the COPY command, access the object using a nextval expression to sequence the data in a destination number column. For more information about using sequences in queries, see Using Sequences in Queries.

-- Create a sequence
create sequence seq;

-- Create the destination table
create or replace table mytable (
  col1 number default seq.nextval,
  col2 varchar,
  col3 varchar
  );

-- Stage a data file in the internal user stage
PUT file:///tmp/myfile.csv @~;

-- Query the staged data file
select $1, $2 from @~/myfile.csv.gz t;

+-----+-----+
| $1  | $2  |
|-----+-----|
| abc | def |
| ghi | jkl |
| mno | pqr |
| stu | vwx |
+-----+-----+

-- Include the sequence nextval expression in the COPY statement
copy into mytable (col1, col2, col3)
from (
  select seq.nextval, $1, $2
  from @~/myfile.csv.gz t
)
;

select * from mytable;

+------+------+------+
| COL1 | COL2 | COL3 |
|------+------+------|
|    1 | abc  | def  |
|    2 | ghi  | jkl  |
|    3 | mno  | pqr  |
|    4 | stu  | vwx  |
+------+------+------+

Omitting the Sequence Column in the SELECT Statement

Set the AUTOINCREMENT default value for a number column. When loading data into a table using the COPY command, omit the sequence column in the SELECT statement. The statement automatically populates the column.

-- Create the destination table
create or replace table mytable (
  col1 number autoincrement start 1 increment 1,
  col2 varchar,
  col3 varchar
  );

-- Stage a data file in the internal user stage
PUT file:///tmp/myfile.csv @~;

-- Query the staged data file
select $1, $2 from @~/myfile.csv.gz t;

+-----+-----+
| $1  | $2  |
|-----+-----|
| abc | def |
| ghi | jkl |
| mno | pqr |
| stu | vwx |
+-----+-----+

-- Omit the sequence column in the COPY statement
copy into mytable (col2, col3)
from (
  select $1, $2
  from @~/myfile.csv.gz t
)
;

select * from mytable;

+------+------+------+
| COL1 | COL2 | COL3 |
|------+------+------|
|    1 | abc  | def  |
|    2 | ghi  | jkl  |
|    3 | mno  | pqr  |
|    4 | stu  | vwx  |
+------+------+------+

Loading JSON Data into Separate Columns

The following example loads repeating elements from a staged JSON file into separate table columns with different data types.

In this example, we are loading the following JSON data into separate columns in a relational table, with the location object values loaded into a VARIANT column and the remaining values loaded into relational columns:

-- Sample data:
{"location": {"city": "Lexington","zip": "40503"},"dimensions": {"sq_ft": "1000"},"type": "Residential","sale_date": "4-25-16","price": "75836"},
{"location": {"city": "Belmont","zip": "02478"},"dimensions": {"sq_ft": "1103"},"type": "Residential","sale_date": "6-18-16","price": "92567"},
{"location": {"city": "Winchester","zip": "01890"},"dimensions": {"sq_ft": "1122"},"type": "Condo","sale_date": "1-31-16","price": "89921"}

The example loads the file sales.json from the internal stage mystage:

-- Create an internal stage with the field delimiter set as none and the record delimiter set as the new line character
create or replace stage mystage
  file_format = (type = 'csv' field_delimiter = none record_delimiter = '\\n');

-- Stage a JSON data file in the internal stage with the default values
put file:///tmp/sales.json @mystage;

-- Query the staged data. The data file comprises three objects in NDJSON format.
select t.$1 from @mystage/sales.json.gz t;

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| $1                                                                                                                                                |
|---------------------------------------------------------------------------------------------------------------------------------------------------|
| {"location": {"city": "Lexington","zip": "40503"},"dimensions": {"sq_ft": "1000"},"type": "Residential","sale_date": "4-25-16","price": "75836"}, |
| {"location": {"city": "Belmont","zip": "02478"},"dimensions": {"sq_ft": "1103"},"type": "Residential","sale_date": "6-18-16","price": "92567"},   |
| {"location": {"city": "Winchester","zip": "01890"},"dimensions": {"sq_ft": "1122"},"type": "Condo","sale_date": "1-31-16","price": "89921"}       |
+---------------------------------------------------------------------------------------------------------------------------------------------------+

-- Create a destination table for the JSON data
create or replace table home_sales (
  location variant,
  sq_ft number,
  type string default 'Residential',
  sale_date string,
  price string
);

-- Copy elements from the staged JSON file into the destination table
-- Note that all JSON data is stored in a single column ($1)
copy into home_sales(location, sq_ft, sale_date, price)
   from (select parse_json($1):location, parse_json($1):dimensions.sq_ft, parse_json($1):sale_date, parse_json($1):price
   from @mystage/sales.json.gz t)
   on_error = 'continue';

-- Query the destination table
select * from home_sales;

+-------------------------+-------+-------------+-----------+-------+
| LOCATION                | SQ_FT | TYPE        | SALE_DATE | PRICE |
|-------------------------+-------+-------------+-----------+-------|
| {                       |  1000 | Residential | 4-25-16   | 75836 |
|   "city": "Lexington",  |       |             |           |       |
|   "zip": "40503"        |       |             |           |       |
| }                       |       |             |           |       |
| {                       |  1103 | Residential | 6-18-16   | 92567 |
|   "city": "Belmont",    |       |             |           |       |
|   "zip": "02478"        |       |             |           |       |
| }                       |       |             |           |       |
| {                       |  1122 | Residential | 1-31-16   | 89921 |
|   "city": "Winchester", |       |             |           |       |
|   "zip": "01890"        |       |             |           |       |
| }                       |       |             |           |       |
+-------------------------+-------+-------------+-----------+-------+

Splitting JSON Elements and Loading as VARIANT Values into Separate Columns

Following the instructions in Loading JSON Data into Separate Columns, you can load individual JSON elements into different columns in your destination table. Additionally, using the SPLIT function, you can split element values that contain a separator and load them as an array.

For example, split IP addresses on the dot separator in repeating JSON elements. Load the IP addresses as arrays in separate columns:

-- Create an internal stage with the field delimiter set as none and the record delimiter set as the new line character
create or replace stage mystage
  file_format = (type = 'csv' field_delimiter = none record_delimiter = '\\n');

-- Stage a JSON data file in the internal stage
put file:///tmp/ipaddress.json @mystage auto_compress=true;

-- Query the staged data
select t.$1 from @mystage/ipaddress.json.gz t;

+----------------------------------------------------------------------+
| $1                                                                   |
|----------------------------------------------------------------------|
| {"ip_address": {"router1": "192.168.1.1","router2": "192.168.0.1"}}, |
| {"ip_address": {"router1": "192.168.2.1","router2": "192.168.3.1"}}  |
+----------------------------------------------------------------------+

-- Create a destination table for the JSON data
create or replace table splitjson (
  col1 array,
  col2 array
  );

-- Split the JSON elements into individual arrays using the SPLIT function and load them into separate columns
-- Note that all JSON data is stored in a single column ($1)
copy into splitjson(col1, col2)
from (
  select split(parse_json($1):ip_address.router1, '.'),split(parse_json($1):ip_address.router2, '.')
  from @mystage/ipaddress.json.gz t
);

-- Query the destination table
select * from splitjson;

+----------+----------+
| COL1     | COL2     |
|----------+----------|
| [        | [        |
|   "192", |   "192", |
|   "168", |   "168", |
|   "1",   |   "0",   |
|   "1"    |   "1"    |
| ]        | ]        |
| [        | [        |
|   "192", |   "192", |
|   "168", |   "168", |
|   "2",   |   "3",   |
|   "1"    |   "1"    |
| ]        | ]        |
+----------+----------+

Loading Parquet Data into Separate Columns

The following example loads elements from a staged Parquet file into separate columns of a relational table with different data types:

-- Create a file format object that sets the file format type. Accept the default options.
create or replace file format my_parquet_format
  type = 'parquet';

-- Create an internal stage and specify the new file format
create or replace temporary stage mystage
  file_format = my_parquet_format;

-- Create a destination table for the data.
create or replace table parquet_col (
  custKey number default NULL,
  orderDate date default NULL,
  orderStatus varchar(100) default NULL,
  price varchar(255)
);

-- Stage a data file in the internal stage
put file:///tmp/mydata.parquet @mystage;

-- Copy data from elements in the staged Parquet file into separate columns
-- in the destination table.
-- Note that all Parquet data is stored in a single column ($1)
-- SELECT list items correspond to element names in the Parquet file
-- Cast element values to the target column data type
copy into parquet_col
  from (select
  $1:o_custkey::number,
  $1:o_orderdate::date,
  $1:o_orderstatus::varchar,
  $1:o_totalprice::varchar
  from @mystage/mydata.parquet);

-- Query the destination table
select * from parquet_col;

+---------+------------+-------------+-----------+
| CUSTKEY | ORDERDATE  | ORDERSTATUS | PRICE     |
|---------+------------+-------------+-----------|
|   27676 | 1996-09-04 | O           | 83243.94  |
|  140252 | 1994-01-09 | F           | 198402.97 |
..
+---------+------------+-------------+-----------+