Script: Loading and Unloading Parquet Data

The annotated scripts in this tutorial describe a Parquet data workflow:

  • Script 1. Loads sample Parquet data into separate columns in a relational table directly from staged data files, avoiding the need for a staging table. In this example, COPY INTO <table> loads separate data into a pair of VARCHAR columns and a VARIANT column.

  • Script 2. Unloads relational Snowflake table data into separate columns in a Parquet file. (When the Parquet file type is specified, COPY INTO <location> unloads to a single column by default.)

The following is a representative row in the sample Parquet file:

{
  "continent": "Europe",
  "country": {
    "city": {
      "bag": [
        {
          "array_element": "Paris"
        },
        {
          "array_element": "Nice"
        },
        {
          "array_element": "Marseilles"
        },
        {
          "array_element": "Cannes"
        }
      ]
    },
    "name": "France"
  }
}

Execute the script using the Snowflake CLI client (SnowSQL). Note that PUT cannot be executed from the Worksheets Worksheet tab page in the Snowflake web interface.

In this Topic:

Prerequisites

  • Requires an active, running virtual warehouse.

  • Download the sample Parquet data file. Right-click the name of the file, cities.parquet, and save the link/file to your local file system.

    Copy the file to your temporary folder:

    • macOS or Linux: //tmp

    • Windows: Open an Explorer window, and enter %TEMP% in the address bar.

SQL Script 1. Loading Parquet Data

Note

The PUT statement in this script assumes you are using a macOS or Linux environment. If you are using Windows, adjust the statement as noted in the statement comments.

/* Create a target relational table for the Parquet data. The table is temporary, meaning it persists only for the duration
of the user session and is not visible to other users. */

create or replace temporary table cities (
  continent varchar default NULL,
  country varchar default NULL,
  city variant default NULL
);

-- Create a file format object that specifies the Parquet file format type. Accepts the default values for other options.
create or replace file format sf_tut_parquet_format
  type = 'parquet';

/* Create a temporary internal stage that references the file format object.

  Similar to temporary tables, temporary stages are automatically dropped at the end of the session. */

create or replace temporary stage sf_tut_stage
  file_format = sf_tut_parquet_format;

/* Stage the data file.

Note that the example PUT statement references the macOS or Linux location of the data file.
If you are using Windows, execute the following statement instead:
put %TEMP%/cities.parquet @sf_tut_stage; */

put file:///tmp/cities.parquet @sf_tut_stage;

/* Load the Parquet data into the relational table.

A SELECT query in the COPY statement identifies a numbered set of columns in the data files you are loading from. Note that all Parquet data is stored in a single column ($1).

Cast element values to the target column data type. */

copy into cities
  from (select
  $1:continent::varchar,
  $1:country:name::varchar,
  $1:country:city.bag::variant
  from @sf_tut_stage/cities.parquet);

-- Query the relational table
SELECT * from cities;

SQL Script 2. Unloading Parquet Data

/* Unload the CITIES table columns into a Parquet file. Optionally flatten the CITY column array and unload the child elements to a separate column. */
/* To retain the table schema in the output file, use a simple SELECT statement, e.g. SELECT * FROM cities. */
/* To retain the column names in the output file, use the HEADER = TRUE copy option. */

copy into @sf_tut_stage/out/parquet_ from (
select
  continent
  , country
  , c.value:array_element::string
  from
    cities
  , lateral flatten(input => city) c)
  file_format = (type = 'parquet')
  header = true;

-- Query the staged Parquet file.
select t.$1 from @sf_tut_stage/out/ t;