Querying Staged Data

Snowflake supports using standard SQL to query data files located in an internal (i.e. Snowflake) stage or named external (i.e S3 or Azure) stage. This can be useful for inspecting/viewing the contents of the staged files, particularly before loading or after unloading data.

In addition, by referencing metadata columns in a staged file, a staged data query can return additional information, such as filename and row numbers, about the file.

Snowflake utilizes support for staged data queries to enable transforming data during loading.

Note

This functionality is primarily for performing simple queries only, particularly when loading and/or transforming data, and is not intended to replace loading data into tables and performing queries on the tables.

In this Topic:

Query Syntax and Parameters

Query staged data files using a SELECT statement with the following syntax:

SELECT [<alias>.]$<file_col_num>[.<element>] [ , [<alias>.]$<file_col_num>[.<element>] , ...  ]
  FROM { <internal_location> | <external_location> }
  [ ( FILE_FORMAT => <named_file_format> ) ]
  [ <alias> ]

For the syntax for transforming data during a load, see COPY INTO <table>.

Required Parameters:

[alias.]$file_col_num[.element] [ , [alias.]$file_col_num[.element] , ...  ]

Specifies an explicit set of fields/columns in data files staged in either an internal or external location, where:

alias
Specifies the optional “table” alias defined, if any, in the FROM clause.
file_col_num
Specifies the positional number of the field/column (in the file) that contains the data to be loaded (1 for the first field, 2 for the second field, etc.)
element
Specifies the path and element name of a repeating value (applies only to semi-structured data files).
internal_location or external_location

Specifies the location where the data files are staged:

  • internal_location is the URI specifier for the location in Snowflake where files containing data are staged:

    @[namespace.]internal_stage_name[/path] Files are in the specified named internal stage.
    @[namespace.]%table_name[/path] Files are in the stage for the specified table.
    @~[/path] Files are in the stage for the current user.
  • external_location is the URI specifier for the named external stage or external location (i.e. S3 bucket or Azure container) where files containing data are staged:

    @[namespace.]external_stage_name[/path] Files are in the specified named external stage.

Where:

  • namespace is the database and/or schema in which the internal or external stage resides. It is optional if a database and schema are currently in use within the user session; otherwise, it is required.
  • The optional path parameter restricts the set of files being queried to the files under the folder prefix. If path is specified, but no file is explicitly named in the path, all data files in the path are queried.

Note

  • The URI string for an external storage location (i.e. S3 bucket or Azure container) must be enclosed in single quotes; however, you can enclose any URI string in single quotes, which allows special characters, including spaces, in location and file names. For example:

    Internal:

    '@~/path 1/file 1.csv'

    '@%my table/path 1/file 1.csv'

    '@my stage/path 1/file 1.csv'

  • Relative path modifiers such as /./ and /../ are interpreted literally, because “paths” are literal prefixes for a name. For example:

    S3:COPY INTO mytable FROM @mystage/./../a.csv

    In these COPY statements, the system look for a file literally named ./../a.csv in the storage location.

Optional Parameters:

( FILE_FORMAT => named_file_format )

Specifies a named file format that describes the format of the staged data files to query.

Note that this parameter is optional if either of the following conditions are true:

  • The files are formatted in the default file format (CSV) with the default delimiters: , (as the field delimiter) and the new line character (as the record delimiter).
  • The files are in an internal or external stage and the stage definition describes the file format.

Otherwise, this parameter is required. For more details, see File Formats (in this topic).

alias
Specifies a “table” alias for the internal/external location where the files are staged.

Supported Functions

Snowflake currently supports the following subset of functions when querying staged data files:

File Formats

To parse a staged data file, it is necessary to describe its file format. The default file format is character-delimited UTF-8 text (i.e. CSV), with the comma character (,) as the field delimiter and new line character as the record delimiter. If the source data is in another format (JSON, Avro, etc.), you must specify the corresponding file format type (and options).

To explicitly specify file format options, set them in one of the following ways:

Querying staged data files:
 

As file format options specified for a named file format or stage object. The named file format/stage object can then be referenced in the SELECT statement.

Loading columns from staged data files:
 
  • As file format options specified directly in the COPY INTO table statement.
  • As file format options specified for a named file format or stage object. The named file format/stage object can then be referenced in the COPY INTO table statement.

Query Examples

Example 1: Querying Columns in a CSV File

The following example illustrates staging multiple CSV data files (with the same file format) and then querying the data columns in the files.

This example assumes the files have the following names and are located in the root directory in a Mac OS or Linux environment:

  • /tmp/data1.csv contains two records:

    a|b
    c|d
    
  • /tmp/data2.csv contains two records:

    e|f
    g|h
    

To stage and query the files:

-- Create a file format
CREATE OR REPLACE FILE FORMAT myformat TYPE = 'csv' FIELD_DELIMITER = '|';

-- Create an internal stage
CREATE OR REPLACE STAGE mystage1;

-- Stage the data files
PUT file:///tmp/data*.csv @mystage1;

-- Query the filename and row number metadata columns and the regular data columns in the staged file
-- Note that the table alias is provided to make the statement easier to read and is not required
SELECT t.$1, t.$2 FROM @mystage1 (file_format => myformat) t;

+----+----+
| $1 | $2 |
|----+----|
| a  | b  |
| c  | d  |
| e  | f  |
| g  | h  |
+----+----+

SELECT t.$1, t.$2 FROM @mystage1 t;

+-----+------+
| $1  | $2   |
|-----+------|
| a|b | NULL |
| c|d | NULL |
| e|f | NULL |
| g|h | NULL |
+-----+------+

Note

The file format is required in this example to correctly parse the fields in the staged files. In the second query, the file format is omitted, causing the | field delimiter to be ignored and resulting in the values returned for $1 and $2.

However, if the file format is included in the stage definition, you can omit it from the SELECT statement. See Example 3: Querying Elements in a JSON File.

Example 2: Calling Functions when Querying a Staged Data File

Get the ASCII code for the first character of each column in the data files staged in Example 1: Querying Columns in a CSV File:

SELECT ascii(t.$1), ascii(t.$2) FROM @mystage1 (file_format => myformat) t;

+-------------+-------------+
| ASCII(T.$1) | ASCII(T.$2) |
|-------------+-------------|
|          97 |          98 |
|          99 |         100 |
|         101 |         102 |
|         103 |         104 |
+-------------+-------------+

Note

If the file format is included in the stage definition, you can omit it from the SELECT statement. See Example 3: Querying Elements in a JSON File.

Example 3: Querying Elements in a JSON File

This example illustrates staging a JSON data file containing the following objects and then querying individual elements within the objects in the file:

{"a": {"b": "x1","c": "y1"}},
{"a": {"b": "x2","c": "y2"}}

This example assumes the file is named /tmp/data1.json and is located in the root directory in a Mac OS or Linux environment.

To stage and query the file:

-- Create a file format
CREATE OR REPLACE FILE FORMAT my_json_format TYPE = 'json';

-- Create an internal stage
CREATE OR REPLACE STAGE mystage2 FILE_FORMAT = my_json_format;

-- Stage the data file
PUT file:///tmp/data1.json @mystage2;

-- Query the repeating :code:`a.b` element in the staged file
SELECT parse_json($1):a.b FROM @mystage2/data1.json.gz;

+--------------------+
| PARSE_JSON($1):A.B |
|--------------------|
| "x1"               |
| "x2"               |
+--------------------+