Querying Semi-structured Data

Snowflake supports SQL queries that access semi-structured data using special operators and functions. Note that this topic applies to JSON, Avro, ORC, and Parquet data. The information does not apply to XML.

In this Topic:

Sample Data Used in Examples

Except where noted, the examples in this topic refer to a table named customers that contains a single VARIANT column named src:

CREATE OR REPLACE TABLE customers
(
  SRC Variant
)
AS
SELECT PARSE_JSON(column1) AS src
FROM VALUES
('{
    "name": "Patrick Jones",
    "email": "jones.patrick@orbin.ca",
    "phone": {
      "cell": "555-407-1234",
      "office": "555-407-5678"
    },
    "products_purchased":[
        {"id":"1", "version":"2.4"},
        {"id":"2", "version":"1.3"}
    ]
}'),
('{
    "name": "Anna Glass",
    "email": "glass.anna@snips.ca",
    "phone": {
      "cell": "555-407-9012",
      "office": "555-407-3456"
    },
    "products_purchased":[
        {"id":"1", "version":"2.7"}
    ]
}') v;

SELECT * FROM customers;

+--------------------------------------+
| SRC                                  |
|--------------------------------------|
| {                                    |
|   "email": "jones.patrick@orbin.ca", |
|   "name": "Patrick Jones",           |
|   "phone": {                         |
|     "cell": "555-407-1234",          |
|     "office": "555-407-5678"         |
|   },                                 |
|   "products_purchased": [            |
|     {                                |
|       "id": "1",                     |
|       "version": "2.4"               |
|     },                               |
|     {                                |
|       "id": "2",                     |
|       "version": "1.3"               |
|     }                                |
|   ]                                  |
| }                                    |
| {                                    |
|   "email": "glass.anna@snips.ca",    |
|   "name": "Anna Glass",              |
|   "phone": {                         |
|     "cell": "555-407-9012",          |
|     "office": "555-407-3456"         |
|   },                                 |
|   "products_purchased": [            |
|     {                                |
|       "id": "1",                     |
|       "version": "2.7"               |
|     }                                |
|   ]                                  |
| }                                    |
+--------------------------------------+

Traversing Semi-structured Data

Insert a colon : between the VARIANT column name and a first-level element: <column>:<level1_element>.

Note

In the following examples, the query output is enclosed in double quotes. Operators : and subsequent . and [] always return VARIANT values containing strings by default, not VARCHAR values.

Get all customer names:

SELECT src:name FROM customers;

+-----------------+
| SRC:NAME        |
|-----------------|
| "Patrick Jones" |
| "Anna Glass"    |
+-----------------+

Use dot-notation to traverse a path in a JSON object: <column>:<level1_element>.<level2_element>.<level3_element>. Optionally enclose element names in double quotes: <column>:"<level1_element>"."<level2_element>"."<level3_element>”.

Note

If an element name contains spaces, you must enclose the name in double quotes, e.g.:

SELECT src:"company name" FROM partners;

Get the cell phone numbers for all customers:

SELECT src:phone.cell FROM customers;

+----------------+
| SRC:PHONE.CELL |
|----------------|
| "555-407-1234" |
| "555-407-9012" |
+----------------+

Alternatively, use bracket-notation to traverse the path in an object: <column>['<level1_element>']['<level2_element>']. Values are retrieved as strings. Enclose element names in single quotes.

Get the cell phone numbers for all customers:

SELECT src['phone']['cell'] FROM customers;

+----------------------+
| SRC['PHONE']['CELL'] |
|----------------------|
| "555-407-1234"       |
| "555-407-9012"       |
+----------------------+

Note

Regardless of which notation you use, the column name is case-insensitive but element names are case-sensitive. For example, in the following list, the first two paths are equivalent, but the third is not:

  • src:phone.cell
  • SRC:phone.cell
  • src:Phone.Cell

Retrieving an Instance of a Repeating Element

Retrieve a specific numbered instance of a child element in a repeating array by adding a numbered predicate (starting from 0) to the array reference.

Get the ID and version of the first product purchased by customers:

SELECT src:products_purchased[0] FROM customers;

+---------------------------+
| SRC:PRODUCTS_PURCHASED[0] |
|---------------------------|
| {                         |
|   "id": "1",              |
|   "version": "2.4"        |
| }                         |
| {                         |
|   "id": "1",              |
|   "version": "2.7"        |
| }                         |
+---------------------------+

Get the version of the first product purchased by each customer:

SELECT src:products_purchased[0].version FROM customers;

+-----------------------------------+
| SRC:PRODUCTS_PURCHASED[0].VERSION |
|-----------------------------------|
| "2.4"                             |
| "2.7"                             |
+-----------------------------------+

Explicitly Casting Values as Data Types

By default, values retrieved from a VARIANT column are returned as string literals. To avoid unexpected results, explicitly cast the values to the desired data type (using :: notation).

Get all customer cell phone numbers as string values:

SELECT src:phone.cell::string FROM customers;

+------------------------+
| SRC:PHONE.CELL::STRING |
|------------------------|
| 555-407-1234           |
| 555-407-9012           |
+------------------------+

Using the FLATTEN Function to Parse Arrays

Parse an array using the FLATTEN function. FLATTEN is a table function that produces a lateral view of a VARIANT, OBJECT, or ARRAY column. The function returns a row for each object, and the LATERAL modifier joins the data with any information outside of the object.

Get the IDs and versions of customer-purchased products. Cast the VARIANT output to number or string values:

SELECT
  value:id::number,value:version::string
  FROM
    customers
  , LATERAL FLATTEN(INPUT => SRC:products_purchased);

+------------------+-----------------------+
| VALUE:ID::NUMBER | VALUE:VERSION::STRING |
|------------------+-----------------------|
|                1 | 2.4                   |
|                2 | 1.3                   |
|                1 | 2.7                   |
+------------------+-----------------------+

Parsing Text as VARIANT Values Using the PARSE_JSON Function

Parse text as a JSON document, producing a VARIANT value, using the PARSE_JSON function.

If the input is NULL, the output will also be NULL. However, if the input string is null, it is interpreted as a VARIANT null value; that is, the result is not a SQL NULL but a real value used to represent a null value in semi-structured formats.

INSERT INTO customers
  SELECT PARSE_JSON(column1)
  FROM VALUES
  ('{
    "name": "Ian Fields",
    "email": "ifields@lectus.ca",
    "phone": {
      "cell": "555-745-9463",
      "office": "555-745-2394"
    },
    "products_purchased":[
        {"id":"1", "version":"2.8"}
    ]
    }');

select * from customers where contains(src, 'fields');

+---------------------------------+
| SRC                             |
|---------------------------------|
| {                               |
|   "email": "ifields@lectus.ca", |
|   "name": "Ian Fields",         |
|   "phone": {                    |
|     "cell": "555-745-9463",     |
|     "office": "555-745-2394"    |
|   },                            |
|   "products_purchased": [       |
|     {                           |
|       "id": "1",                |
|       "version": "2.8"          |
|     }                           |
|   ]                             |
| }                               |
+---------------------------------+

Extracting Values Using the GET Function

GET accepts a VARIANT, OBJECT, or ARRAY value as the first argument and extracts the VARIANT value of the element in the path provided as the second argument.

Compute and extract the last element of each array in a VARIANT column using the GET and ARRAY_SIZE functions. ARRAY_SIZE returns the size of the input array:

Note

This example departs from the standard customers table used elsewhere in this topic.

CREATE OR replace TABLE colors (v variant);

INSERT INTO
   colors
   SELECT
      parse_json(column1) AS v
   FROM
   VALUES
     ('[{r:255,g:12,b:0},{r:0,g:255,b:0},{r:0,g:0,b:255}]'),
     ('[{c:0,m:1,y:1,k:0},{c:1,m:0,y:1,k:0},{c:1,m:1,y:0,k:0}]')
    v;

SELECT *, GET(v, ARRAY_SIZE(v)-1) FROM colors;

+---------------+-------------------------+
| V             | GET(V, ARRAY_SIZE(V)-1) |
|---------------+-------------------------|
| [             | {                       |
|   {           |   "b": 255,             |
|     "b": 0,   |   "g": 0,               |
|     "g": 12,  |   "r": 0                |
|     "r": 255  | }                       |
|   },          |                         |
|   {           |                         |
|     "b": 0,   |                         |
|     "g": 255, |                         |
|     "r": 0    |                         |
|   },          |                         |
|   {           |                         |
|     "b": 255, |                         |
|     "g": 0,   |                         |
|     "r": 0    |                         |
|   }           |                         |
| ]             |                         |
| [             | {                       |
|   {           |   "c": 1,               |
|     "c": 0,   |   "k": 0,               |
|     "k": 0,   |   "m": 1,               |
|     "m": 1,   |   "y": 0                |
|     "y": 1    | }                       |
|   },          |                         |
|   {           |                         |
|     "c": 1,   |                         |
|     "k": 0,   |                         |
|     "m": 0,   |                         |
|     "y": 1    |                         |
|   },          |                         |
|   {           |                         |
|     "c": 1,   |                         |
|     "k": 0,   |                         |
|     "m": 1,   |                         |
|     "y": 0    |                         |
|   }           |                         |
| ]             |                         |
+---------------+-------------------------+

Extracting Values by Path Using the GET_PATH Function

Extract a value from a VARIANT column using the GET_PATH , : function. The function is a variation of GET, used to extract a value using a path name. GET_PATH is equivalent to a chain of GET functions.

Get the version numbers for the first product purchased by each customer:

SELECT GET_PATH(src, 'products_purchased[0]:version') FROM customers;

+------------------------------------------------+
| GET_PATH(SRC, 'PRODUCTS_PURCHASED[0]:VERSION') |
|------------------------------------------------|
| "2.4"                                          |
| "2.7"                                          |
+------------------------------------------------+

Traversing Semi-structured Data describes the path syntax used to retrieve elements in a VARIANT column. The syntax is shorthand for the GET or GET_PATH , : function. Unlike the path syntax, these functions can handle irregular paths or path elements.

The following queries produce the same results:

SELECT GET_PATH(src, 'products_purchased[0].version') FROM customers;

SELECT src:products_purchased[0].version FROM customers;

The following query also produces the same results, but uses SQL-style double-quoted identifiers and : as the path separator:

SELECT GET_PATH('src:"products_purchased"[0]:"version"') FROM customers;

Parsing Arrays Directly from an Uploaded Data File

Assume a file named contacts.json.gz containing the following data is staged in the staging location for the customers table:

{
    "root": [
        {
            "employees": [
                {
                    "firstName": "Anna",
                    "lastName": "Smith"
                },
                {
                    "firstName": "Peter",
                    "lastName": "Jones"
                }
            ]
        }
    ]
}

Also assume a file format named my_json_format created in the same database as the customers table describes the semi-structured data type.

Query the name of the first employee in the staged file:

SELECT 'The First Employee Record is '||
    S.$1:root[0].employees[0].firstName||
    ' '||S.$1:root[0].employees[0].lastName
FROM @%customers/contacts.json.gz (file_format => 'my_json_format') as S;

+----------------------------------------------+
| 'THE FIRST EMPLOYEE RECORD IS '||            |
|      S.$1:ROOT[0].EMPLOYEES[0].FIRSTNAME||   |
|      ' '||S.$1:ROOT[0].EMPLOYEES[0].LASTNAME |
|----------------------------------------------|
| The First Employee Record is Anna Smith      |
+----------------------------------------------+