Semi-structured Data Considerations

This topic provides best practices, general guidelines, and important considerations for loading and working with JSON, Avro, ORC, and Parquet data. The information in this topic does not necessarily apply to XML data.

In this Topic:

Data Size Limitations

The VARIANT data type imposes a 16MB (compressed) size limit on individual rows.

In general, JSON and Avro data sets are a simple concatenation of multiple documents. The JSON or Avro output from some software is composed of a single huge array containing multiple records. There is no need to separate the documents with line breaks or commas, though both are supported.

Instead, we recommend enabling the STRIP_OUTER_ARRAY file format option for the COPY INTO table command to remove the outer array structure and load the records into separate table rows:

COPY INTO <table>
FROM @~/<file>.json
FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = true);

Storing Semi-structured Data in a VARIANT Column vs. Flattening the Nested Structure

If you aren’t sure yet what types of operations you’ll perform on your semi-structured data, we recommend storing it in a VARIANT column for now. For data that is mostly regular and uses only native types (strings and integers), the storage requirements and query performance for operations on relational data and data in a VARIANT column is very similar.

For better pruning and less storage consumption, we recommend flattening your object and key data into separate relational columns if your semi-structured data includes:

  • Dates and timestamps, especially non-ISO 8601 dates and timestamps, as string values
  • Numbers within strings
  • Arrays

Non-native values such as dates and timestamps are stored as strings when loaded into a VARIANT column, so operations on these values could be slower and also consume more space than when stored in a relational column with the corresponding data type.

If you know your use cases for the data, perform tests on a typical data set. Load the data set into a VARIANT column in a table. Use the FLATTEN function to extract the objects and keys you plan to query into a separate table. Run a typical set of queries against both tables to see which structure provides the best performance.

Casting Key-Values

When extracting key-values from a VARIANT column, cast the values to the desired data type (using the :: notation) to avoid unexpected results. For example, when you extract a string key-value without casting, the results are double-quoted (to show that the VARIANT value contains a string and not a different type; i.e. "1" is a string while 1 is a number):

SELECT col1:city;

+----------------------+
| CITY                 |
|----------------------|
| "Los Angeles"        |
+----------------------+

SELECT col1:city::string;

+----------------------+
| CITY                 |
|----------------------|
| Los Angeles          |
+----------------------+

NULL Values

When converted to a string, VARIANT “null” values become a string containing the word “null,” not the SQL NULL value. This behavior allows distinguishing “null” values from values that are absent, which produce a SQL NULL.

Performance Impact of NULL Values

When semi-structured data is inserted into a VARIANT column, Snowflake extracts as much of the data as possible to a columnar form, based on certain rules. The rest is stored as a single column in a parsed semi-structured structure. Currently, fields that contain even a single “null” value are not extracted into a column. This rule ensures that information is not lost, i.e, the difference between VARIANT “null” values and SQL NULL values is not obfuscated.

When a semi-structured field is queried:

  • If the field was extracted into a column, Snowflake’s execution engine (which is columnar) scans only the extracted column.
  • If the field was not extracted into a column, the execution engine must scan the entire JSON structure, and then for each row traverse the structure to output values, impacting performance.

To avoid this performance impact, extract semi-structured data fields containing “null” values into relational columns before loading them.

Alternatively, if the “null” values in your files indicate missing values and have no other special meaning, we recommend setting the file format option STRIP_NULL_VALUES to TRUE when loading the semi-structured data files. This option removes object fields or array elements containing “null” values.

Parsing NULL Values

To output a SQL NULL value from a VARIANT “null” key-value, use the AS_CHAR , AS_VARCHAR function to cast the value as a string.

The following example illustrates the different options:

SELECT column1
  , TO_VARCHAR(PARSE_JSON(column1):a)
  , AS_VARCHAR(PARSE_JSON(column1):a)
FROM
  VALUES('{"a" : null}')
, ('{"b" : "hello"}')
, ('{"a" : "world"}');

+-----------------+-----------------------------------+-----------------------------------+
| COLUMN1         | TO_VARCHAR(PARSE_JSON(COLUMN1):A) | AS_VARCHAR(PARSE_JSON(COLUMN1):A) |
|-----------------+-----------------------------------+-----------------------------------|
| {"a" : null}    | null                              | NULL                              |
| {"b" : "hello"} | NULL                              | NULL                              |
| {"a" : "world"} | world                             | world                             |
+-----------------+-----------------------------------+-----------------------------------+

Using FLATTEN to Filter the Results in a WHERE Clause

The FLATTEN function explodes nested values into separate columns. You can use the function to filter query results in a WHERE clause.

The following example returns key-value pairs that match a WHERE clause and displays them in separate columns:

CREATE TABLE pets (v variant);

INSERT INTO pets SELECT PARSE_JSON ('{"species":"dog", "name":"Fido", "is_dog":"true"} ');
INSERT INTO pets SELECT PARSE_JSON ('{"species":"cat", "name":"Bubby", "is_dog":"false"}');
INSERT INTO pets SELECT PARSE_JSON ('{"species":"cat", "name":"dog terror", "is_dog":"false"}');

SELECT a.v, b.key, b.value FROM pets a,LATERAL FLATTEN(input => a.v) b
WHERE b.value LIKE '%dog%';

+-------------------------+---------+--------------+
| V                       | KEY     | VALUE        |
|-------------------------+---------+--------------|
| {                       | species | "dog"        |
|   "is_dog": "true",     |         |              |
|   "name": "Fido",       |         |              |
|   "species": "dog"      |         |              |
| }                       |         |              |
| {                       | name    | "dog terror" |
|   "is_dog": "false",    |         |              |
|   "name": "dog terror", |         |              |
|   "species": "cat"      |         |              |
| }                       |         |              |
+-------------------------+---------+--------------+

Using FLATTEN to List Distinct Key Names

When working with unfamiliar semi-structured data, you might not know the key names in an object. You can use the FLATTEN function with the RECURSIVE argument to return the list of distinct key names in all nested elements in an object:

SELECT REGEXP_REPLACE(f.path, '\\[[0-9]+\\]', '[]') AS "Path",
  TYPEOF(f.value) AS "Type",
  COUNT(*) AS "Count"
FROM <table>,
LATERAL FLATTEN(<variant_column>, RECURSIVE=>true) f
GROUP BY 1, 2 ORDER BY 1, 2;

The REGEXP_REPLACE function removes the array index values (e.g. [0]) and replaces them with brackets ([]) to group array elements.

For example:

{"a": 1, "b": 2, "special" : "data"}   <--- row 1 of VARIANT column
{"c": 3, "d": 4, "normal" : "data"}    <----row 2 of VARIANT column

Output from query:

+---------+---------+-------+
| Path    | Type    | Count |
|---------+---------+-------|
| a       | INTEGER |     1 |
| b       | INTEGER |     1 |
| c       | INTEGER |     1 |
| d       | INTEGER |     1 |
| normal  | VARCHAR |     1 |
| special | VARCHAR |     1 |
+---------+---------+-------+

Using FLATTEN to List Paths in Objects

Related to Using FLATTEN to List Distinct Key Names, you can use the FLATTEN function with the RECURSIVE argument to retrieve all keys and paths in an object.

The following query returns keys, paths, and values (including VARIANT “null” values) for all data types stored in objects:

SELECT
  t.<variant_column>,
  f.seq,
  f.key,
  f.path,
  REGEXP_COUNT(f.path,'\\.|\\[') +1 AS Level,
  TYPEOF(f.value) AS "Type",
  f.index,
  f.value AS "Current Level Value",
  f.this AS "Above Level Value"
FROM <table> t,
LATERAL FLATTEN(t.<variant_column>, recursive=>true) f;

The following query is similar to the first query, but excludes nested objects and arrays:

SELECT
  t.<variant_column>,
  f.seq,
  f.key,
  f.path,
  REGEXP_COUNT(f.path,'\\.|\\[') +1 AS Level,
  TYPEOF(f.value) AS "Type",
  f.value AS "Current Level Value",
  f.this AS "Above Level Value"
FROM <table> t,
LATERAL FLATTEN(t.<variant_column>, recursive=>true) f
WHERE TYPE NOT IN ('OBJECT','ARRAY');

The queries return the following values:

<variant_column>
Object stored as a row in the VARIANT column.
Seq
Unique sequence number associated with the data in the row.
Key
String associated with a value in the data structure.
Path
Path to the element within the data structure.
Level
Level of the key-value pair within the data structure.
Type
Data type for the value.
Index
Index of the element in the data structure. Applies to arrays only; otherwise NULL.
Current Level Value
Value at the current level in the data structure.
Above Level Value
Value one level higher in the data structure.