Semi-structured Data Considerations

This topic provides best practices, general guidelines, and important considerations for loading and working with JSON, Avro, ORC, and Parquet data. The information does not necessarily apply to XML.

In this Topic:

Data Size Limitations

The VARIANT data type imposes a size limit on individual rows that depends on the depth and complexity of the data structure. Limit the size of JSON or Avro objects to less than 16 MB (ideally, 8 MB).

In general, JSON and Avro data sets are a simple concatenation of multiple documents. There is no need to separate the documents with line breaks or commas, though both are supported.

The JSON or Avro output from some software is composed of a single huge array containing multiple records. We recommend enabling the SKIP_OUTER_ARRAY file format option for the COPY INTO table command to remove the outer array structure and load the records into separate table rows:

COPY INTO <table>
FROM @~/<file>.json
FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = true);

Storing Semi-structured Data in a VARIANT Column vs. Flattening the Nested Structure

If you aren’t sure yet what types of operations you’ll perform on your semi-structured data, we recommend storing it in a VARIANT column for now. For data that is mostly regular and uses only native types (strings and integers), the storage requirements and query performance for operations on relational data and data in a VARIANT column is very similar.

For better pruning and less storage consumption, we recommend flattening your object and key data into separate relational columns if your semi-structured data includes:

  • Dates and timestamps, especially non-ISO 8601 dates and timestamps, as string values
  • Numbers within strings
  • Arrays

Non-native values such as dates and timestamps are stored as strings when loaded into a VARIANT column, so operations on these values could be slower and also consume more space than when stored in a relational column with the corresponding data type.

If you know your use cases for the data, perform tests on a typical data set. Load the data set into a VARIANT column in a table. Use the FLATTEN function to extract the objects and keys you plan to query into a separate table. Run a typical set of queries against both tables to see which structure provides the best performance.

Casting Key-Values

When extracting key-values from a VARIANT column, cast the values to the desired data type (using the :: notation) to avoid unexpected results. For example, when you extract a string key-value without casting, the results are double-quoted (to show that the VARIANT value contains a string and not a different type; i.e. "1" is a string while 1 is a number):

SELECT col1:city;

+----------------------+
| CITY                 |
|----------------------|
| "Los Angeles"        |
+----------------------+

SELECT col1:city::string;

+----------------------+
| CITY                 |
|----------------------|
| Los Angeles          |
+----------------------+

Parsing NULL Values

When converted to a string, VARIANT “null” values become a string containing the word “null,” not the SQL NULL value. This behavior allows distinguishing “null” values from values that are absent, which produce a SQL NULL.

To output a SQL NULL value from a VARIANT “null” key-value, use the AS_CHAR , AS_VARCHAR function to cast the value as a string.

The following example illustrates the different options:

SELECT column1
  , TO_VARCHAR(PARSE_JSON(column1):a)
  , AS_VARCHAR(PARSE_JSON(column1):a)
FROM
  VALUES('{"a" : null}')
, ('{"b" : "hello"}')
, ('{"a" : "world"}');

+-----------------+-----------------------------------+-----------------------------------+
| COLUMN1         | TO_VARCHAR(PARSE_JSON(COLUMN1):A) | AS_VARCHAR(PARSE_JSON(COLUMN1):A) |
|-----------------+-----------------------------------+-----------------------------------|
| {"a" : null}    | null                              | NULL                              |
| {"b" : "hello"} | NULL                              | NULL                              |
| {"a" : "world"} | world                             | world                             |
+-----------------+-----------------------------------+-----------------------------------+

If the “null” values in your documents indicate missing values and have no other special meaning, we recommend setting the file format option STRIP_NULL_VALUES to TRUE for the COPY INTO table command when loading the semi-structured data files. Retaining the “null” values often wastes storage and slows query processing.

Using FLATTEN to Filter the Results in a WHERE Clause

The FLATTEN function explodes nested values into separate columns. You can use the function to filter query results in a WHERE clause.

The following example returns key-value pairs that match a WHERE clause and displays them in separate columns:

CREATE TABLE pets (v variant);

INSERT INTO pets SELECT PARSE_JSON ('{"species":"dog", "name":"Fido", "is_dog":"true"} ');
INSERT INTO pets SELECT PARSE_JSON ('{"species":"cat", "name":"Bubby", "is_dog":"false"}');
INSERT INTO pets SELECT PARSE_JSON ('{"species":"cat", "name":"dog terror", "is_dog":"false"}');

SELECT a.v, b.key, b.value FROM pets a,LATERAL FLATTEN(input => a.v) b
WHERE b.value LIKE '%dog%';

+-------------------------+---------+--------------+
| V                       | KEY     | VALUE        |
|-------------------------+---------+--------------|
| {                       | species | "dog"        |
|   "is_dog": "true",     |         |              |
|   "name": "Fido",       |         |              |
|   "species": "dog"      |         |              |
| }                       |         |              |
| {                       | name    | "dog terror" |
|   "is_dog": "false",    |         |              |
|   "name": "dog terror", |         |              |
|   "species": "cat"      |         |              |
| }                       |         |              |
+-------------------------+---------+--------------+

Using FLATTEN to List Distinct Key Names

When working with unfamiliar semi-structured data, you might not know the key names in an object. You can use the FLATTEN function with the RECURSIVE argument to return the list of distinct key names in all nested elements in an object:

SELECT REGEXP_REPLACE(f.path, '\\[[0-9]+\\]', '[]') AS "Path",
  TYPEOF(f.value) AS "Type",
  COUNT(*) AS "Count"
FROM <table>,
LATERAL FLATTEN(<variant_column>, RECURSIVE=>true) f
GROUP BY 1, 2 ORDER BY 1, 2;

The REGEXP_REPLACE function removes the array index values (e.g. [0]) and replaces them with brackets ([]) to group array elements.

For example:

{"a": 1, "b": 2, "special" : "data"}   <--- row 1 of VARIANT column
{"c": 3, "d": 4, "normal" : "data"}    <----row 2 of VARIANT column

Output from query:

+---------+---------+-------+
| Path    | Type    | Count |
|---------+---------+-------|
| a       | INTEGER |     1 |
| b       | INTEGER |     1 |
| c       | INTEGER |     1 |
| d       | INTEGER |     1 |
| normal  | VARCHAR |     1 |
| special | VARCHAR |     1 |
+---------+---------+-------+

Using FLATTEN to List Paths in Objects

Related to Using FLATTEN to List Distinct Key Names, you can use the FLATTEN function with the RECURSIVE argument to retrieve all keys and paths in an object.

The following query returns keys, paths, and values (including VARIANT “null” values) for all data types stored in objects:

SELECT
  t.<variant_column>,
  f.seq,
  f.key,
  f.path,
  REGEXP_COUNT(f.path,'\\.|\\[') +1 AS Level,
  TYPEOF(f.value) AS "Type",
  f.index,
  f.value AS "Current Level Value",
  f.this AS "Above Level Value"
FROM <table> t,
LATERAL FLATTEN(t.<variant_column>, recursive=>true) f;

The following query is similar to the first query, but excludes nested objects and arrays:

SELECT
  t.<variant_column>,
  f.seq,
  f.key,
  f.path,
  REGEXP_COUNT(f.path,'\\.|\\[') +1 AS Level,
  TYPEOF(f.value) AS "Type",
  f.value AS "Current Level Value",
  f.this AS "Above Level Value"
FROM <table> t,
LATERAL FLATTEN(t.<variant_column>, recursive=>true) f
WHERE TYPE NOT IN ('OBJECT','ARRAY');

The queries return the following values:

<variant_column>
Object stored as a row in the VARIANT column.
Seq
Unique sequence number associated with the data in the row.
Key
String associated with a value in the data structure.
Path
Path to the element within the data structure.
Level
Level of the key-value pair within the data structure.
Type
Data type for the value.
Index
Index of the element in the data structure. Applies to arrays only; otherwise NULL.
Current Level Value
Value at the current level in the data structure.
Above Level Value
Value one level higher in the data structure.