Introduction to Semi-structured Data

Semi-structured data is data that does not conform to the standards of traditional structured data, but it contains tags or other types of markup that identify individual, distinct entities within the data.

Two of the key attributes that distinguish semi-structured data from structured data are the lack of a fixed schema and nested data structures:

  • Structured data requires a fixed schema that is defined before the data can be loaded and queried in a relational database system. Semi-structured data does not require a schema to be defined in advance and can evolve constantly; that is, new attributes can be added at any time.

    In addition, entities within the same class may have different attributes even though they are grouped together, and the order of the attributes is not important.

  • Unlike structured data, which represents data as a flat table, semi-structured data can contain n-level hierarchies of nested information.

What is JSON?

JSON (JavaScript Object Notation) is a lightweight, plain-text, data-interchange format based on a subset of the JavaScript Programming Language.

JSON data can be produced by any application. Some common examples include:

  • JavaScript applications using native methods to generate JSON.
  • Non-JavaScript applications using libraries (usually with extensions) to generate JSON data.
  • Ad hoc JavaScript generators.
  • Concatenation of JSON documents (which may or may not be line-separated).

Because there is no formal specification, there are significant differences between various implementations. These differences make importing JSON-like data sets impossible if the JSON parser is strict in its language definition. To make importing JSON data sets as problem-free as possible, Snowflake follows the rule “be liberal in what you accept”. The intent is to accept the widest possible range of JSON and JSON-like inputs that permit unambiguous interpretation.
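
For example, JSON data is typically loaded into a single VARIANT column, one row per top-level document. The following is a minimal sketch using hypothetical stage, file, and table names; the STRIP_OUTER_ARRAY option loads each element of an enclosing array as a separate row and can be omitted for line-separated documents:

-- Load JSON documents into a single VARIANT column (names are hypothetical).
CREATE OR REPLACE TABLE json_raw (src VARIANT);

COPY INTO json_raw
  FROM @my_json_stage/employees.json
  FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE);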

This topic describes the syntax for JSON documents accepted by Snowflake.

For more information about JSON, see json.org.

Basic JSON Syntax

JSON data is a hierarchical collection of name/value pairs grouped into objects and arrays:

  • Colons : separate names and values in name/value pairs.
  • Curly braces {} denote objects.
  • Square brackets [] denote arrays.
  • Commas , separate entities in objects and arrays.

Name/Value Pairs

JSON name/value pairs consist of a field name (in double quotes), followed by a colon, then a value.

For example:

"firstName":"John"

"empid":45611

Supported Data Types

A value in a name/value pair can be:

  • A number (integer or floating point)
  • A string (in double quotes)
  • A Boolean (true or false)
  • An array (in square brackets)
  • An object (in curly braces)
  • Null
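
As a sketch of how these value types surface in Snowflake, the TYPEOF function reports the type of each value stored in a VARIANT (the column and alias names here are arbitrary):

SELECT TYPEOF(v:num)  AS num_type,   -- INTEGER
       TYPEOF(v:str)  AS str_type,   -- VARCHAR
       TYPEOF(v:flag) AS flag_type,  -- BOOLEAN
       TYPEOF(v:arr)  AS arr_type,   -- ARRAY
       TYPEOF(v:obj)  AS obj_type,   -- OBJECT
       TYPEOF(v:none) AS none_type   -- NULL_VALUE
FROM (SELECT PARSE_JSON(
        '{"num": 45611, "str": "John", "flag": true,
          "arr": [1, 2, 3], "obj": {"a": 1}, "none": null}') AS v) AS t;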

Objects

JSON objects are written inside curly braces. An object can contain multiple name/value pairs, separated by commas. For example:

{"firstName":"John", "lastName":"Doe"}

Arrays

JSON arrays are written inside square brackets. An array can contain multiple values, such as objects, separated by commas. For example:

"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
  ]
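
Array elements can be expanded into rows with the FLATTEN table function; the following is a minimal sketch over the array above:

-- Produce one row per element of the employees array.
SELECT e.value:firstName::STRING AS first_name,
       e.value:lastName::STRING  AS last_name
FROM (SELECT PARSE_JSON(
        '{"employees":[
           {"firstName":"John",  "lastName":"Doe"},
           {"firstName":"Anna",  "lastName":"Smith"},
           {"firstName":"Peter", "lastName":"Jones"}]}') AS v) AS t,
     LATERAL FLATTEN(input => v:employees) e;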

Examples of JSON Documents

FILE NAME: json_sample_data1

Contains an array with 3 simple employee records (objects):

{"root":[{"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
]}]}

FILE NAME: json_sample_data2

Contains an array with 3 employee records (objects) and their associated dependent data (children, the children’s names and ages, cities where the employee has lived, and the years that the employee has lived in those cities):

{"root":
   [
    { "kind": "person",
      "fullName": "John Doe",
      "age": 22,
      "gender": "Male",
      "phoneNumber":
        {"areaCode": "206",
         "number": "1234567"},
      "children":
         [
           {
             "name": "Jane",
             "gender": "Female",
             "age": "6"
           },
           {
              "name": "John",
              "gender": "Male",
              "age": "15"
           }
         ],
      "citiesLived":
         [
            {
               "place": "Seattle",
               "yearsLived": ["1995"]
            },
            {
               "place": "Stockholm",
               "yearsLived": ["2005"]
            }
         ]
      },

      {"kind": "person", "fullName": "Mike Jones", "age": 35, "gender": "Male", "phoneNumber": { "areaCode": "622", "number": "1567845"}, "children": [{ "name": "Earl", "gender": "Male", "age": "10"}, {"name": "Sam", "gender": "Male", "age": "6"}, { "name": "Kit", "gender": "Male", "age": "8"}], "citiesLived": [{"place": "Los Angeles", "yearsLived": ["1989", "1993", "1998", "2002"]}, {"place": "Washington DC", "yearsLived": ["1990", "1993", "1998", "2008"]}, {"place": "Portland", "yearsLived": ["1993", "1998", "2003", "2005"]}, {"place": "Austin", "yearsLived": ["1973", "1998", "2001", "2005"]}]},

      {"kind": "person", "fullName": "Anna Karenina", "age": 45, "gender": "Female", "phoneNumber": { "areaCode": "425", "number": "1984783"}, "citiesLived": [{"place": "Stockholm", "yearsLived": ["1992", "1998", "2000", "2010"]}, {"place": "Russia", "yearsLived": ["1998", "2001", "2005"]}, {"place": "Austin", "yearsLived": ["1995", "1999"]}]
    }]
  }
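
Assuming json_sample_data2 has been loaded into a table with a single VARIANT column named src (the table and column names here are hypothetical), nested records can be expanded with FLATTEN; for example, listing each employee with their children:

SELECT p.value:fullName::STRING AS full_name,
       c.value:name::STRING     AS child_name,
       c.value:age::NUMBER      AS child_age
FROM json_sample t,
     LATERAL FLATTEN(input => t.src:root) p,
     -- OUTER => TRUE keeps employees with no children (child columns are NULL).
     LATERAL FLATTEN(input => p.value:children, OUTER => TRUE) c;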

What is Avro?

Avro is an open-source data serialization and RPC framework originally developed for use with Apache Hadoop. It utilizes schemas defined in JSON to produce serialized data in a compact binary format. The serialized data can be sent to any destination (i.e. application or program) and can be easily deserialized at the destination because the schema is included in the data.

An Avro schema consists of a JSON string, object, or array that defines the type of schema and the data attributes (field names, data types, etc.) for the schema type. The attributes differ depending on the schema type.

For more information, see avro.apache.org.

Example of an Avro Schema

{
 "type": "record",
 "name": "person",
 "namespace": "example.avro",
 "fields": [
     {"name": "fullName", "type": "string"},
     {"name": "age",  "type": ["int", "null"]},
     {"name": "gender", "type": ["string", "null"]}
     ]
}
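
Avro files are loaded into Snowflake like the other semi-structured formats described in this topic: each record becomes a row in a single VARIANT column. A minimal sketch, with hypothetical stage, file, and table names:

CREATE OR REPLACE TABLE avro_raw (src VARIANT);

COPY INTO avro_raw
  FROM @my_avro_stage/person.avro
  FILE_FORMAT = (TYPE = 'AVRO');

-- Fields defined by the schema are then available via path notation, e.g.:
SELECT src:fullName::STRING AS full_name FROM avro_raw;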

What is ORC?

Used to store Hive data, the ORC (Optimized Row Columnar) file format was designed for efficient compression and improved performance for reading, writing, and processing data over earlier Hive file formats. For more information about ORC, see https://orc.apache.org/.

Snowflake reads ORC data into a single VARIANT column. You can query the data in a VARIANT column just as you would JSON data, using similar commands and functions.

Alternatively, you can extract select columns from a staged ORC file into separate table columns using a CREATE TABLE AS SELECT statement.

ORC is a binary format, so it is not possible to show a raw ORC file; the example below shows ORC data after it has been loaded into a VARIANT column.
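
For example, ORC data might be loaded into a VARIANT column, or extracted directly into typed columns, as in the following sketch (the stage, file, file format, and table names are hypothetical):

CREATE OR REPLACE FILE FORMAT my_orc_format TYPE = 'ORC';

-- Option 1: load the ORC data into a single VARIANT column.
CREATE OR REPLACE TABLE orc_raw (src VARIANT);

COPY INTO orc_raw
  FROM @my_orc_stage/sample.orc
  FILE_FORMAT = (FORMAT_NAME = 'my_orc_format');

-- Option 2: extract selected columns from the staged file with CTAS.
CREATE OR REPLACE TABLE orc_typed AS
  SELECT $1:boolean1::BOOLEAN AS boolean1,
         $1:int1::NUMBER      AS int1
  FROM @my_orc_stage/sample.orc
    (FILE_FORMAT => 'my_orc_format');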

Note

  • The maximum length of ORC binary and string columns is subject to the Snowflake 16MB limit for VARCHAR data.

  • Map data is deserialized into an array of objects, e.g.:

    "map": [{"key": "chani", "value": {"int1": 5, "string1": "chani"}}, {"key": "mauddib", "value": {"int1": 1, "string1": "mauddib"}}]
    
  • Union data is deserialized into a single object, e.g.:

    {"time": "1970-05-05 12:34:56.197", "union": {"tag": 0, "value": 3880900}, "decimal": 3863316326626557453.000000000000000000}
    

Example of ORC Data Loaded into a VARIANT Column

+--------------------------------------+
| SRC                                  |
|--------------------------------------|
| {                                    |
|   "boolean1": false,                 |
|   "byte1": 1,                        |
|   "bytes1": "0001020304",            |
|   "decimal1": 12345678.654745,       |
|   "double1": -1.500000000000000e+01, |
|   "float1": 1.000000000000000e+00,   |
|   "int1": 65536,                     |
|   "list": [                          |
|     {                                |
|       "int1": 3,                     |
|       "string1": "good"              |
|     },                               |
|     {                                |
|       "int1": 4,                     |
|       "string1": "bad"               |
|     }                                |
|   ]                                  |
| }                                    |
+--------------------------------------+
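
Assuming the row above was loaded into the orc_raw table from the earlier sketch, nested values such as the list array can be queried just like JSON:

-- Produce one row per element of the nested list array.
SELECT src:int1::NUMBER        AS int1,
       f.value:string1::STRING AS list_string1
FROM orc_raw,
     LATERAL FLATTEN(input => src:list) f;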

What is Parquet?

Parquet is a compressed, efficient columnar data representation designed for projects in the Hadoop ecosystem. The file format supports complex nested data structures and uses the record shredding and assembly algorithm described in the Dremel paper. For more information, see parquet.apache.org/documentation/latest/.

Snowflake reads Parquet data into a single VARIANT column. You can query the data in a VARIANT column just as you would JSON data, using similar commands and functions.

Alternatively, you can extract select columns from a staged Parquet file into separate table columns using a CREATE TABLE AS SELECT statement.

Parquet is a binary format, so it is not possible to show a raw Parquet file; the example below shows Parquet data after it has been loaded into a VARIANT column.

Example of Parquet Data Loaded into a VARIANT Column

+------------------------------------------+
| SRC                                      |
|------------------------------------------|
| {                                        |
|   "continent": "Europe",                 |
|   "country": {                           |
|     "city": {                            |
|       "bag": [                           |
|         {                                |
|           "array_element": "Paris"       |
|         },                               |
|         {                                |
|           "array_element": "Nice"        |
|         },                               |
|         {                                |
|           "array_element": "Marseilles"  |
|         },                               |
|         {                                |
|           "array_element": "Cannes"      |
|         }                                |
|       ]                                  |
|     },                                   |
|     "name": "France"                     |
|   }                                      |
| }                                        |
+------------------------------------------+
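
Assuming the row above is stored in a table named parquet_raw with the VARIANT column src (hypothetical names), the nested city names can be flattened into rows:

SELECT src:continent::STRING         AS continent,
       src:country.name::STRING      AS country,
       f.value:array_element::STRING AS city
FROM parquet_raw,
     LATERAL FLATTEN(input => src:country.city.bag) f;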

What is XML?

XML (Extensible Markup Language) is a markup language that defines a set of rules for encoding documents. It was originally based on SGML, another markup language developed for standardizing the structure and elements that comprise a document.

Since its introduction, XML has grown beyond an initial focus on documents to encompass a wide range of uses, including representation of arbitrary data structures and serving as the base language for communication protocols. Because of its extensibility, versatility, and usability, it has become one of the most commonly used standards for data interchange on the Web.

An XML document consists primarily of the following constructs:

  • Tags (identified by angle brackets, < and >)
  • Elements

Elements typically consist of a “start” tag and matching “end” tag, with the text between the tags constituting the content for the element. An element can also consist of an “empty-element” tag with no “end” tag. “start” and “empty-element” tags may contain attributes, which help define the characteristics or metadata for the element.

Example of an XML Document

<?xml version="1.0"?>
<!DOCTYPE parts SYSTEM "parts.dtd">
<?xml-stylesheet type="text/css" href="xmlpartsstyle.css"?>
<parts>
   <title>Automobile Parts &amp; Accessories</title>
   <part>
      <item>Spark Plugs</item>
      <partnum>A3-400</partnum>
      <price> 27.00</price>
   </part>
   <part>
      <item>Motor Oil</item>
      <partnum>B5-200</partnum>
      <price> 14.00</price>
   </part>
   <part>
      <item>Motor Oil</item>
      <partnum>B5-300</partnum>
      <price> 16.75</price>
   </part>
   <part>
      <item>Engine Coolant</item>
      <partnum>B6-120</partnum>
      <price> 19.00</price>
   </part>
   <part>
      <item>Engine Coolant</item>
      <partnum>B6-220</partnum>
      <price> 18.25</price>
   </part>
</parts>
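
In Snowflake, XML is likewise loaded into a VARIANT column and queried with XML-aware functions such as PARSE_XML and XMLGET. A minimal sketch over a fragment of the document above (the content of an element is retrieved with the "$" accessor):

-- Extract the first part's fields from a parsed XML fragment.
SELECT XMLGET(XMLGET(v, 'part'), 'item'):"$"::STRING       AS item,
       XMLGET(XMLGET(v, 'part'), 'partnum'):"$"::STRING    AS partnum,
       XMLGET(XMLGET(v, 'part'), 'price'):"$"::NUMBER(8,2) AS price
FROM (SELECT PARSE_XML(
        '<parts>
           <part>
             <item>Spark Plugs</item>
             <partnum>A3-400</partnum>
             <price>27.00</price>
           </part>
         </parts>') AS v) AS t;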