Introduction to External Tables

In a typical table, the data is stored in the database; however, in an external table, the data is stored in files in an external stage. External tables store file-level metadata about the data files, such as the filename, a version identifier and related properties. This enables querying data stored in files in an external stage as if it were inside a database. External tables can access data stored in any format supported by COPY INTO <table> statements.

External tables are read-only; therefore, no DML operations can be performed on them. However, external tables can be used for query and join operations, and views can be created against them.

Querying data stored external to the database is likely to be slower than querying native database tables; however, materialized views based on external tables can improve query performance.

Planning the Schema of an External Table

This section describes the options available for designing external tables.

Schema on Read

All external tables include the following columns:

VALUE

A VARIANT type column that represents a single row in the external file.

METADATA$FILENAME

A pseudocolumn that identifies the name of each staged data file included in the external table, including its path in the stage.

To create external tables, you need only know the file format and record format of the source data files; knowledge of the schema of the data files is not required. When queried, external tables cast all regular or semi-structured data to a variant in the VALUE column.
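For example, assuming an external table created over JSON-formatted files (the table and field names below are hypothetical), the VALUE column can be queried with standard semi-structured path notation:

```sql
-- Each row of VALUE holds one record from the staged files.
SELECT value:customer_id::NUMBER AS customer_id,
       value:order_total::FLOAT  AS order_total,
       metadata$filename         AS source_file
FROM my_ext_table;
```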

Virtual Columns

If you are familiar with the schema of the source data files, you can create additional virtual columns as expressions using the VALUE column and/or the METADATA$FILENAME pseudocolumn. When the external data is scanned, the data types of any specified fields or semi-structured data elements in the data file must match the data types of these additional columns in the external table. This allows strong type checking and schema validation over the external data.
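As a sketch, assuming JSON source files with known customer_id and order_date fields (all identifiers here are hypothetical), virtual columns can be defined as typed expressions over the VALUE column and the METADATA$FILENAME pseudocolumn:

```sql
CREATE EXTERNAL TABLE sales_ext (
    customer_id NUMBER AS (value:customer_id::NUMBER),
    order_date  DATE   AS (TO_DATE(value:order_date::STRING)),
    source_file STRING AS (metadata$filename)
  )
  LOCATION = @my_stage/sales/
  FILE_FORMAT = (TYPE = JSON);
```

Queries can then select customer_id and order_date as ordinary typed columns, with the strong type checking described above applied during the scan.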

Partitioned External Tables

We strongly recommend partitioning your external tables, which requires that your underlying data is organized using logical paths that include date, time, country, or similar dimensions in the path. Partitioning divides your external table data into multiple parts using partition columns. Partition columns must evaluate as expressions that parse the path and/or filename information stored in the METADATA$FILENAME pseudocolumn. A partition consists of all data files that match the path and/or filename in the expression for the partition column.

An external table definition can include multiple partition columns, which impose a multi-dimensional structure on the external data.

Benefits of partitioning include improved query performance. Because the external data is partitioned into separate slices/parts, query response time is faster when processing a small part of the data instead of scanning the entire data set.

Partition columns can be defined when an external table is created (using CREATE EXTERNAL TABLE) or at a later time (using ALTER EXTERNAL TABLE) using the PARTITIONED BY clause, with a list of column definitions for partitioning.

Syntax

Partition an external table by defining partition columns:

CREATE EXTERNAL TABLE
  <table_name>
     ( <part_col_name> <col_type> AS <part_expr>
     [ , ... ] )
  [ PARTITION BY ( <identifier> [, <identifier> ... ] ) ]
  ...

For examples, see CREATE EXTERNAL TABLE.
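As an illustrative sketch (the stage name and path layout are hypothetical), an external table over files staged under date-organized paths such as logs/2019/08/14/ could derive a partition column from METADATA$FILENAME:

```sql
CREATE EXTERNAL TABLE logs_ext (
    -- Adjust the SPLIT_PART indices to match your actual path layout.
    event_date DATE AS TO_DATE(
        SPLIT_PART(metadata$filename, '/', 2) || '-' ||
        SPLIT_PART(metadata$filename, '/', 3) || '-' ||
        SPLIT_PART(metadata$filename, '/', 4), 'YYYY-MM-DD')
  )
  PARTITION BY (event_date)
  LOCATION = @my_stage/logs/
  FILE_FORMAT = (TYPE = JSON);
```

Queries that filter on event_date then scan only the files in the matching partitions rather than the entire data set.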

Adding or Dropping Columns

Alter an existing external table to add or remove columns using the following ALTER TABLE syntax:

  • Add columns: ALTER TABLE … ADD COLUMN.

  • Remove columns: ALTER TABLE … DROP COLUMN.
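For example (the table and column names are hypothetical), a typed virtual column can be added and later removed:

```sql
ALTER TABLE sales_ext ADD COLUMN region STRING AS (value:region::STRING);
ALTER TABLE sales_ext DROP COLUMN region;
```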

Note

The default VALUE and METADATA$FILENAME columns cannot be dropped.

Materialized Views over External Tables

In many cases, materialized views over external tables can provide performance that is faster than equivalent queries over the underlying external table. This performance difference can be significant when a query is run frequently or is sufficiently complex.
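As a sketch (all identifiers hypothetical), a materialized view can pre-aggregate an external table so that frequently run queries avoid rescanning the staged files:

```sql
CREATE MATERIALIZED VIEW daily_counts AS
  SELECT value:event_date::DATE AS event_date,
         COUNT(*)               AS event_count
  FROM my_ext_table
  GROUP BY value:event_date::DATE;
```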

Automatically Refreshing External Table Metadata

The metadata for an external table can be refreshed automatically using the event notification service for each storage location (e.g. Amazon S3 event notifications for S3 stages, Microsoft Event Grid for Azure stages).

The refresh operation synchronizes the metadata with the latest set of associated files in the external stage and path, i.e.:

  • New files in the path are added to the table metadata.

  • Changes to files in the path are updated in the table metadata.

  • Files no longer in the path are removed from the table metadata.

Note

Currently, the ability to automatically refresh the metadata is not available for external tables that reference Google Cloud Storage stages.

As a workaround, we suggest following our best practices for staging your data files and periodically executing an ALTER EXTERNAL TABLE … REFRESH statement to register any missed files. For satisfactory performance, we also recommend using a selective path prefix with ALTER EXTERNAL TABLE to reduce the number of files that need to be listed and checked to determine whether they have already been registered (e.g. bucket_name/YYYY/MM/DD/ or even bucket_name/YYYY/MM/DD/HH/ depending on your volume).
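For example, with files organized under date-based paths (the table name and path layout are hypothetical), the refresh can be restricted to a single day's prefix:

```sql
ALTER EXTERNAL TABLE my_ext_table REFRESH '2019/08/14/';
```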

Attention

AWS S3 only. External tables rely on Amazon S3 event notifications to determine when new files arrive in monitored S3 buckets so the files can be registered in the metadata. Currently, AWS cannot guarantee reliable delivery of event notifications to consumers. Therefore, until AWS can provide reliability guarantees for S3 event notifications, Snowflake cannot guarantee registration of all files staged to monitored S3 buckets.

For now, we recommend periodically executing an ALTER EXTERNAL TABLE … REFRESH statement to synchronize the metadata with the latest set of associated files in the external stage and path. For a workaround, see the above note in this section.

Billing for Refreshing External Table Metadata

Your charges include an overhead for managing event notifications for the files added in cloud storage. This overhead increases in relation to the number of files added in cloud storage for the external stages and paths specified for your external tables. Snowflake charges 0.06 credits per 1000 event notifications received.

In addition, when external tables are generally available (date TBD), a small maintenance overhead will be charged for manually refreshing the external table metadata (using ALTER EXTERNAL TABLE … REFRESH). This overhead covers the resources consumed when retrieving the list of objects, along with descriptive information about each object, in the external stage and path. The rate for this overhead has not been set yet.

Workflow

AWS S3

This section provides a high-level overview of the setup and load workflow for external tables that reference AWS S3 stages. For complete instructions, see Refreshing External Tables Automatically for AWS S3.

  1. Create a named stage object (using CREATE STAGE) that references the external location (i.e. S3 bucket) where your data files are staged.

  2. Create an external table (using CREATE EXTERNAL TABLE) that references the named stage.

  3. Manually refresh the external table metadata using ALTER EXTERNAL TABLE … REFRESH to synchronize the metadata with the current list of files in the stage path. This step also verifies the settings in your external table definition.

  4. Configure an event notification for the S3 bucket. Snowflake relies on event notifications to continually refresh the external table metadata to maintain consistency with the staged files.

  5. Manually refresh the external table metadata one more time using ALTER EXTERNAL TABLE … REFRESH to synchronize the metadata with any changes that occurred since Step 3. Thereafter, the S3 event notifications trigger the metadata refresh automatically.

  6. Configure Snowflake access control privileges for any additional roles to grant them query access to the external table.

Periodically refresh the external table metadata using ALTER EXTERNAL TABLE … REFRESH to synchronize the external table metadata with the stage to register any missed files.
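The steps above can be sketched in SQL as follows (the stage URL, storage integration, table, and role names are hypothetical; the S3 event notification in Step 4 is configured in AWS, outside of Snowflake):

```sql
-- Step 1: named stage over the S3 bucket
CREATE STAGE my_s3_stage
  URL = 's3://my-bucket/files/'
  STORAGE_INTEGRATION = my_s3_int;

-- Step 2: external table over the stage
CREATE EXTERNAL TABLE my_ext_table
  LOCATION = @my_s3_stage
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = TRUE;

-- Steps 3 and 5: manual refreshes before and after
-- configuring the S3 event notification
ALTER EXTERNAL TABLE my_ext_table REFRESH;

-- Step 6: grant query access to an additional role
GRANT SELECT ON EXTERNAL TABLE my_ext_table TO ROLE analyst_role;
```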

Google Cloud Storage

This section provides a high-level overview of the setup and load workflow for external tables that reference Google Cloud Storage (GCS) stages.

  1. Create a named stage object (using CREATE STAGE) that references the external location (i.e. GCS bucket) where your data files are staged.

  2. Create an external table (using CREATE EXTERNAL TABLE) that references the named stage.

  3. Configure Snowflake access control privileges for any additional roles to grant them query access to the external table.

  4. Periodically refresh the external table metadata using ALTER EXTERNAL TABLE … REFRESH to synchronize the metadata with the current list of files in the stage path. This step also verifies the settings in your external table definition.

Microsoft Azure

This section provides a high-level overview of the setup and load workflow for external tables that reference Azure stages. For complete instructions, see Refreshing External Tables Automatically for Azure Blob Storage.

  1. Configure an Event Grid subscription for Azure Storage events.

  2. Create a notification integration in Snowflake. A notification integration is a Snowflake object that provides an interface between Snowflake and third-party cloud message queuing services such as Microsoft Event Grid.

  3. Create a named stage object (using CREATE STAGE) that references the external location (i.e. Azure container) where your data files are staged.

  4. Create an external table (using CREATE EXTERNAL TABLE) that references the named stage and integration.

  5. Manually refresh the external table metadata once using ALTER EXTERNAL TABLE … REFRESH to synchronize the metadata with any changes that occurred since Step 4. Thereafter, the Event Grid notifications trigger the metadata refresh automatically.

  6. Configure Snowflake access control privileges for any additional roles to grant them query access to the external table.
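Steps 2 through 4 might look like the following sketch (the integration name, queue URI, tenant ID, and stage details are hypothetical placeholders taken from your Event Grid setup):

```sql
-- Step 2: notification integration over the Event Grid storage queue
CREATE NOTIFICATION INTEGRATION azure_events_int
  ENABLED = TRUE
  TYPE = QUEUE
  NOTIFICATION_PROVIDER = AZURE_STORAGE_QUEUE
  AZURE_STORAGE_QUEUE_PRIMARY_URI = 'https://myaccount.queue.core.windows.net/myqueue'
  AZURE_TENANT_ID = '<tenant_id>';

-- Step 3: named stage over the Azure container
CREATE STAGE my_azure_stage
  URL = 'azure://myaccount.blob.core.windows.net/mycontainer/files/'
  STORAGE_INTEGRATION = my_azure_int;

-- Step 4: external table that references the stage and integration
CREATE EXTERNAL TABLE my_azure_ext_table
  INTEGRATION = 'AZURE_EVENTS_INT'
  LOCATION = @my_azure_stage
  FILE_FORMAT = (TYPE = JSON)
  AUTO_REFRESH = TRUE;
```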

Apache Hive Metastore Integration

Snowflake supports integrating Apache Hive metastores with Snowflake using external tables. The Hive connector detects metastore events and transmits them to Snowflake to keep the external tables synchronized with the Hive metastore. This allows users to manage their data in Hive while querying it from Snowflake.

For instructions, see Integrating Apache Hive Metastores with Snowflake.

External Table DDL

To support creating and managing external tables, Snowflake provides the following set of special DDL commands:

  • CREATE EXTERNAL TABLE

  • ALTER EXTERNAL TABLE

  • DROP EXTERNAL TABLE

  • SHOW EXTERNAL TABLES

  • DESCRIBE EXTERNAL TABLE

Required Access Privileges

Creating and managing external tables requires a role with, at a minimum, the following privileges:

Object                               Privilege
Database                             USAGE
Schema                               USAGE, CREATE STAGE (if creating a new stage), CREATE EXTERNAL TABLE
Stage (if using an existing stage)   USAGE

Information Schema

The Snowflake Information Schema includes views and table functions you can query to retrieve information about your external tables and their staged data files.

View

EXTERNAL_TABLES View

Retrieve a list of external tables in the specified (or current) database.

Table Functions

EXTERNAL_TABLE_FILES

Retrieve information about the staged data files included in the metadata for a specified external table.

EXTERNAL_TABLE_FILE_REGISTRATION_HISTORY

Retrieve information about the metadata history for an external table, including any errors found when refreshing the metadata.
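For example (the database and table names are hypothetical):

```sql
-- List the external tables in a database
SELECT table_name, location
FROM my_db.information_schema.external_tables;

-- List the files currently registered in an external table's metadata
SELECT *
FROM TABLE(my_db.information_schema.external_table_files(table_name => 'my_ext_table'));
```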