Overview of the Spark Connector

The Snowflake Connector for Spark enables using Snowflake as an Apache Spark data source, similar to other data sources (PostgreSQL, HDFS, S3, etc.).

Interaction Between Snowflake and Spark

The connector supports bi-directional data movement between a Snowflake cluster and a Spark cluster. The Spark cluster can be self-hosted or accessed through another service, such as Qubole, AWS EMR, or Databricks.

Using the connector, you can perform the following operations:

  • Populate a Spark DataFrame from a table (or query) in Snowflake.
  • Write the contents of a Spark DataFrame to a table in Snowflake.
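
For example, a minimal Scala sketch of these two operations might look like the following. The connection values are placeholders; the data source name and the option names shown (sfURL, sfUser, sfPassword, sfDatabase, sfSchema, sfWarehouse, dbtable, query) are the connector's standard connection options, described in Setting Configuration Options for the Connector.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object SparkConnectorSketch {
      // Data source name used to select the Snowflake Connector for Spark.
      val SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("snowflake-connector-sketch").getOrCreate()

        // Placeholder connection options; replace with your own account values.
        val sfOptions = Map(
          "sfURL"       -> "<account_identifier>.snowflakecomputing.com",
          "sfUser"      -> "<user>",
          "sfPassword"  -> "<password>",
          "sfDatabase"  -> "<database>",
          "sfSchema"    -> "<schema>",
          "sfWarehouse" -> "<warehouse>"
        )

        // Populate a Spark DataFrame from a Snowflake table
        // (alternatively, use the "query" option instead of "dbtable").
        val df = spark.read
          .format(SNOWFLAKE_SOURCE_NAME)
          .options(sfOptions)
          .option("dbtable", "SOURCE_TABLE")
          .load()

        // Write the contents of the Spark DataFrame to a Snowflake table.
        df.write
          .format(SNOWFLAKE_SOURCE_NAME)
          .options(sfOptions)
          .option("dbtable", "TARGET_TABLE")
          .mode(SaveMode.Overwrite)
          .save()
      }
    }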

The connector is implemented in Scala (2.1x, for example 2.11 or 2.12) and uses the Snowflake JDBC driver to communicate with Snowflake.

Note

The Snowflake Connector for Spark is not strictly required to connect Snowflake and Apache Spark; other 3rd-party JDBC drivers can be used. However, we recommend using the Snowflake Connector for Spark because the connector, in conjunction with the Snowflake JDBC driver, has been optimized for transferring large amounts of data between the two systems. It also provides enhanced performance by supporting query pushdown from Spark into Snowflake.
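
As a sketch of that alternative path, Spark's built-in JDBC data source can read from Snowflake using only the Snowflake JDBC driver. The connection details below are placeholders, an existing SparkSession (spark) is assumed, and this approach does not benefit from the connector's bulk-transfer and pushdown optimizations.

    // Reading from Snowflake through Spark's generic JDBC source, without the
    // Snowflake Connector for Spark. Assumes the Snowflake JDBC driver is on the classpath.
    val jdbcDf = spark.read
      .format("jdbc")
      .option("url",
        "jdbc:snowflake://<account_identifier>.snowflakecomputing.com/" +
        "?db=<database>&schema=<schema>&warehouse=<warehouse>")
      .option("driver", "net.snowflake.client.jdbc.SnowflakeDriver")
      .option("user", "<user>")
      .option("password", "<password>")
      .option("dbtable", "SOURCE_TABLE")
      .load()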

Data Transfer

The Snowflake Connector for Spark supports two transfer modes: external and internal.

  • External transfers use a temporary storage location specified by the user.
  • Internal transfers use a temporary location managed by Snowflake.

Use an external transfer if either of the following is true:

  • You are using version 2.1.x or lower of the Spark Connector (which does not support internal transfers), or
  • Your transfer is likely to take 36 hours or more (internal transfers use temporary credentials that expire after 36 hours).

If neither of these conditions is true, we recommend using an internal data transfer.

Internal Data Transfers

The transfer of data between the two systems is facilitated through a Snowflake internal stage that the connector automatically creates and manages:

  • Upon connecting to Snowflake and initializing a session in Snowflake, the connector creates the internal stage.
  • Throughout the duration of the Snowflake session, the connector uses the stage to store data while transferring it to its destination.
  • At the end of the Snowflake session, the connector drops the stage, thereby removing all the temporary data in the stage.

AWS: The internal data transfer mode is supported only in version 2.2.0 (and higher) of the connector.
Azure: The internal data transfer mode is supported only in version 2.4.0 (and higher) of the connector.
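
As a sketch, an internal transfer requires no storage-related options at all. Reusing the df and sfOptions values from the earlier sketch, a write with only the standard connection options lets the connector stage the data in the internal stage it creates and drops for the session.

    // No user-specified temporary storage location is configured, so the connector
    // performs an internal transfer: it stages the data in a temporary internal
    // stage that it creates for the session and drops when the session ends.
    df.write
      .format("net.snowflake.spark.snowflake")
      .options(sfOptions)                      // standard connection options only
      .option("dbtable", "TARGET_TABLE")
      .mode(SaveMode.Append)
      .save()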

External Data Transfers

The transfer of data between the two systems is facilitated through a temporary storage location that the user specifies.

AWS: This storage location is an S3 bucket.
Azure: This storage location is an Azure container. External transfers via Azure are supported only in version 2.4.0 (and higher) of the connector.

The parameters for specifying the storage location are documented in Setting Configuration Options for the Connector.

For external transfers, the Spark connector does not automatically delete the file(s) from this temporary storage location. There are three ways to delete the temporary files:

  • Delete them manually.
  • Set the connector’s purge parameter (for more information about this parameter, see Setting Configuration Options for the Connector).
  • Set a storage system parameter, such as the Amazon S3 lifecycle policy parameter, to clean up the files after the transfer is done.

For external transfers, the storage location must be created and configured as part of the connector installation/configuration.
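
As a sketch of an external transfer on AWS, reusing spark, df, and sfOptions from the earlier sketch: the S3 bucket, prefix, and credentials are placeholders, the credentials are supplied through the Hadoop configuration as one possible approach, and tempdir (the storage location option) and purge are connector options whose exact names and values are described in Setting Configuration Options for the Connector.

    // Credentials for the user-managed S3 bucket (placeholders; one possible approach).
    spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<aws_access_key_id>")
    spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<aws_secret_access_key>")

    df.write
      .format("net.snowflake.spark.snowflake")
      .options(sfOptions)
      .option("dbtable", "TARGET_TABLE")
      // External transfer: stage the data in a user-managed S3 location.
      .option("tempdir", "s3n://<bucket>/<temp_prefix>")
      // Ask the connector to delete its temporary files when the transfer completes.
      .option("purge", "on")
      .mode(SaveMode.Append)
      .save()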

Column Mapping

When you copy data from a Spark table to a Snowflake table, if the column names do not match, you can map column names from Spark to Snowflake by using the columnmap parameter, which is documented in Setting Configuration Options for the Connector.

Column mapping is supported only for internal data transfers.
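
For example, a sketch of a write that maps DataFrame columns one and two to Snowflake columns ONE and TWO, reusing df and sfOptions from the earlier sketch; the columnmap value is shown as the string form of a Scala Map, and the exact value format is described in Setting Configuration Options for the Connector.

    // Map differently named columns from the Spark DataFrame to the Snowflake table
    // during an internal-transfer write. Column and table names are placeholders.
    df.write
      .format("net.snowflake.spark.snowflake")
      .options(sfOptions)
      .option("dbtable", "TARGET_TABLE")
      .option("columnmap", Map("one" -> "ONE", "two" -> "TWO").toString())
      .mode(SaveMode.Append)
      .save()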

Query Pushdown

Version 2.1.0 (and higher) of the Snowflake Connector for Spark provides enhanced performance through query pushdown. For optimal performance, you typically want to avoid reading large amounts of data or transferring large intermediate results between systems. Ideally, most of the processing should happen close to where the data is stored, leveraging the capabilities of the participating stores to dynamically eliminate data that is not needed.

Query pushdown leverages these performance efficiencies by enabling large and complex Spark logical plans (in their entirety or in parts) to be processed in Snowflake, thus using Snowflake to do most of the actual work.
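
For example, pushdown can be toggled for a session through the connector's SnowflakeConnectorUtils object (a sketch; an existing SparkSession named spark is assumed).

    import net.snowflake.spark.snowflake.SnowflakeConnectorUtils

    // Enable query pushdown for this Spark session, so that eligible parts of the
    // logical plan are translated to SQL and executed in Snowflake.
    SnowflakeConnectorUtils.enablePushdownSession(spark)

    // Pushdown can be turned off again, for example when comparing query plans.
    SnowflakeConnectorUtils.disablePushdownSession(spark)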

Qubole Integration

Qubole has integrated the Snowflake Connector for Spark into the Qubole Data Service (QDS) ecosystem to provide native connectivity between Spark and Snowflake. Through this integration, Snowflake can be added as a Spark data store directly in Qubole.

Once Snowflake has been added as a Spark data store, data engineers and data scientists can use Spark and the QDS UI, API, and Notebooks to:

  • Perform advanced data transformations, such as preparing and consolidating external data sources into Snowflake, or refining and transforming Snowflake data.
  • Build, train, and execute machine learning and AI models in Spark using the data that already exists in Snowflake.

For more details, see Configuring Snowflake for Spark in Qubole or the Qubole-Snowflake Integration Guide (Qubole Documentation).