Overview of the Spark Connector¶
The Snowflake Connector for Spark enables using Snowflake as an Apache Spark data source, similar to other data sources (PostgreSQL, HDFS, S3, etc.).
In this Topic:
Interaction Between Snowflake and Spark¶
The connector supports bi-directional data movement between a Snowflake cluster and a Spark cluster. The Spark cluster can be self-hosted or accessed through another service, such as Qubole, AWS EMR, or Databricks.
Using the connector, you can perform the following operations:
- Populate a Spark DataFrame from a table (or query) in Snowflake.
- Write the contents of a Spark DataFrame to a table in Snowflake.
The connector uses Scala 2.1.x to perform these operations and uses the Snowflake JDBC driver to communicate with Snowflake.
The Snowflake Connector for Spark is not strictly required to connect Snowflake and Apache Spark; other 3rd-party JDBC drivers can be used. However, we recommend using the Snowflake Connector for Spark because the connector, in conjunction with the Snowflake JDBC driver, has been optimized for transferring large amounts of data between the two systems. It also provides enhanced performance by supporting query pushdown from Spark into Snowflake.
The Snowflake Spark Connector supports two transfer modes: external and internal.
- External transfers use a temporary storage location specified by the user.
- Internal transfers use a temporary location managed by Snowflake.
Use an external transfer if either of the following is true:
- You are using version 2.1.x or lower of the Spark Connector (which does not support internal transfers), or
- Your transfer is likely to take 36 hours or more (internal transfers use temporary credentials that expire after 36 hours).
If neither of these conditions is true, we recommend using an internal data transfer.
Internal Data Transfers¶
The transfer of data between the two systems is facilitated through a Snowflake internal stage that the connector automatically creates and manages:
- Upon connecting to Snowflake and initializing a session in Snowflake, the connector creates the internal stage.
- Throughout the duration of the Snowflake session, the connector uses the stage to store data while transferring it to its destination.
- At the end of the Snowflake session, the connector drops the stage, thereby removing all the temporary data in the stage.
|AWS:||The internal data transfer mode is supported only in version 2.2.0 (and higher) of the connector.|
|Azure:||The internal data transfer mode is supported only in version 2.4.0 (and higher) of the connector.|
External Data Transfers¶
The transfer of data between the two systems is facilitated through a temporary storage location that the user specifies.
|AWS:||This storage location is an S3 bucket.|
|Azure:||This storage location is an Azure container. External transfers via Azure are supported only in version 2.4.0 (and higher) of the connector.|
The parameter(s) to specify the storage location are documented in Setting Configuration Options for the Connector
For external transfers, the Spark connector does not automatically delete the file(s) from this temporary storage location. There are three ways to delete the temporary files:
- delete them manually
- set the connector’s
purgeparameter (for more information about this parameter, see Setting Configuration Options for the Connector)
- set a storage system parameter, such as the Amazon S3 lifecycle policy parameter, to clean up the files after the transfer is done.
For external transfers, the storage location must be created and configured as part of the connector installation/configuration.
When you copy data from a Spark table to a Snowflake table, if the
column names do not match, you can map column names from Spark to
Snowflake by using the
columnmapping parameter, which is documented
in Setting Configuration Options for the Connector.
Column mapping is supported only for internal storage.
Version 2.1.0 (and higher) of the Snowflake Connector for Spark provides enhanced performance through query pushdown. For optimal performance, you typically want to avoid reading lots of data or transferring large intermediate results between systems. Ideally, most of the processing should happen close to where the data is stored to leverage the capabilities of the participating stores to dynamically eliminate data that is not needed.
Query pushdown leverages these performance efficiencies by enabling large and complex Spark logical plans (in their entirety or in parts) to be processed in Snowflake, thus using Snowflake to do most of the actual work.
Qubole has integrated the Snowflake Connector for Spark into the Qubole Data Service (QDS) ecosystem to provide native connectivity between Spark and Snowflake. Through this integration, Snowflake can be added as a Spark data store directly in Qubole.
Once Snowflake has been added as a Spark data store, data engineers and data scientists can use Spark and the QDS UI, API, and Notebooks to:
- Perform advanced data transformations, such as preparing and consolidating external data sources into Snowflake, or refining and transforming Snowflake data.
- Build, train, and execute machine learning and AI models in Spark using the data that already exists in Snowflake.