Overview of the Spark Connector
The Snowflake Connector for Spark enables using Snowflake as an Apache Spark data source, similar to other data sources (PostgreSQL, HDFS, S3, etc.).
Interaction Between Snowflake and Spark
The connector supports bi-directional data movement between a Snowflake cluster and a Spark cluster. The Spark cluster can be self-hosted or accessed through another service, such as Qubole, AWS EMR, or Databricks.
Using the connector, you can perform the following operations:
- Populate a Spark DataFrame from a table (or query) in Snowflake.
- Write the contents of a Spark DataFrame to a table in Snowflake.
The connector uses Scala 2.12.x to perform these operations and uses the Snowflake JDBC driver to communicate with Snowflake.
The Snowflake Connector for Spark is not strictly required to connect Snowflake and Apache Spark; third-party JDBC drivers can be used instead. However, we recommend the Snowflake Connector for Spark because, in conjunction with the Snowflake JDBC driver, it has been optimized for transferring large amounts of data between the two systems. It also improves performance by supporting query pushdown from Spark into Snowflake.
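The read and write operations listed above can be sketched in PySpark roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names (`snowflake_options`, `read_table`, `read_query`, `write_table`) are hypothetical, the connection values are placeholders, and the option names (`sfURL`, `sfUser`, `dbtable`, `query`, and so on) and the data source name are assumed to match the connector version in use. A `SparkSession` with the connector and the Snowflake JDBC driver on its classpath is assumed to exist already.

```python
# Fully qualified data source name of the connector (assumed).
SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"


def snowflake_options(url, user, password, database, schema, warehouse):
    """Collect connection options in one dict (hypothetical helper)."""
    return {
        "sfURL": url,
        "sfUser": user,
        "sfPassword": password,
        "sfDatabase": database,
        "sfSchema": schema,
        "sfWarehouse": warehouse,
    }


def read_table(spark, options, table):
    """Populate a Spark DataFrame from a Snowflake table."""
    return (
        spark.read.format(SNOWFLAKE_SOURCE)
        .options(**options)
        .option("dbtable", table)
        .load()
    )


def read_query(spark, options, query):
    """Populate a Spark DataFrame from the result of a Snowflake query."""
    return (
        spark.read.format(SNOWFLAKE_SOURCE)
        .options(**options)
        .option("query", query)
        .load()
    )


def write_table(df, options, table, mode="append"):
    """Write the contents of a Spark DataFrame to a Snowflake table."""
    (
        df.write.format(SNOWFLAKE_SOURCE)
        .options(**options)
        .option("dbtable", table)
        .mode(mode)
        .save()
    )
```

With an existing session `spark`, a call such as `read_table(spark, opts, "MY_TABLE")` would return a DataFrame, and `write_table(df, opts, "MY_COPY", mode="overwrite")` would write it back; both table names are placeholders.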
The exchange of data between the two systems is facilitated through a Snowflake internal stage that the connector automatically creates and manages:
- Upon connecting to Snowflake and initializing a session in Snowflake, the connector creates the internal stage.
- Throughout the duration of the Snowflake session, the connector uses the stage to store data while transferring it to its destination.
- At the end of the Snowflake session, the connector drops the stage, thereby removing all the temporary data in the stage.
The internal stage is supported in version 2.2.0 (and later) of the connector. For earlier versions, the data exchange is facilitated through an S3 bucket, which must be created and configured as part of the connector installation/configuration.
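For connector versions that rely on an S3 bucket, the extra configuration might look like the sketch below. It assumes the connector accepts `tempDir`, `awsAccessKey`, and `awsSecretKey` options for the external location; check these names against the documentation for the installed version. The helper name `with_external_s3` and all values are hypothetical placeholders.

```python
def with_external_s3(options, temp_dir, aws_access_key, aws_secret_key):
    """Return a copy of the connection options extended with S3 settings.

    The option names used here are assumptions about the connector's
    external-transfer configuration, not confirmed by this page.
    """
    merged = dict(options)  # leave the caller's dict untouched
    merged.update(
        {
            "tempDir": temp_dir,             # S3 location for intermediate data
            "awsAccessKey": aws_access_key,  # credentials for that bucket
            "awsSecretKey": aws_secret_key,
        }
    )
    return merged
```

The returned dict would then be passed to the Spark reader or writer in place of the plain connection options. With version 2.2.0 or later and the internal stage, none of these settings should be needed.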
Version 2.1.0 (and later) of the Snowflake Connector for Spark provides enhanced performance through query pushdown. For optimal performance, you typically want to avoid reading large amounts of data or transferring large intermediate results between systems. Ideally, most of the processing should happen close to where the data is stored, so that each participating store can use its own capabilities to dynamically eliminate data that is not needed.
Query pushdown leverages these performance efficiencies by enabling large and complex Spark logical plans (in their entirety or in parts) to be processed in Snowflake, thus using Snowflake to do most of the actual work.
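As an illustration of the kind of work pushdown can move into Snowflake, the sketch below reads a table and applies a filter and an aggregation in Spark. With pushdown enabled, the connector may translate such a logical plan into SQL that runs in Snowflake, so only the aggregated rows travel to Spark. The `autopushdown` option name is assumed from the connector's configuration options; the table `ORDERS`, its columns, and both helper names are hypothetical.

```python
def pushdown_options(options, enabled=True):
    """Return a copy of the options with pushdown switched on or off.

    "autopushdown" is assumed to be the connector option controlling this.
    """
    merged = dict(options)
    merged["autopushdown"] = "on" if enabled else "off"
    return merged


def shipped_revenue_by_region(spark, options, table="ORDERS"):
    """Filter and aggregate a Snowflake table through the DataFrame API.

    The filter and the grouped sum are candidates for pushdown: Snowflake
    can evaluate them and return one row per region.
    """
    orders = (
        spark.read.format("net.snowflake.spark.snowflake")
        .options(**options)
        .option("dbtable", table)
        .load()
    )
    return (
        orders.filter(orders["STATUS"] == "SHIPPED")
        .groupBy("REGION")
        .sum("AMOUNT")
    )
```

Calling `explain()` on the returned DataFrame is one way to inspect the physical plan and see how much of the work was handed to Snowflake.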
Qubole has integrated the Snowflake Connector for Spark into the Qubole Data Service (QDS) ecosystem to provide native connectivity between Spark and Snowflake. Through this integration, Snowflake can be added as a Spark data store directly in Qubole.
Once Snowflake has been added as a Spark data store, data engineers and data scientists can use Spark and the QDS UI, API, and Notebooks to:
- Perform advanced data transformations, such as preparing and consolidating external data sources into Snowflake, or refining and transforming Snowflake data.
- Build, train, and execute machine learning and AI models in Spark using the data that already exists in Snowflake.