Installing and Configuring the Spark Connector

Multiple versions of the connector are supported; however, Snowflake strongly recommends using the most recent version. For the latest version, see the Spark Connector Release Notes.

The instructions in this topic can be used to install and configure all supported versions of the connector.

Comparison of Supported Versions

Note the following supported versions of the connector:

+-----------------------------------+----------------------------------+
|                                   | Snowflake Connector 2.x          |
+===================================+==================================+
| Supported Spark versions          | Spark 2.1, 2.2, 2.3              |
+-----------------------------------+----------------------------------+
| Supported Scala versions          | Scala 2.10 and 2.11              |
+-----------------------------------+----------------------------------+
| Data source name                  | net.snowflake.spark.snowflake    |
+-----------------------------------+----------------------------------+
| Package name for imported classes | net.snowflake.spark.snowflake    |
+-----------------------------------+----------------------------------+
| Package distribution              | Maven Central Repository or      |
|                                   | Spark Packages                   |
+-----------------------------------+----------------------------------+
| Source code branch in             | master or                        |
| spark-snowflake (GitHub)          | previous_spark_version           |
+-----------------------------------+----------------------------------+

The developer notes for the different versions are hosted with the source code.

Note

Snowflake generally supports the three most recent versions of Spark.

The default connector typically supports only the latest Spark version. To use the Snowflake Spark connector with versions of Spark that are older, but still supported, you might need to download a version of the connector that is specific to your Spark version. For example, to use Spark version 2.2 with version 2.4.4 of the Snowflake connector, download the 2.4.4-spark_2.2 version of the connector.

Prerequisites

To use Snowflake with Spark, you need an Apache Spark environment, either self-hosted or hosted in Qubole Data Service, Databricks, or Amazon EMR.

In addition, you can use a dedicated AWS S3 bucket or Azure Blob Storage container as a staging zone between the two systems; however, this is not required with version 2.2.0 (and higher) of the connector, which uses a temporary Snowflake internal stage (by default) for all data exchange.

Note

If you are using Databricks or Qubole to host Spark, you do not need to install the Snowflake Connector for Spark (or any of its prerequisites). Both Databricks and Qubole have integrated the connector to provide native connectivity.


Downloading and Installing Version 2.x

The instructions in this section pertain to version 2.x and higher of the Snowflake Connector for Spark.

Important

Snowflake periodically releases new versions of the connector. The following installation tasks must be performed each time you install a new version. This also applies to the Snowflake JDBC driver, which is a prerequisite for the connector.

Step 1: Download the Latest Version of the Snowflake JDBC Driver

The Snowflake JDBC driver is required to use the Snowflake Spark Connector.

The Snowflake JDBC driver is provided as a standard Java package through the Maven Central Repository. You can either download the package as a .jar file or you can directly reference the package. These instructions assume you are referencing the package.

For more information, see JDBC Driver.

Step 2: Download the Latest Version of the Snowflake Spark Connector

There are many different versions of the Snowflake Spark Connector. You’ll need to download the version that is compatible with:

  • The version of Spark you are using.
  • The version of Scala you are using.
  • The version of Snowflake you are using.

The Snowflake Spark Connector can be downloaded from either Maven or the Spark Packages web site. More details are below.

Maven

The Maven web site’s download page for the Spark connector looks similar to the following:

+---------------+-----------------------+-----------------+
| Group ID      |  Artifact ID          | Latest Version  |
+===============+=======================+=================+
| net.snowflake | spark-snowflake_2.11  | 2.4.5-spark_2.2 |
| ...           |                       |                 |
+---------------+-----------------------+-----------------+

Or, more generally:

+---------------+-----------------------+-----------------+
| Group ID      |  Artifact ID          | Latest Version  |
+===============+=======================+=================+
| net.snowflake | spark-snowflake_C.CC  | N.N.N-spark_P.P |
| ...           |                       |                 |
+---------------+-----------------------+-----------------+

where

  • C.CC is the sCala version (e.g. 2.11).
  • N.N.N is the sNowflake version (e.g. 2.4.5).
  • P.P is the sPark version (e.g. 2.2).
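Putting the pieces together, the full Maven coordinate (group:artifact:version) for a given combination of versions can be assembled mechanically. A minimal sketch in shell; the version numbers here are examples only, so substitute the versions you actually need:

```shell
# Assemble the Maven coordinate for the Spark connector from its parts.
# These version numbers are examples, not recommendations.
SCALA_VERSION="2.11"       # C.CC
CONNECTOR_VERSION="2.4.5"  # N.N.N
SPARK_VERSION="2.2"        # P.P

COORDINATE="net.snowflake:spark-snowflake_${SCALA_VERSION}:${CONNECTOR_VERSION}-spark_${SPARK_VERSION}"
echo "${COORDINATE}"
# Prints: net.snowflake:spark-snowflake_2.11:2.4.5-spark_2.2
```

The same coordinate string can be passed directly to the spark-shell --packages option, as shown in the configuration step later in this topic.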

For version 2.11 of Scala, download from:

For version 2.10 of Scala, download from:

Spark Packages

The naming convention of the files on the Spark Packages web site is:

N.N.N-s_C.CC-spark_P.P

where

  • C.CC is the sCala version (e.g. 2.11).
  • N.N.N is the sNowflake version (e.g. 2.4.5).
  • P.P is the sPark version (e.g. 2.2).

For example:

2.4.5-s_2.11-spark_2.1

You can download the connector from the Spark Packages web site: Spark Connector.
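As a quick sanity check, a downloaded file name can be matched against the naming pattern shown in the example above. A small sketch using grep; the file name is an example only:

```shell
# Check that a connector version string follows the Spark Packages
# naming pattern N.N.N-s_C.CC-spark_P.P (e.g. 2.4.5-s_2.11-spark_2.1).
name="2.4.5-s_2.11-spark_2.1"   # example name, substitute your download
if echo "${name}" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+-s_[0-9]+\.[0-9]+-spark_[0-9]+\.[0-9]+$'; then
  echo "name matches the expected pattern"
else
  echo "name does NOT match the expected pattern"
fi
```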

Note for GitHub Users

The source code for the Spark Snowflake Connector is available on GitHub. However, the compiled packages are not available on GitHub; you can download them from Maven or the Spark Packages web site as described above.

Step 3: Verify the Snowflake Connector for Spark Package Signature (Linux Only) — Optional

To optionally verify the Snowflake Connector for Spark package signature for Linux:

Note

The macOS and Windows operating systems can verify the package signature automatically, so GPG signature verification is not needed.

  1. Download and import the latest Snowflake GPG public key from the public keyserver:

    $ gpg --keyserver hkp://keys.gnupg.net --recv-keys 93DB296A69BE019A
    
  2. Download the GPG signature file along with the connector package and verify the signature:

    $ gpg --verify spark-snowflake_2.11-2.1.2-spark_2.0.jar.asc spark-snowflake_2.11-2.1.2-spark_2.0.jar
    gpg: Signature made Wed 22 Feb 2017 04:31:58 PM UTC using RSA key ID 93DB296A69BE019A
    gpg: Good signature from "Snowflake Computing <snowflake_gpg@snowflake.net>"
    
  3. Your local environment can contain multiple GPG keys; however, for security reasons, Snowflake periodically rotates the public GPG key. As a best practice, we recommend deleting the existing public key after confirming that the latest key works with the latest signed package. For example:

    $ gpg --delete-key "Snowflake Computing"
    

Step 4: Configure the Local Spark Cluster or Amazon EMR-hosted Spark Environment

If you have a local Spark installation, or a Spark installation in Amazon EMR, you need to configure the spark-shell program to include both the Snowflake JDBC driver and the Spark Connector:

  • To include the Snowflake JDBC driver, use the --packages option to reference the JDBC package hosted in the Maven Central Repository, providing the exact version of the driver you wish to use (e.g. net.snowflake:snowflake-jdbc:3.0.12).
  • To include the Spark Connector, use the --packages option to reference the appropriate package (Scala 2.10 or Scala 2.11) hosted in the Maven Central Repository, providing the exact version of the connector you want to use (e.g. net.snowflake:spark-snowflake_2.10:2.0.0).

For example:

spark-shell --packages net.snowflake:snowflake-jdbc:3.0.12,net.snowflake:spark-snowflake_2.10:2.0.0

Installing Additional Packages (If Needed)

Depending on your Spark installation, some packages required by the connector may be missing. You can add missing packages to your installation by using the appropriate flag for spark-shell:

  • --packages
  • --jars (if the packages were downloaded as .jar files)

The required packages are listed below, with the syntax (including version number) for using the --packages flag to reference the packages:

  • org.apache.hadoop:hadoop-aws:2.7.1
  • org.apache.httpcomponents:httpclient:4.3.6
  • org.apache.httpcomponents:httpcore:4.3.3
  • com.amazonaws:aws-java-sdk-core:1.10.27
  • com.amazonaws:aws-java-sdk-s3:1.10.27
  • com.amazonaws:aws-java-sdk-sts:1.10.27

For example, if the Apache packages (hadoop-aws, httpclient, and httpcore) are missing, add them by reference:

spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.1,org.apache.httpcomponents:httpclient:4.3.6,org.apache.httpcomponents:httpcore:4.3.3

Preparing an External Location For Files

You might need to prepare an external location for files that you want to transfer between Snowflake and Spark.

Note

This task is only required in either of the following circumstances:

  • The Snowflake Connector for Spark version is 2.1.x (or lower). Starting with v2.2.0, the connector uses a Snowflake internal temporary stage for all data exchange. If you are not currently using version 2.2.0 (or higher) of the connector, Snowflake strongly recommends upgrading to the latest version.
  • The Snowflake Connector for Spark version is 2.2.0 (or higher), but your jobs regularly exceed 36 hours in length. This is the maximum duration for the token used by the connector to access the internal stage for data exchange.

Preparing an AWS External S3 Bucket

Prepare an external S3 bucket that the connector can use to exchange data between Snowflake and Spark. You then provide the location information, together with the necessary AWS credentials for the location, to the connector. For more details, see Authenticating S3 for Data Exchange in the next topic.

Important

If you use an external S3 bucket, the connector does not automatically remove any intermediate/temporary data from this location. As a result, it’s best to use a specific bucket or path (prefix) and set a lifecycle policy on the bucket/path to clean up older files automatically. See the Amazon S3 documentation for more details on configuring a lifecycle policy.
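As an illustration, a lifecycle configuration that expires staged objects under a given prefix after a few days might look like the following sketch. The rule ID, prefix, and expiration period are placeholders, not recommendations; choose values appropriate for your workloads:

```json
{
  "Rules": [
    {
      "ID": "expire-spark-snowflake-staging",
      "Filter": { "Prefix": "snowflake-staging/" },
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    }
  ]
}
```

A configuration like this can be applied to the bucket through the AWS console or the AWS CLI; see the Amazon S3 lifecycle documentation for the exact procedure.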

Preparing an Azure Blob Storage Container

Prepare an external Azure Blob Storage Container that the connector can use to exchange data between Snowflake and Spark. You then provide the location information, together with the necessary Azure credentials for the location, to the connector. For more details, see Authenticating Azure for Data Exchange in the next topic.