Installing and Configuring the Snowflake Connector for Spark

Multiple versions of the connector are supported:

  • Version 2.x, including:
    • 2.0.x
    • 2.1.x
    • 2.2.x (latest version is 2.2.1)
  • Version 1.x (deprecated)

This topic describes how to install and configure all supported versions of the connector; however, version 1.x is deprecated and no further updates are planned. The installation and configuration instructions for version 1.x are provided primarily for backward compatibility. We strongly encourage you to install or upgrade to version 2.x.

All examples in this document refer to version 2.x, but should also work for version 1.x with minimal or no changes.

Supported Version Comparison

Note the following details and differences between the supported versions of the connector:

  Feature                             Snowflake Connector 2.x           Snowflake Connector 1.x (Deprecated)
  Supported Spark versions            Spark 2.0 and 2.1                 Spark 1.5 and 1.6
  Supported Scala versions            Scala 2.10 and 2.11               Scala 2.10
  Data source name                    net.snowflake.spark.snowflake     com.snowflakedb.spark.snowflakedb
  Package name for imported classes   net.snowflake.spark.snowflake     com.snowflakedb.spark.snowflakedb
  Package distribution                Maven Central or Spark Packages   .jar file (no longer available)
  Source code branch (in GitHub)      master or previous_spark_version  branch-1.x

The developer notes for the different versions are hosted with the source code.
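
The data source and package names in the table determine how the connector is addressed from Spark code. As a rough sketch (the connection option values are placeholders; confirm the sf* option names against the configuration topic), reading a table with version 2.x looks like the following, and version 1.x differs only in the data source name:

    // Minimal sketch: reading a Snowflake table with connector version 2.x.
    // Connection values are placeholders; see the configuration topic for the
    // full list of connector options.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("snowflake-example").getOrCreate()

    val sfOptions = Map(
      "sfURL"       -> "<account>.snowflakecomputing.com",
      "sfUser"      -> "<user>",
      "sfPassword"  -> "<password>",
      "sfDatabase"  -> "<database>",
      "sfSchema"    -> "<schema>",
      "sfWarehouse" -> "<warehouse>"
    )

    // Version 2.x data source name; version 1.x (deprecated) uses
    // com.snowflakedb.spark.snowflakedb instead.
    val df = spark.read
      .format("net.snowflake.spark.snowflake")
      .options(sfOptions)
      .option("dbtable", "MY_TABLE")
      .load()

    df.show()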

Prerequisites

To use Snowflake with Spark, you need an Apache Spark environment, either self-hosted or using Amazon EMR or Databricks.

In addition, you can use a dedicated AWS S3 location as a staging zone between the two systems; however, with version 2.2.0 (or higher) of the connector this is not required, because the connector uses a temporary Snowflake internal stage by default for all data exchange. Version 2.1.x (or lower) requires an S3 staging location; see Preparing an S3 Location — Required for Version 2.1.x (or Lower) later in this topic.

Installing the Connector — Version 2.x

The instructions in this section pertain to version 2.x and higher of the Snowflake Connector for Spark. If you are installing version 1.x, see Installing the Connector — Version 1.x (Deprecated) instead.

Important

Snowflake periodically releases new versions of the connector. The following installation tasks must be performed each time you install a new version. This also applies to the Snowflake JDBC driver, which is a prerequisite for the connector.

Step 1: Verify the Latest Version of the Snowflake JDBC Driver

The Snowflake JDBC driver is provided as a standard Java package through Maven Central. You can either download the package as a .jar file or reference the package directly. These instructions assume you are referencing the package.

For more information, see Snowflake JDBC Driver.
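
If you build your Spark application as an sbt project rather than launching spark-shell with the --packages option (see Step 3), the driver can be referenced through the same Maven coordinates. A minimal build.sbt sketch; the version number is illustrative, so substitute the latest available driver:

    // build.sbt -- reference the Snowflake JDBC driver from Maven Central.
    // The version shown is illustrative; use the latest driver release.
    libraryDependencies += "net.snowflake" % "snowflake-jdbc" % "3.0.12"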

Step 2: Verify the Snowflake Connector for Spark Package Signature (Linux Only) — Optional

To verify the package signature for the Snowflake Connector for Spark on Linux:

Note

macOS and Windows can verify the package signature automatically, so GPG signature verification is not needed on those platforms.

  1. Download and import the latest Snowflake GPG public key from the public keyserver:

    $ gpg --keyserver hkp://keys.gnupg.net --recv-keys 69BE019A
    
  2. Download the GPG signature along with the connector package and verify the signature:

    $ gpg --verify spark-snowflake_2.11-2.1.2-spark_2.0.jar.asc spark-snowflake_2.11-2.1.2-spark_2.0.jar
    gpg: Signature made Wed 22 Feb 2017 04:31:58 PM UTC using RSA key ID 69BE019A
    gpg: Good signature from "Snowflake Computing <snowflake_gpg@snowflake.net>"
    
  3. Your local environment can contain multiple GPG keys; however, for security reasons, Snowflake periodically rotates the public GPG key. As a best practice, we recommend deleting the existing public key after confirming that the latest key works with the latest signed package. For example:

    $ gpg --delete-key "Snowflake Computing"
    

Step 3: Configure the Local Spark Cluster or Amazon EMR-hosted Spark Environment

If you have a local Spark installation, or a Spark installation in Amazon EMR, you need to configure the spark-shell program to include both the Snowflake JDBC driver and the Spark Connector:

  • To include the Snowflake JDBC driver, use the --packages option to reference the JDBC package hosted in Maven Central, providing the exact version of the driver you wish to use, e.g. net.snowflake:snowflake-jdbc:3.0.12.
  • To include the Spark Connector, use the --packages option to reference the appropriate package (Scala 2.10 or Scala 2.11) hosted in Maven Central, providing the exact version of the connector you want to use, e.g. net.snowflake:spark-snowflake_2.10:2.0.0.

For example:

spark-shell --packages net.snowflake:snowflake-jdbc:3.0.12,net.snowflake:spark-snowflake_2.10:2.0.0
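
After spark-shell starts with both packages, a quick sanity check from the Scala prompt is to confirm that the driver and connector classes resolved. The JDBC driver class below is the driver's standard class; the connector's DefaultSource class name is an assumption based on how Spark resolves data source package names rather than a documented entry point:

    // Run at the scala> prompt of the spark-shell session started above.
    // Each call returns a Class object if the corresponding package resolved;
    // a ClassNotFoundException means the --packages coordinate did not load.
    Class.forName("net.snowflake.client.jdbc.SnowflakeDriver")    // Snowflake JDBC driver
    Class.forName("net.snowflake.spark.snowflake.DefaultSource")  // Spark Connector (assumed class name)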

Step 4: Configure Databricks to Use the Connector

If you wish to use spark-snowflake with Databricks, you must create a library referencing the necessary Maven artifacts:

  1. Log into your Databricks account.

  2. Click the Workspace icon on the left.

  3. In the Workspace panel that appears, right-click to display the menu and select Create > Library.

  4. In the Create Library page, choose Maven Coordinate for Source.

  5. In the Coordinate search bar, enter snowflake and select the appropriate artifact for your version of Spark.

  6. Click the Create Library button.

  7. On the next screen, check the box for Attach automatically to all clusters.

Repeat the same process for the JDBC driver.

Installing the Connector — Version 1.x (Deprecated)

The instructions in this section pertain to version 1.x (and earlier) of the Snowflake Connector for Spark. If you are installing version 2.0 or higher, see Installing the Connector — Version 2.x instead.

Step 1: Verify Latest Version of the Snowflake JDBC Driver

The Snowflake JDBC driver is provided as a standard Java package through Maven Central. You can either download the file or reference the package.

For more information, see Snowflake JDBC Driver.

Step 2: Verify JAR File for Snowflake Connector for Spark

Version 1.x of the Snowflake Connector for Spark (for Spark 1.5 and 1.6) is supplied as a .jar file (spark-snowflakedb-<version>.jar). The file was distributed through the Snowflake web interface, but is no longer available. These instructions assume you already have the .jar file.

The .jar file will be referred to at various times during the remainder of the installation. To make installation easier, you may want to use a generic file name (e.g. spark-snowflakedb.jar) instead of the actual file name by either renaming the file or creating a symbolic link to the file.

Note

All remaining references to the .jar file in these installation instructions use spark-snowflakedb.jar as the file name. If you did not choose to rename the file, replace all references to spark-snowflakedb.jar with the actual file name.

Step 3: Configure the Local Spark Cluster or Amazon EMR-hosted Spark Environment

If you have a local Spark installation, or a Spark installation in Amazon EMR, you need to configure the spark-shell program to include both the Snowflake JDBC driver and the Spark Connector:

  • To include the Snowflake JDBC driver, use the --packages option to reference the JDBC package hosted in Maven Central, providing the exact version of the driver you wish to use, e.g. net.snowflake:snowflake-jdbc:3.0.12.
  • To include the Spark Connector, use the --jars option to point to the .jar file.

For example:

spark-shell --packages net.snowflake:snowflake-jdbc:3.0.12 --jars /<path>/<to>/spark-snowflakedb.jar
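
For completeness, a minimal read with the deprecated 1.x connector looks much like the 2.x sketch shown earlier; only the data source name changes, and Spark 1.5/1.6 exposes sqlContext in spark-shell rather than a SparkSession. The connection option values are placeholders:

    // Minimal sketch for the deprecated 1.x connector in a Spark 1.5/1.6 spark-shell.
    val sfOptions = Map(
      "sfURL"       -> "<account>.snowflakecomputing.com",
      "sfUser"      -> "<user>",
      "sfPassword"  -> "<password>",
      "sfDatabase"  -> "<database>",
      "sfSchema"    -> "<schema>",
      "sfWarehouse" -> "<warehouse>"
    )

    // The 1.x connector is addressed by its own data source name.
    val df = sqlContext.read
      .format("com.snowflakedb.spark.snowflakedb")
      .options(sfOptions)
      .option("dbtable", "MY_TABLE")
      .load()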

Step 4: Configure Databricks to Use the Connector

To use spark-snowflakedb with Databricks, add both the spark-snowflakedb.jar and snowflake_jdbc.jar as libraries:

  1. Log into your Databricks account.

  2. Click the Workspace icon on the left.

  3. In the Workspace panel that appears, right-click to display the menu and select Create > Library.

  4. In the Create Library page, enter spark-snowflakedb in Library Name.

  5. Drag the spark-snowflakedb.jar file into the JAR File box.

  6. When the file is done loading, click the Create Library button.

  7. On the next screen, check the box for Attach automatically to all clusters.

Repeat the same process for the snowflake_jdbc.jar file, naming it accordingly.

Installing Additional Packages (If Needed)

Note

This task applies to both version 2.x and 1.x of the connector.

Depending on your Spark installation, some packages required by the connector may be missing. You can add missing packages to your installation by using the appropriate flag for spark-shell:

  • --packages
  • --jars (if the packages were downloaded as .jar files)

The required packages are listed below, with the syntax (including version number) for using the --packages flag to reference the packages:

  • org.apache.hadoop:hadoop-aws:2.7.1
  • org.apache.httpcomponents:httpclient:4.3.6
  • org.apache.httpcomponents:httpcore:4.3.3
  • com.amazonaws:aws-java-sdk-core:1.10.27
  • com.amazonaws:aws-java-sdk-s3:1.10.27
  • com.amazonaws:aws-java-sdk-sts:1.10.27

For example, if the Apache packages are missing, add them by reference:

spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.1,org.apache.httpcomponents:httpclient:4.3.6,org.apache.httpcomponents:httpcore:4.3.3

Preparing an S3 Location — Required for Version 2.1.x (or Lower)

Note

This task is required only for version 2.1.x (or lower) of the connector. Starting with v2.2.0, the connector uses a Snowflake internal stage for all data exchange, which eliminates this requirement. If you are not currently using version 2.2.0 (or higher) of the connector, Snowflake strongly recommends upgrading to the latest version.

If you are using an older version of the connector, you need to prepare an S3 location that the connector can use to exchange data between Snowflake and Spark.

You then provide the location information, together with the necessary AWS credentials for the location, to the connector. For more details, see Authenticating S3 for Data Exchange — Required for Version 2.1.x (or Lower) in the next topic.

Important

If you use an S3 location, the connector does not automatically remove any intermediate/temporary data from this location. As a result, it’s best to use a specific bucket or directory and set a lifecycle policy on the bucket/directory to clean up older files automatically. See the Amazon S3 documentation for more details on configuring a lifecycle policy.
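
To illustrate how an S3 staging location fits into a job for these older connector versions, the following Scala sketch supplies the bucket path and AWS credentials. The tempdir option name is an assumption based on common usage for these versions; confirm the exact parameter names in the next topic before relying on them. The Hadoop configuration keys shown are the standard s3n credential settings:

    // Sketch for connector versions 2.1.x (or lower), which stage data in S3.
    // The "tempdir" option name is an assumption; verify it against the
    // parameter reference in the next topic.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("snowflake-s3-staging").getOrCreate()

    // Standard Hadoop configuration keys for s3n access to the staging bucket.
    spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<aws_access_key_id>")
    spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<aws_secret_access_key>")

    // Connection options as defined in the earlier example in this topic.
    val sfOptions: Map[String, String] = Map(/* sfURL, sfUser, sfPassword, ... */)

    val df = spark.read
      .format("net.snowflake.spark.snowflake")
      .options(sfOptions)
      .option("tempdir", "s3n://<bucket>/<staging-prefix>/")  // dedicated staging location (assumed option name)
      .option("dbtable", "MY_TABLE")
      .load()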