Run Spark workloads from a third-party client

You can run Spark workloads interactively from Jupyter Notebooks, VS Code, or any Python-based interface without needing to manage a Spark cluster. The workloads run on the Snowflake infrastructure.

To get started, complete the following tasks:

  1. Confirm that you have prerequisites.

  2. Set up your environment to connect with Snowpark Connect for Spark on Snowflake.

  3. Install Snowpark Connect for Spark.

  4. Run PySpark code from your client to run on Snowflake.

Prerequisites

Confirm that your Python and Java installations are built for the same computer architecture. For example, if Python is built for arm64, Java must also be arm64 (not x86_64, for example).
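
The architecture check above can be scripted. The following is a minimal sketch, not an official tool: it assumes a java executable is on your PATH and that the os.arch property printed by java -XshowSettings:properties reflects the architecture of the JVM. The function names and the alias table are hypothetical, introduced only for illustration.

```python
import platform
import re
import subprocess

# Hypothetical alias table: tools report the same CPU family under different names.
_ARCH_ALIASES = {"aarch64": "arm64", "amd64": "x86_64"}


def normalize_arch(name: str) -> str:
    """Map architecture aliases to one spelling, such as aarch64 -> arm64."""
    lowered = name.strip().lower()
    return _ARCH_ALIASES.get(lowered, lowered)


def parse_java_arch(settings_output: str):
    """Extract the os.arch value from `java -XshowSettings:properties` output."""
    match = re.search(r"^\s*os\.arch\s*=\s*(\S+)", settings_output, re.MULTILINE)
    return normalize_arch(match.group(1)) if match else None


def java_arch():
    """Ask the `java` executable on PATH for its architecture, or return None."""
    try:
        result = subprocess.run(
            ["java", "-XshowSettings:properties", "-version"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return None
    # Java typically writes these settings to stderr; stdout is searched too.
    return parse_java_arch(result.stderr + result.stdout)


def architectures_match() -> bool:
    """Return True when Python and Java report the same architecture."""
    return normalize_arch(platform.machine()) == java_arch()
```

Calling architectures_match() from the Python interpreter that you plan to use for Snowpark Connect for Spark returns False when Java is missing or reports a different architecture.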

Set up your environment

You can set up your development environment by ensuring that your code can connect to Snowpark Connect for Spark on Snowflake. To connect to Snowflake, client code uses a .toml file that contains connection details.

If you have Snowflake CLI installed, you can use it to define a connection. Otherwise, you can manually write connection parameters in a config.toml file.

Add a connection by using Snowflake CLI

You can use Snowflake CLI to add connection properties that Snowpark Connect for Spark can use to connect to Snowflake. Your changes are saved to a config.toml file.

  1. Run the snow connection add command to add a connection.

    snow connection add
    
  2. Follow the prompts to define a connection.

    Be sure to specify spark-connect as the connection name.

    This command adds a connection to your config.toml file, as in the following example:

    [connections.spark-connect]
    host = "example.snowflakecomputing.com"
    port = 443
    account = "example"
    user = "test_example"
    password = "password"
    protocol = "https"
    warehouse = "example_wh"
    database = "example_db"
    schema = "public"
    
  3. Run the following commands to confirm that the connection works.

    This test is available for connections that you added by using Snowflake CLI.

    snow connection list
    snow connection test --connection spark-connect
    

Add a connection by manually writing a connection file

You can manually write or update a connections.toml file so that your code can connect to Snowpark Connect for Spark on Snowflake.

  1. Run the following command to ensure that your connections.toml file allows only the owner (user) to have read and write access.

    chmod 0600 ~/.snowflake/connections.toml
    
  2. Edit the connections.toml file so that it contains a [spark-connect] connection with the connection properties in the following example.

    Be sure to replace values with your own connection specifics.

    [spark-connect]
    host="my_snowflake_account.snowflakecomputing.com"
    account="my_snowflake_account"
    user="my_user"
    password="&&&&&&&&"
    warehouse="my_wh"
    database="my_db"
    schema="public"
    

Install Snowpark Connect for Spark

You can install Snowpark Connect for Spark as a Python package.

  1. Create a Python virtual environment.

    For example, you can use Conda, as shown in the following commands.

    conda create -n xxxx pip python=3.12
    conda activate xxxx
    
  2. Install the Snowpark Connect for Spark package.

    pip install --upgrade --force-reinstall snowpark-connect
    
  3. Add Python code to start a Snowpark Connect for Spark server and create a Snowpark Connect for Spark session.

    import os
    import snowflake.snowpark
    from snowflake import snowpark_connect
    # Import snowpark_connect before importing pyspark libraries
    from pyspark.sql.types import Row
    
    os.environ["SPARK_CONNECT_MODE_ENABLED"] = "1"
    snowpark_connect.start_session()  # Start the local Snowpark Connect for Spark session
    spark = snowpark_connect.get_session()
    

Run Python code from your client

Once you have an authenticated connection in place, you can write code as you normally would.

You can run PySpark code that connects to Snowpark Connect for Spark by using the PySpark client library.

from pyspark.sql import Row

# Build a small DataFrame; `spark` is the session created in the previous section
df = spark.createDataFrame([
    Row(a=1, b=2.),
    Row(a=2, b=3.),
    Row(a=4, b=5.),
])

print(df.count())

Getting Started with Snowpark Connect for Scala Applications

This guide walks you through setting up Snowpark Connect and connecting your Scala applications to the Snowpark Connect server.

Step 1: Set Up Snowpark Connect Environment

For instructions on setting up your Python environment and installing Snowpark Connect, follow the official Snowpark Connect documentation.

The key steps include:

  1. Creating a Python virtual environment

  2. Installing Snowpark Connect for Spark

  3. Setting up your connection configuration

Step 2: Create Snowpark Connect Server Script

Create a Python script to launch the Snowpark Connect server.

  1. Create a file named launch-snowpark-connect.py with the following contents.

    from snowflake import snowpark_connect
    
    def main():
        snowpark_connect.start_session(is_daemon=False, remote_url="sc://localhost:15002")
        print("SAS started on port 15002")
    
    if __name__ == "__main__":
        main()
    
  2. Launch Snowpark Connect Server

    # Make sure you're in the correct Python environment
    pyenv activate your-snowpark-connect-env
    
    # Run the server script
    python launch-snowpark-connect.py
    
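A client that starts before the server is listening fails to connect, so it can be useful to wait for the port first. The following sketch uses only the Python standard library. The wait_for_port helper is hypothetical, and the default host and port are taken from the remote_url in the script above. A successful TCP connection shows only that something is listening on the port, not that it is the Snowpark Connect server.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 15002, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Open and immediately close a connection to probe the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Nothing is listening yet; pause briefly and retry.
            time.sleep(0.2)
    return False
```

For example, a wrapper script could call wait_for_port() after starting the server in a separate terminal and run the Scala application only when the call returns True.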

Step 3: Set Up Scala Application

  1. Add the Spark Connect client dependency to your build.sbt file:

    libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.3"
    
    // Add JVM options for Java 9+ module system compatibility
    javaOptions ++= Seq(
      "--add-opens=java.base/java.nio=ALL-UNNAMED"
    )
    
  2. Write Scala code that connects to the Snowpark Connect for Spark server.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.connect.client.REPLClassDirMonitor
    
    object SnowparkConnectExample {
      def main(args: Array[String]): Unit = {
        // Create Spark session with Snowpark Connect
        val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
    
        // Register ClassFinder for UDF support (if needed)
        // val classFinder = new REPLClassDirMonitor("target/scala-2.12/classes")
        // spark.registerClassFinder(classFinder)
    
        try {
          // Simple DataFrame operations
          import spark.implicits._
    
          val data = Seq(
            (1, "Alice", 25),
            (2, "Bob", 30),
            (3, "Charlie", 35)
          )
    
          val df = spark.createDataFrame(data).toDF("id", "name", "age")
    
          println("Original DataFrame:")
          df.show()
    
          println("Filtered DataFrame (age > 28):")
          df.filter($"age" > 28).show()
    
          println("Aggregated result:")
          df.groupBy().avg("age").show()
    
        } finally {
          spark.stop()
        }
      }
    }
    
  3. Compile and run your application.

    # Compile your Scala application
    sbt compile
    
    # Run the application
    sbt "runMain SnowparkConnectExample"
    

UDF support on Snowpark Connect for Spark

When using user-defined functions or custom code, do one of the following:

  • Register a ClassFinder to monitor and upload class files.

    import org.apache.spark.sql.connect.client.REPLClassDirMonitor
    
    val classFinder = new REPLClassDirMonitor("/absolute/path/to/target/scala-2.12/classes")
    spark.registerClassFinder(classFinder)
    
  • Upload JAR dependencies if needed.

    spark.addArtifact("/absolute/path/to/dependency.jar")
    

Troubleshoot Snowpark Connect for Spark installation

With the following list of checks, you can troubleshoot Snowpark Connect for Spark installation and use.

  • Ensure that Java and Python are based on the same architecture.

  • Use the most recent Snowpark Connect for Spark package file, as described in Install Snowpark Connect for Spark.

  • Confirm that the python command runs PySpark code correctly in local execution, that is, without Snowflake connectivity.

    For example, execute a command such as the following:

    python your_pyspark_file.py
    
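To support the package check in the list above, you can query the installed version from Python. This is a minimal sketch that assumes the distribution name matches the pip install command shown earlier (snowpark-connect). The installed_version helper is hypothetical, and comparing the result against the latest published release is left to you.

```python
from importlib import metadata


def installed_version(distribution: str = "snowpark-connect"):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None
```

A result of None from installed_version() suggests that the package is not installed in the active virtual environment.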