Run Spark workloads from VS Code, Jupyter Notebooks, or a terminal

You can run Spark workloads interactively from Jupyter Notebooks, VS Code, or any Python-based interface without managing a Spark cluster. The workloads run on Snowflake infrastructure.

To run Spark workloads in this way, you complete the following tasks:

  1. Confirm that you meet the prerequisites.

  2. Set up your environment to connect with Snowpark Connect for Spark on Snowflake.

  3. Install Snowpark Connect for Spark.

  4. Run PySpark code from your client so that it executes on Snowflake.

Prerequisites

Confirm that your Python and Java installations are built for the same computer architecture. For example, if your Python build targets arm64, your Java build must also target arm64 (not x86_64).
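
To compare the two, you can run a quick check like the following. This is a minimal sketch, assuming java is on your PATH; the java command prints its settings, including the os.arch property, to stderr.

    # Print the architecture that this Python build targets (for example, arm64 or x86_64).
    import platform
    import subprocess

    print("Python architecture:", platform.machine())

    # Print Java settings; compare the os.arch property in the output.
    subprocess.run(["java", "-XshowSettings:properties", "-version"])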

Set up your environment

Set up your development environment by ensuring that your code can connect to Snowpark Connect for Spark on Snowflake. To connect to Snowflake, your client code uses a .toml file that contains connection details.

If you have Snowflake CLI installed, you can use it to define a connection. Otherwise, you can manually write connection parameters in a connections.toml file.

Add a connection by using Snowflake CLI

You can use Snowflake CLI to add connection properties that Snowpark Connect for Spark can use to connect to Snowflake. Your changes are saved to a config.toml file.

  1. Add a connection by running the snow connection add command.

    snow connection add
    
  2. Follow the prompts to define a connection.

    Be sure to specify spark-connect as the connection name.

    This command adds a connection to your config.toml file, as in the following example:

    [connections.spark-connect]
    host = "example.snowflakecomputing.com"
    port = 443
    account = "example"
    user = "test_example"
    password = "password"
    protocol = "https"
    warehouse = "example_wh"
    database = "example_db"
    schema = "public"
    
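    Note: Because config.toml can contain a password, consider restricting the file's permissions so that only your user can read and write it, as shown for connections.toml in the next procedure.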
  3. Run the following commands to confirm that the connection works.

    Because you added the connection by using Snowflake CLI, you can also use the CLI to list your connections and test the new one.

    snow connection list
    snow connection test --connection spark-connect
    

Add a connection by manually writing a connection file

You can manually write or update a connections.toml file so that your code can connect to Snowpark Connect for Spark on Snowflake.

  1. Run the following command to ensure that your connections.toml file allows only the owner (user) to have read and write access.

    chmod 0600 ~/.snowflake/connections.toml
    
  2. Edit the connections.toml file so that it contains a [spark-connect] connection with the connection properties in the following example.

    Be sure to replace values with your own connection specifics.

    [spark-connect]
    host="my_snowflake_account.snowflakecomputing.com"
    account="my_snowflake_account"
    user="my_user"
    password="&&&&&&&&"
    warehouse="my_wh"
    database="my_db"
    schema="public"
    
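    To sanity-check that the file parses and contains the expected profile, you can run a short script like the following. This is a minimal sketch; it assumes Python 3.11 or later, which includes the standard-library tomllib module.

    # Load connections.toml and confirm that the spark-connect profile exists.
    import pathlib
    import tomllib

    path = pathlib.Path.home() / ".snowflake" / "connections.toml"
    with open(path, "rb") as f:
        connections = tomllib.load(f)

    print(sorted(connections["spark-connect"].keys()))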

Install Snowpark Connect for Spark

You can install Snowpark Connect for Spark as a Python package.

  1. Create a Python virtual environment.

    Confirm that your Python version is 3.10 or later and earlier than 3.13 by running python3 --version.

    python3 -m venv .venv
    source .venv/bin/activate
    
  2. Install the Snowpark Connect for Spark package.

    pip install --upgrade --force-reinstall 'snowpark-connect[jdk]'
    
  3. Add Python code to start a Snowpark Connect for Spark server and create a Snowpark Connect for Spark session.

    import snowflake.snowpark_connect
    
    # Import snowpark_connect *before* importing pyspark libraries
    from pyspark.sql.types import Row
    
    spark = snowflake.snowpark_connect.server.init_spark_session()
    
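    To verify the session, you can run a quick smoke test such as the following. This is a minimal sketch; it assumes the spark session above was created successfully.

    # Run a trivial query on Snowflake to confirm that the session works.
    print(spark.sql("SELECT 1 AS one").collect())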

Run Python code from your client

Once you have an authenticated connection in place, you can write code as you normally would.

You can run PySpark code that connects to Snowpark Connect for Spark by using the PySpark client library.

from pyspark.sql import Row

df = spark.createDataFrame([
    Row(a=1, b=2.),
    Row(a=2, b=3.),
    Row(a=4, b=5.),
])

print(df.count())

Run Scala code from your client

You can run Scala applications that connect to Snowpark Connect for Spark by using the Spark Connect client library.

The following steps walk you through setting up Snowpark Connect for Spark and connecting your Scala applications to the Snowpark Connect for Spark server.

Step 1: Set up your Snowpark Connect for Spark environment

Set up your environment by completing the following tasks, which are described in earlier sections:

  1. Create a Python virtual environment and install Snowpark Connect for Spark.

  2. Set up a connection.

Step 2: Create a Snowpark Connect for Spark server script and launch the server

  1. Create a Python script to launch the Snowpark Connect for Spark server.

    # launch-snowpark-connect.py
    
    from snowflake import snowpark_connect
    
    def main():
        snowpark_connect.start_session(is_daemon=False, remote_url="sc://localhost:15002")
        print("SAS started on port 15002")
    
    if __name__ == "__main__":
        main()
    
  2. Launch the Snowpark Connect for Spark server.

    # Make sure you're in the correct Python environment
    pyenv activate your-snowpark-connect-env
    
    # Run the server script
    python launch-snowpark-connect.py
    

Step 3: Set up your Scala application

  1. Add the Spark Connect client dependency to your build.sbt file.

    libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.3"
    
    // Fork a separate JVM for `run` so that the javaOptions below take effect
    fork := true
    
    // Add JVM options for Java 9+ module system compatibility
    javaOptions ++= Seq(
      "--add-opens=java.base/java.nio=ALL-UNNAMED"
    )
    
  2. Execute Scala code to connect to the Snowpark Connect for Spark server.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.connect.client.REPLClassDirMonitor
    
    object SnowparkConnectExample {
      def main(args: Array[String]): Unit = {
        // Create Spark session with Snowpark Connect
        val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
    
        // Register ClassFinder for UDF support (if needed)
        // val classFinder = new REPLClassDirMonitor("target/scala-2.12/classes")
        // spark.registerClassFinder(classFinder)
    
        try {
          // Simple DataFrame operations
          import spark.implicits._
    
          val data = Seq(
            (1, "Alice", 25),
            (2, "Bob", 30),
            (3, "Charlie", 35)
          )
    
          val df = spark.createDataFrame(data).toDF("id", "name", "age")
    
          println("Original DataFrame:")
          df.show()
    
          println("Filtered DataFrame (age > 28):")
          df.filter($"age" > 28).show()
    
          println("Aggregated result:")
          df.groupBy().avg("age").show()
    
        } finally {
          spark.stop()
        }
      }
    }
    
  3. Compile and run your application.

    # Compile your Scala application
    sbt compile
    
    # Run the application
    sbt "runMain SnowparkConnectExample"
    
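    If the connection succeeds, the application prints the original three-row DataFrame, the two rows where age is greater than 28 (Bob and Charlie), and an average age of 30.0.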

Scala UDF support on Snowpark Connect for Spark

When using user-defined functions or custom code, do one of the following:

  • Register a class finder to monitor and upload class files.

    import org.apache.spark.sql.connect.client.REPLClassDirMonitor
    
    val classFinder = new REPLClassDirMonitor("/absolute/path/to/target/scala-2.12/classes")
    spark.registerClassFinder(classFinder)
    
  • Upload JAR dependencies if needed.

    spark.addArtifact("/absolute/path/to/dependency.jar")
    

Troubleshoot Snowpark Connect for Spark installation

Use the following checks to troubleshoot problems with installing and using Snowpark Connect for Spark.

  • Ensure that Java and Python are based on the same architecture.

  • Use the most recent Snowpark Connect for Spark package, as described in Install Snowpark Connect for Spark.

  • Confirm that running your PySpark code with the python command works correctly for local execution, that is, without Snowflake connectivity.

    For example, execute a command such as the following (a minimal example script follows):

    python your_pyspark_file.py
    
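    A minimal your_pyspark_file.py for this check might look like the following. This is a sketch that runs entirely locally and assumes a working local PySpark installation.

    # A local-only PySpark check; no Snowflake connectivity is involved.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.show()
    spark.stop()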

Open source clients

You can use standard, off-the-shelf open source software (OSS) Spark client packages, such as PySpark and the Spark clients for Java or Scala, from your preferred local environments, including Jupyter Notebooks and VS Code. This way, you avoid installing packages specific to Snowflake.

You might find this useful if you want to write Spark code locally and have the code use Snowflake compute resources and enterprise governance. In this scenario, you perform authentication and authorization through programmatic access tokens (PATs).

The following sections cover installation, configuration, and authentication. You’ll also find a simple PySpark example to validate your connection.

Step 1: Install required packages

  • Install pyspark. You don’t need to install any Snowflake packages.

    pip install "pyspark[connect]>=3.5.0,<4"
    

Step 2: Set up authentication

  1. Generate a programmatic access token (PAT).

    For more information, see the Snowflake documentation on programmatic access tokens.

    The following example adds a PAT named TEST_PAT, restricts it to the SYSADMIN role, and sets the expiration to 30 days.

    ALTER USER ADD PAT TEST_PAT ROLE_RESTRICTION = SYSADMIN DAYS_TO_EXPIRY = 30;
    
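    The output of this command includes the token secret. Copy it when the token is created, because Snowflake doesn't display it again.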
  2. Find your Snowflake Spark Connect host URL.

    Run the following SQL in Snowflake to find the hostname for your account:

    SELECT t.VALUE:type::VARCHAR AS type,
           t.VALUE:host::VARCHAR AS host,
           t.VALUE:port AS port
      FROM TABLE(FLATTEN(input => PARSE_JSON(SYSTEM$ALLOWLIST()))) AS t
     WHERE type = 'SNOWPARK_CONNECT';
    

Step 3: Connect to the Spark Connect server

  • To connect to the Spark Connect server, use code such as the following:

    from pyspark.sql import SparkSession
    import urllib.parse
    
    # Replace with your actual PAT.
    pat = urllib.parse.quote("<pat>", safe="")
    
    # Replace with your Snowpark Connect host from the above SQL query.
    snowpark_connect_host = ""
    
    # Define the database, schema, and warehouse for your Spark session in Snowflake (recommended);
    # otherwise, they are resolved from your default_namespace and default_warehouse.
    
    db_name = urllib.parse.quote("TESTDB", safe="")
    schema_name = urllib.parse.quote("TESTSCHEMA", safe="")
    warehouse_name = urllib.parse.quote("TESTWH", safe="")
    
    spark = SparkSession.builder.remote(f"sc://{snowpark_connect_host}/;token={pat};token_type=PAT;database={db_name};schema={schema_name};warehouse={warehouse_name}").getOrCreate()
    
    # Spark session is ready to use. You can write regular Spark DataFrame code, as in the following example:
    
    from pyspark.sql import Row
    
    df = spark.createDataFrame([
        Row(a=1, b=2.),
        Row(a=2, b=3.),
        Row(a=4, b=5.),])
    print(df.count())
    
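    If the connection succeeds, the example prints 3, the number of rows in the DataFrame, which confirms that your PAT, host, and session settings work end to end.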