Develop with a local IDE

You can run Spark workloads interactively from Jupyter Notebooks, VS Code, IntelliJ, or any Python, Java, or Scala interface without managing a Spark cluster. The workloads run on Snowflake infrastructure.

There are two ways to connect:

  • Snowpark Connect package (recommended): Install the snowpark-connect Python package, which is required for all languages (Python, Java, and Scala). For Java and Scala projects, also add the snowpark-connect-java-client Maven dependency. Connections are configured through a TOML connection file. This approach handles server lifecycle, authentication, and session management automatically.

  • Direct endpoint (server-side): Connect to Snowflake’s hosted Spark Connect endpoint using standard PySpark or Spark Java/Scala clients with programmatic access tokens (PATs). No Snowflake-specific packages are required.

Prerequisites

  • You have a Snowflake account with access to Snowpark Connect for Spark.

  • Python 3.10 or later (earlier than 3.13) is installed. Confirm your version by running python3 --version.

  • Ensure that your Java and Python installations use the same CPU architecture. For example, if Python is arm64, install an arm64 build of Java (not x86_64).
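
One quick way to compare the two architectures is a short Python check (a sketch; it assumes java is on your PATH and that your JDK prints the os.arch property with -XshowSettings):

import platform
import subprocess

# Architecture of the current Python interpreter (for example, arm64 or x86_64).
print("python:", platform.machine())

# -XshowSettings:properties prints JVM settings, including os.arch, to stderr.
result = subprocess.run(
    ["java", "-XshowSettings:properties", "-version"],
    capture_output=True,
    text=True,
)
print([line.strip() for line in result.stderr.splitlines() if "os.arch" in line])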

Connection configuration

Snowpark Connect for Spark connects to Snowflake using a TOML connection file. If you have Snowflake CLI installed, you can use it to define a connection, which is saved to a config.toml file. Otherwise, you can manually write connection parameters in a connections.toml file.

Add a connection by using Snowflake CLI

You can use Snowflake CLI to add connection properties that Snowpark Connect for Spark uses to connect to Snowflake. Your changes are saved to a config.toml file.

  1. Run the following command to add a connection:

    snow connection add
    
  2. Follow the prompts to define a connection.

    Specify spark-connect as the connection name.

    This command adds a connection to your config.toml file:

    [connections.spark-connect]
    host = "example.snowflakecomputing.com"
    port = 443
    account = "example"
    user = "test_example"
    password = "password"
    protocol = "https"
    warehouse = "example_wh"
    database = "example_db"
    schema = "public"
    
  3. Confirm the connection works:

    snow connection list
    snow connection test --connection spark-connect
    

Add a connection manually

You can write or update a connections.toml file so that your code can connect to Snowpark Connect for Spark on Snowflake.

  1. Ensure that the file permissions allow only the owner to read and write:

    chmod 0600 ~/.snowflake/connections.toml
    
  2. Edit the file to contain a [spark-connect] connection with your specifics:

    [spark-connect]
    host="my_snowflake_account.snowflakecomputing.com"
    account="my_snowflake_account"
    user="my_user"
    password="&&&&&&&&"
    warehouse="my_wh"
    database="my_db"
    schema="public"
    
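To sanity-check the file before you connect, you can parse it with Python's built-in TOML reader (a minimal sketch; tomllib requires Python 3.11 or later, so on Python 3.10 use the third-party tomli package instead):

from pathlib import Path
import tomllib  # Python 3.11+; on Python 3.10, import tomli as tomllib

config_path = Path.home() / ".snowflake" / "connections.toml"
with open(config_path, "rb") as f:
    connections = tomllib.load(f)

# Print only the key names, not the values, to avoid echoing the password.
print(sorted(connections["spark-connect"].keys()))
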

Install Snowpark Connect for Spark

Create a Python virtual environment and install the Snowpark Connect for Spark package:

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade --force-reinstall 'snowpark-connect[jdk]'
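
Before moving on, you can confirm that the package imports in the activated environment:

# A failing import here points to an installation problem rather than a connection problem.
from snowflake import snowpark_connect
print("snowpark-connect imported successfully")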

Start a session and run code

Once you have Snowpark Connect for Spark installed and an authenticated connection in place, start a session and run Spark code.

Start the Snowpark Connect for Spark server and create a session:

from snowflake import snowpark_connect
spark = snowpark_connect.init_spark_session()

Then run Spark DataFrame code:

from pyspark.sql import Row

df = spark.createDataFrame([
    Row(id=1, name="Alice", age=25),
    Row(id=2, name="Bob", age=30),
    Row(id=3, name="Charlie", age=35),
])

df.show()
df.filter(df.age > 28).show()
print(df.count())
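
Spark SQL works through the same session. For example, you can register the DataFrame as a temporary view and query it with SQL (a short sketch; it assumes temporary views are supported in your Snowpark Connect for Spark version):

# Register the DataFrame as a temporary view and query it with Spark SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 28").show()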

Common installation issues

Use the following checks to resolve common Snowpark Connect for Spark installation issues.

  • Ensure that Java and Python are based on the same architecture.

  • Use the most recent Snowpark Connect for Spark package, as described in Install Snowpark Connect for Spark.

  • Confirm that your PySpark code runs locally with the python command, without Snowflake connectivity.

    For example, execute a command such as the following (a minimal test file is sketched after this list):

    python your_pyspark_file.py
    
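For that check, a minimal your_pyspark_file.py could look like the following (a local-only sketch; it assumes a plain pyspark installation and runs entirely on your machine, without Snowflake):

# your_pyspark_file.py: a PySpark smoke test with no Snowflake connectivity.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
spark.createDataFrame([(1, "ok")], ["id", "status"]).show()
spark.stop()
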

Connect directly to Snowflake’s Spark Connect endpoint

You can connect to Snowflake’s hosted Spark Connect endpoint using standard, off-the-shelf Spark client packages such as PySpark or Spark clients for Java and Scala. You don’t need to install any Snowflake-specific packages.

With this approach, all Spark processing runs on Snowflake’s infrastructure. Your client sends Spark Connect protocol messages directly to Snowflake, which executes the workload and returns results. Authentication uses programmatic access tokens (PATs).

This option is useful when you want to:

  • Avoid installing Snowflake-specific packages in your environment.

  • Use your existing Spark tooling (Jupyter, VS Code, terminals) with Snowflake compute and governance.

  • Simplify dependency management by relying only on the standard PySpark package.

Step 1: Install required packages

Install the Spark Connect client for your language. You don’t need to install any Snowflake packages.

pip install "pyspark[connect]>=3.5.0,<4"

Step 2: Set up authentication

  1. Generate a programmatic access token (PAT).

    The following example adds a PAT named TEST_PAT for the user my_user, restricts it to the SYSADMIN role, and sets the expiration to 30 days.

    ALTER USER my_user ADD PROGRAMMATIC ACCESS TOKEN TEST_PAT
      ROLE_RESTRICTION = 'SYSADMIN'
      DAYS_TO_EXPIRY = 30;
    
  2. Find your Snowflake Spark Connect host URL.

    Run the following SQL in Snowflake to find the hostname for your account:

    SELECT t.VALUE:type::VARCHAR AS type,
           t.VALUE:host::VARCHAR AS host,
           t.VALUE:port AS port
      FROM TABLE(FLATTEN(input => PARSE_JSON(SYSTEM$ALLOWLIST()))) AS t
     WHERE t.VALUE:type = 'SNOWPARK_CONNECT';
    

Step 3: Connect and run Spark code

Connect to the Snowflake Spark Connect endpoint using the host URL and PAT from the previous steps.

from pyspark.sql import SparkSession
import urllib.parse

# Replace with your actual PAT.
pat = urllib.parse.quote("<pat>", safe="")

# Replace with your Snowpark Connect host from the SQL query in the previous step.
snowpark_connect_host = "<snowpark_connect_host>"

# Define database/schema/warehouse for your Spark session (recommended).
# Otherwise, values are resolved from your default_namespace and default_warehouse.
db_name = urllib.parse.quote("TESTDB", safe="")
schema_name = urllib.parse.quote("TESTSCHEMA", safe="")
warehouse_name = urllib.parse.quote("TESTWH", safe="")

spark = SparkSession.builder.remote(
    f"sc://{snowpark_connect_host}/;token={pat};token_type=PAT"
    f";database={db_name};schema={schema_name};warehouse={warehouse_name}"
).getOrCreate()

Once connected, you can write regular Spark DataFrame code:

from pyspark.sql import Row

df = spark.createDataFrame([
    Row(id=1, name="Alice", age=25),
    Row(id=2, name="Bob", age=30),
    Row(id=3, name="Charlie", age=35),
])

df.show()
df.filter(df.age > 28).show()
print(df.count())
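
When you're finished, close the session as you would with any Spark client:

# Releases the session when your workload is done.
spark.stop()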