Run Spark workloads from Jupyter Notebooks¶
You can run Spark workloads interactively from Jupyter Notebooks, VS Code, or any Python-based interface without needing to manage a Spark cluster. The workloads run on the Snowflake infrastructure.
To get started, complete the following tasks:
Confirm that you have prerequisites.
Set up your environment to connect with Snowpark Connect for Spark on Snowflake.
Install Snowpark Connect for Spark.
Run PySpark code from your client to run on Snowflake.
Prerequisites¶
Confirm that your Python and Java installations are based on the same computer architecture. For example, if Python is built for arm64, Java must also be built for arm64 (not x86_64).
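One way to compare the two architectures is sketched below. It is an illustration, not an official check. It assumes that a `java` executable is on your PATH and that the JVM reports its architecture in the `os.arch` property printed by `java -XshowSettings:properties -version`. The helper names and the alias table are hypothetical. They exist because Python and Java often spell the same architecture differently (for example, `arm64` and `aarch64`).

```python
import platform
import re
import subprocess
from typing import Optional

# Assumed alias table: different tools spell the same architecture differently.
_ARCH_ALIASES = {
    "arm64": "arm64",
    "aarch64": "arm64",
    "x86_64": "x86_64",
    "amd64": "x86_64",
}


def normalize_arch(name: str) -> str:
    """Return a canonical architecture name, or the lowercased input if unknown."""
    key = name.strip().lower()
    return _ARCH_ALIASES.get(key, key)


def parse_java_arch(settings_output: str) -> Optional[str]:
    """Extract the os.arch value from the JVM property listing, if present."""
    match = re.search(r"^\s*os\.arch\s*=\s*(\S+)", settings_output, re.MULTILINE)
    return match.group(1) if match else None


def java_arch() -> Optional[str]:
    """Ask the `java` on PATH for its architecture; None if it cannot be found."""
    try:
        result = subprocess.run(
            ["java", "-XshowSettings:properties", "-version"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        return None
    # Search both streams, because the listing may be written to stderr.
    return parse_java_arch(result.stderr + result.stdout)


if __name__ == "__main__":
    python_arch = normalize_arch(platform.machine())
    raw_java_arch = java_arch()
    if raw_java_arch is None:
        print("Could not determine the Java architecture.")
    elif normalize_arch(raw_java_arch) == python_arch:
        print(f"Python and Java both report {python_arch}.")
    else:
        print(f"Mismatch: Python is {python_arch}, Java is {raw_java_arch}.")
```

If the script reports a mismatch, install a Java build that matches your Python architecture (or the reverse) before you continue.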
Set up your environment¶
Set up your development environment so that your code can connect to Snowpark Connect for Spark on Snowflake. To connect to Snowflake, client code uses a .toml file containing connection details.
If you have Snowflake CLI installed, you can use it to define a connection. Otherwise, you can manually write connection parameters in a config.toml file.
Add a connection by using Snowflake CLI¶
You can use Snowflake CLI to add connection properties that Snowpark Connect for Spark can use to connect to Snowflake. Your changes are saved to a config.toml file.
Run the following command to add a connection using the snow connection add command.
snow connection add
This command adds a connection to your config.toml file, as in the following example:
[connections.spark-connect]
host = "example.snowflakecomputing.com"
port = 443
account = "example"
user = "test_example"
password = "password"
protocol = "https"
warehouse = "example_wh"
database = "example_db"
schema = "public"
Run the following commands to confirm that the connection works.
You can test the connection in this way when you’ve added it by using Snowflake CLI.
snow connection list
snow connection test --connection spark-connect
Add a connection by manually writing a connection file¶
You can manually write or update a connections.toml file so that your code can connect to Snowpark Connect for Spark on Snowflake.
Run the following command to ensure that only the owner (user) has read and write access to your connections.toml file. The path is left unquoted so that the shell expands the leading ~ to your home directory.
chmod 0600 ~/.snowflake/connections.toml
Edit the connections.toml file so that it contains a [spark-connect] connection with the connection properties in the following example. Be sure to replace values with your own connection specifics.
[spark-connect]
host="my_snowflake_account.snowflakecomputing.com"
account="my_snowflake_account"
user="my_user"
password="&&&&&&&&"
warehouse="my_wh"
database="my_db"
schema="public"
Install Snowpark Connect for Spark¶
You can install Snowpark Connect for Spark as a Python package.
Create a Python virtual environment.
For example, you can use Conda:
conda create -n xxxx pip python=3.12
conda activate xxxx
Install the Snowpark Connect for Spark package.
pip install --upgrade --force-reinstall snowpark-connect
Add Python code to start a Snowpark Connect for Spark server and create a Snowpark Connect for Spark session.
import os
from snowflake import snowpark_connect
from pyspark.sql import SparkSession

os.environ["SPARK_CONNECT_MODE_ENABLED"] = "1"
snowpark_connect.start_session()  # Start the local Snowpark Connect for Spark session
spark = snowpark_connect.get_session()
Run PySpark code from your client¶
Once you have an authenticated connection in place, you can write PySpark code in a Jupyter Notebook as you normally would.
For example, you can run the following code, which creates a small DataFrame and prints its row count:
from pyspark.sql import Row
df = spark.createDataFrame([
Row(a=1, b=2.),
Row(a=2, b=3.),
Row(a=4, b=5.),
])
print(df.count())
Troubleshoot Snowpark Connect for Spark installation¶
With the following list of checks, you can troubleshoot Snowpark Connect installation and use.
Ensure that Java and Python are based on the same architecture.
Use the most recent Snowpark Connect for Spark package .whl file.
Confirm that the python command runs your PySpark code correctly for local execution, that is, without Snowflake connectivity.
For example, run a command such as the following:
python your_pyspark_file.py
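As a further check, you can confirm that the packages imported earlier in this topic are visible to the Python interpreter you are using. The following is a minimal sketch. The helper name `is_importable` is hypothetical, and the module names are taken from the import statements shown above. Note that checking a dotted name such as `snowflake.snowpark_connect` imports its parent package as a side effect.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located by the current interpreter."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a module has no spec.
        return False


if __name__ == "__main__":
    # Module names as used in the import statements earlier in this topic.
    for name in ("pyspark", "snowflake.snowpark_connect"):
        status = "found" if is_importable(name) else "NOT found"
        print(f"{name}: {status}")
```

If a module is reported as not found, make sure that the virtual environment where you installed the package is the one that is active.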