Run Spark workloads from Snowflake Notebooks¶
You can run Spark workloads interactively from Snowflake Notebooks without needing to manage a Spark cluster. The workloads run on the Snowflake infrastructure.
To use Snowflake Notebooks as a client for developing Spark workloads to run on Snowflake:
Launch Snowflake Notebooks.
Within the notebook, start a Spark session.
Write PySpark code to load, transform, and analyze data, such as code that filters high-value customer orders or aggregates revenue (see the sketch after this list).
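For example, once a Spark session is available (as described in the steps later in this topic), the PySpark code might look like the following sketch. The database, table, and column names here are placeholders for illustration only, not objects that exist in your account:

from snowflake import snowpark_connect
from pyspark.sql import functions as F

# Assumes a Snowpark Connect for Spark session has already been started,
# as shown in the setup steps later in this topic.
spark = snowpark_connect.get_session()

# Placeholder table and column names for illustration only.
orders = spark.table("sales_db.public.orders")
high_value = orders.filter(F.col("order_total") > 1000)
revenue_by_customer = (
    high_value.groupBy("customer_id")
    .agg(F.sum("order_total").alias("total_revenue"))
)
revenue_by_customer.show()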
Use a Snowflake Notebook that runs on a warehouse¶
For more information about Snowflake Notebooks, see Create a notebook.
Create a Snowflake Notebook by completing the following steps:
Sign in to Snowsight.
From the navigation menu, select + Create » Notebook » New Notebook.
In the Create notebook dialog, enter a name, database, and schema for the new notebook.
For more information, see Create a notebook.
For Runtime, select Run on warehouse.
For Runtime version, select Snowflake Warehouse Runtime 2.0.
Selecting version 2.0 ensures that you have the required dependency support, including Python 3.10. For more information, see Notebook runtimes.
For Query warehouse and Notebook warehouse, select the warehouses for running SQL queries and for running the notebook kernel and Python code, as described in Create a notebook.
Select Create.
In the notebook you created, under Packages, ensure that you have the following packages listed to support code in your notebook:
Python, version 3.10 or later
snowflake-dataframe-processor, latest version
If you need to add these packages, use the following steps:
Under Anaconda Packages, type the package name in the search box.
Select the package name.
Select Save.
To connect to the Snowpark Connect for Spark server and test the connection, copy the following code and paste it in the Python cell of the notebook you created:
# Set up the env for Java libraries and enable Spark Connect mode
import os

os.environ['JAVA_HOME'] = os.environ["CONDA_PREFIX"]
os.environ['JAVA_LD_LIBRARY_PATH'] = os.path.join(os.environ["CONDA_PREFIX"], 'lib', 'server')
os.environ["SPARK_LOCAL_HOSTNAME"] = "127.0.0.1"
os.environ["SPARK_CONNECT_MODE_ENABLED"] = "1"

from snowflake import snowpark_connect
from snowflake.snowpark.context import get_active_session
import traceback

session = get_active_session()
snowpark_connect.start_session(snowpark_session=session)
To add a new cell for Python code, hover over the cell that has the code you just pasted, and then select + Python.
To run the code that uses Snowpark Connect for Spark, copy the following code, and then paste it into the new Python cell that you added.
# Here is your normal PySpark code. You can also put it in other Python cells.
spark = snowpark_connect.get_session()
df = spark.sql("show schemas").limit(10)
df.show()
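To verify that ordinary DataFrame transformations work in the same session, you can run an illustrative sketch like the following (not part of the required setup). It builds a small in-memory DataFrame with made-up data and aggregates revenue per customer:

from pyspark.sql import functions as F

# Small in-memory DataFrame with made-up values for illustration;
# replace this with your own tables in practice.
orders = spark.createDataFrame(
    [("c1", 1200.0), ("c1", 300.0), ("c2", 2500.0)],
    ["customer_id", "order_total"],
)
revenue_by_customer = (
    orders.filter(F.col("order_total") > 1000)
    .groupBy("customer_id")
    .agg(F.sum("order_total").alias("total_revenue"))
)
revenue_by_customer.show()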