Using the Snowpark Python JDBC¶
With the Snowpark Python JDBC, you can connect to external databases through their JDBC drivers and programmatically pull data into Snowflake tables, then transform it using Snowpark DataFrames for advanced analytics.
The Snowpark Python JDBC works much like the Spark JDBC API; most parameters are designed to be identical or similar for better parity. The following table describes the Snowpark Python JDBC parameters:
Snowpark JDBC parameters¶
| Parameter | Description |
|---|---|
| `url` | Connection string used to connect to the external data source via the JDBC driver |
| `udtf_configs` | Dictionary containing the configurations needed to create the UDTF |
| `properties` | Dictionary of key-value pairs needed to establish the JDBC connection |
| `table` | Table in the source database |
| `query` | SQL query wrapped as a subquery for reading data |
| `column` | Partitioning column for parallel reads |
| `lower_bound` | Lower bound for partitioning |
| `upper_bound` | Upper bound for partitioning |
| `num_partitions` | Number of partitions for parallelism |
| `query_timeout` | Timeout duration, in seconds, for SQL execution |
| `fetch_size` | Number of rows fetched per round trip |
| `custom_schema` | Custom schema for pulling data from external databases |
| `predicates` | List of conditions for `WHERE` clause partitions |
| `session_init_statement` | SQL or PL/SQL statement executed on session initialization |
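To show how these parameters fit together, the sketch below builds an options dictionary using the parameter names from the table. Only the parameter names come from this document; the URL, table, and bound values are hypothetical placeholders:

```python
# Hypothetical option values for a Snowpark JDBC read. The parameter names
# match the table above; the URL, table, and numbers are placeholders.
jdbc_options = {
    "url": "jdbc:mysql://mydb.example.com:3306/sales",  # placeholder JDBC URL
    "properties": {"user": "db_user"},                  # connection properties
    "table": "orders",                                  # source table
    "column": "order_id",                               # numeric partitioning column
    "lower_bound": 0,                                   # partition range start
    "upper_bound": 1_000_000,                           # partition range end
    "num_partitions": 8,                                # parallel partitions
    "fetch_size": 10_000,                               # rows per round trip
}

# The four partitioning parameters must always be set together.
partition_keys = {"column", "lower_bound", "upper_bound", "num_partitions"}
assert partition_keys <= jdbc_options.keys()
```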
Understanding parallelism¶
The Snowpark Python JDBC currently has one underlying ingestion mechanism:

- UDTF ingestion

  All workloads run on the Snowflake server. Snowpark creates a Java UDTF and invokes it in parallel to ingest data into a Snowflake temporary table; the `udtf_configs` parameter is therefore required.
The Snowpark Python JDBC has two ways to parallelize and accelerate ingestion:
- Partition column

  This method divides the source data into partitions based on four parameters passed to `jdbc()`: `column`, `lower_bound`, `upper_bound`, and `num_partitions`. All four parameters must be set together, and `column` must be a numeric or date type.

- Predicates

  This method divides the source data into partitions based on the `predicates` parameter, a list of expressions suitable for inclusion in `WHERE` clauses, where each expression defines one partition. Predicates provide a more flexible way of dividing partitions; for example, you can partition on boolean or non-numeric columns.
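For example, a `predicates` list that partitions a source table on a non-numeric column might look like the following; the column name and values are illustrative:

```python
# Each expression defines one partition; together they should cover every
# row exactly once. The column name and values here are illustrative.
predicates = [
    "region = 'NA'",
    "region = 'EMEA'",
    "region = 'APAC'",
    "region NOT IN ('NA', 'EMEA', 'APAC') OR region IS NULL",  # catch-all partition
]

# One partition is created per predicate expression.
num_partitions = len(predicates)
```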
The Snowpark Python JDBC also allows adjusting the parallelism level within a partition:

- `fetch_size`

  Within a partition, the API fetches rows in chunks of `fetch_size` rows. These rows are written to Snowflake in parallel as they are fetched, which allows reading and writing to overlap and maximizes throughput.
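To see how `fetch_size` relates to round trips, assume a partition holds one million rows; the arithmetic below (with illustrative numbers) counts the fetches needed:

```python
import math

rows_in_partition = 1_000_000  # illustrative partition size
fetch_size = 10_000            # rows fetched per round trip

# Number of round trips to the source database for this partition.
round_trips = math.ceil(rows_in_partition / fetch_size)
print(round_trips)  # 100
```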
Using JDBC to ingest data from an external data source¶
Using JDBC to ingest data from a Snowpark client¶
1. Upload the JDBC driver JAR file to a Snowflake stage using Snowpark or Snowsight.

   - Upload using Snowpark. In Snowpark, after creating a session, run the following code:
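A minimal sketch of the upload, assuming placeholder connection parameters, a hypothetical stage name, and a hypothetical local driver path:

```python
from snowflake.snowpark import Session

# Connection parameters are placeholders; substitute your own account details.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
}).create()

# Upload the driver JAR to a stage; auto_compress=False keeps the .jar usable.
session.file.put(
    "/path/to/mysql-connector-j.jar",  # local driver path (placeholder)
    "@my_stage",                       # target stage (placeholder)
    auto_compress=False,
)
```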
   - Upload using Snowsight, as described in the following steps:

     1. In Snowsight, click Catalog -> Database Explorer.
     2. In the left search bar of databases, click [your database name] -> [your schema name] -> Stages -> [your stage name].
     3. Click the +File button in the top-right corner of the stage page.
2. Configure the secret, network rule, and external access integration.
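These objects are created with SQL; a hedged sketch follows, in which every object name, the host, and the credentials are placeholders:

```sql
-- All object names, the host, and credentials below are placeholders.
CREATE OR REPLACE SECRET my_jdbc_secret
  TYPE = PASSWORD
  USERNAME = 'db_user'
  PASSWORD = '<password>';

CREATE OR REPLACE NETWORK RULE my_db_network_rule
  MODE = EGRESS
  TYPE = HOST_PORT
  VALUE_LIST = ('mydb.example.com:3306');

CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION my_jdbc_access_integration
  ALLOWED_NETWORK_RULES = (my_db_network_rule)
  ALLOWED_AUTHENTICATION_SECRETS = (my_jdbc_secret)
  ENABLED = TRUE;
```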
3. Pull data from the target using Snowpark JDBC from a Snowpark client.
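A hedged sketch of the read step, assuming `session.read.jdbc()` is the reader entry point; the `udtf_configs` keys and all object names below are hypothetical:

```python
# Assumes `session` is an existing Snowpark Session and that the driver JAR,
# secret, network rule, and external access integration were created earlier.
# The udtf_configs keys and all object names are hypothetical.
df = session.read.jdbc(
    url="jdbc:mysql://mydb.example.com:3306/sales",  # placeholder JDBC URL
    udtf_configs={
        "external_access_integration": "my_jdbc_access_integration",
        "secret": "my_jdbc_secret",
        "imports": ["@my_stage/mysql-connector-j.jar"],
    },
    table="orders",
)
df.show()
```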
Using JDBC to ingest data from a stored procedure¶
1. Upload the JDBC driver JAR file to a Snowflake stage using Snowsight:
   1. In Snowsight, click Catalog -> Database Explorer.
   2. In the left search bar of databases, click [your database name] -> [your schema name] -> Stages -> [your stage name].
   3. Click the +File button in the top-right corner of the stage page.
2. Configure the secret, network rule, and external access integration.
3. Pull data from the target using Snowpark JDBC from a stored procedure.
Using JDBC to ingest data from a Snowflake notebook¶
1. Upload the JDBC driver JAR file to a Snowflake stage using Snowsight:
   1. In Snowsight, click Catalog -> Database Explorer.
   2. In the left search bar of databases, click [your database name] -> [your schema name] -> Stages -> [your stage name].
   3. Click the +File button in the top-right corner of the stage page.
2. Configure the secret, network rule, and external access integration.
3. Pull data from the target using Snowpark JDBC from a Snowflake notebook.
Source tracing¶
Source tracing when using Snowpark JDBC to connect to MySQL¶
1. Include a Snowpark tag in your create connection function.
2. Run the following SQL in your data source to capture queries from Snowpark that are still live:
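A hedged example using MySQL's `information_schema.processlist` view, assuming the Snowpark tag appears in the running statement text:

```sql
-- Shows live statements whose text contains the Snowpark tag.
SELECT id, user, host, time, info
FROM information_schema.processlist
WHERE info LIKE '%Snowpark%';
```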
Source tracing when using Snowpark JDBC to connect to SQL Server¶
1. Include a Snowpark tag in your create connection function.
2. Run the following SQL in your data source to capture queries from Snowpark that are still live:
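A hedged example using SQL Server's dynamic management views, assuming the Snowpark tag appears in the executing statement text:

```sql
-- Shows currently executing requests whose SQL text contains the Snowpark tag.
SELECT r.session_id, r.status, t.text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE t.text LIKE '%Snowpark%';
```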
Source tracing when using Snowpark JDBC to connect to PostgreSQL¶
1. Include a Snowpark tag in your create connection function.
2. Run the following SQL in your data source to capture queries from Snowpark that are still live:
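A hedged example using PostgreSQL's `pg_stat_activity` view, assuming the Snowpark tag appears in the query text:

```sql
-- Shows active sessions whose query text contains the Snowpark tag.
SELECT pid, state, query
FROM pg_stat_activity
WHERE state = 'active' AND query LIKE '%Snowpark%';
```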
Source tracing when using Snowpark JDBC to connect to Oracle¶
1. Include a Snowpark tag in your create connection function.
2. Run the following SQL in your data source to capture queries from Snowpark that are still live:
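A hedged example using Oracle's `v$session` and `v$sql` views, assuming the Snowpark tag appears in the SQL text:

```sql
-- Shows active sessions joined to SQL text containing the Snowpark tag.
SELECT s.sid, s.username, q.sql_text
FROM v$session s
JOIN v$sql q ON s.sql_id = q.sql_id
WHERE q.sql_text LIKE '%Snowpark%';
```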
Common DBMS and Type Support¶
The following lists the certified data types for each supported DBMS. If your source data includes other data types, the Snowpark Python JDBC maps them to the closest Snowflake data types on a best-effort basis, or falls back to strings.
Oracle¶
- INTEGER
- NUMBER
- BINARY_FLOAT
- BINARY_DOUBLE
- VARCHAR2
- CHAR
- CLOB
- NCHAR
- NVARCHAR2
- NCLOB
- DATE
- TIMESTAMP
- TIMESTAMP WITH TIME ZONE
- TIMESTAMP WITH LOCAL TIME ZONE
- RAW
PostgreSQL¶
- BIGINT
- BIGSERIAL
- BIT
- BIT VARYING
- BOOLEAN
- BOX
- BYTEA
- CHAR
- VARCHAR
- CIDR
- CIRCLE
- DATE
- DOUBLE PRECISION
- INET
- INTEGER
- INTERVAL
- JSON
- JSONB
- LINE
- LSEG
- MACADDR
- POINT
- POLYGON
- REAL
- SMALLINT
- SMALLSERIAL
- SERIAL
- TEXT
- TIME
- TIMESTAMP
- TIMESTAMPTZ
- TSQUERY
- TSVECTOR
- TXID_SNAPSHOT
- UUID
- XML
MySQL¶
- INT
- DECIMAL
- TINYINT
- SMALLINT
- MEDIUMINT
- BIGINT
- YEAR
- FLOAT
- DOUBLE
- CHAR
- VARCHAR
- TINYTEXT
- TEXT
- MEDIUMTEXT
- LONGTEXT
- ENUM
- SET
- BIT
- BINARY
- VARBINARY
- TINYBLOB
- BLOB
- MEDIUMBLOB
- LONGBLOB
- DATE
- DATETIME
- TIMESTAMP
- TIME
- JSON
SQL Server¶
- INT
- BIGINT
- SMALLINT
- TINYINT
- BIT
- DECIMAL
- NUMERIC
- MONEY
- SMALLMONEY
- FLOAT
- REAL
- DATE
- TIME
- DATETIME
- DATETIME2
- SMALLDATETIME
- CHAR
- VARCHAR
- VARCHAR(MAX)
- TEXT
- NCHAR
- NVARCHAR
- NVARCHAR(MAX)
- NTEXT
- BINARY
- VARBINARY
- VARBINARY(MAX)
- IMAGE
- UNIQUEIDENTIFIER
- TIMESTAMP
Databricks¶
Connecting to Databricks using Snowpark Python JDBC is currently not supported.