Using the Snowpark Python DB-API

Snowpark Python DB-APIを使用すると、Snowpark Pythonユーザーは、外部データベースからSnowflakeにデータをプログラムでプルできます。説明には次が含まれます。

  • Python DB-API support: Connect to external databases using Python's standard DB-API 2.0 drivers.

  • 合理化された設定: pip を使用して必要なドライバーをインストールするため、追加の依存関係を管理する必要がありません。

これらの APIsのデータをシームレスに Snowpark DataFrames テーブルに取り込み、Snowflakeを使用して高度な分析のための変換することができます。

The DB-API can be used in a similar way as the Spark JDBC API. Most parameters are designed to be identical or similar for better parity. At the same time, Snowpark emphasizes a Python-first design with intuitive naming conventions that avoid JDBC-specific configurations. This provides Python developers with a familiar experience. For more information that compares the Snowpark Python DB-API with the Spark JDBC API, see the following table:

DB-APIパラメーター:

パラメーター

Snowpark Python DB-API

create_connection

PythonをDB-API 接続を作成する関数 。

table

ソースデータベースのテーブルを指定します。

query

SQL クエリは、データを読み取るためのサブクエリとしてラップされました。

column

並列読み取りのパーティション列。

lower_bound

パーティション分割の下限。

upper_bound

パーティション分割の上限。

num_partitions

並列処理のパーティションの数。

query_timeout

SQL 実行のタイムアウト(秒単位)。

fetch_size

ラウンドトリップごとに取得された行数。

custom_schema

外部データベースからデータをプルするためのカスタムスキーマ。

max_workers

外部データベースからのデータの並行フェッチおよびプルのワーカー数。

predicates

WHERE 句のパーティションの条件リスト。

session_init_statement

セッション初期化時の SQL または PL/SQL ステートメントを実行します。

udtf_configs

Executes the workload using a Snowflake UDTF for better performance.

fetch_merge_count

Number of fetched batches to be merged into a single Parquet file before it is uploaded.

並列処理の理解

Snowpark Python DB-API has two forms of ingestion mechanism underlying.

ローカルインジェスチョン

ローカルインジェスチョンでは、Snowparkはまず外部ソースからローカル環境にデータをフェッチします。そこで dbapi() 関数が呼び出され、Parquetファイルに変換されます。次に、SnowparkはこれらのParquetファイルを仮のSnowflakeステージにアップロードし、それらをステージから仮テーブルにコピーします。

UDTF インジェスチョン

UDTF インジェスチョンでは、すべてのワークロードがSnowflakeサーバー上で実行されます。Snowparkは最初に UDTF を作成し、実行します。 UDTF はデータをSnowflakeに直接インジェストし、仮テーブルに格納します。

Snowpark Python DB-API also has two ways to parallelize and accelerate ingestion.

パーティション列

このメソッドは、ユーザーが dbapi() を呼び出すと、4つのパラメーターに基づいてソースデータを多数のパーティションに分割します。

  • column

  • lower_bound

  • upper_bound

  • num_partitions

これらの4つのパラメーターは同時に設定する必要があります。 column は、数値または日付型である必要があります。

Predicates

このメソッドは、パラメーター述語に基づいてソースデータをパーティションに分割します。パラメーター述語とは WHERE 句に含まれるのに適した式のリストで、各式はパーティションを定義します。述語は、より柔軟にパーティションを分割できます。たとえば、ブール値または非数値の列でパーティションを分割できます。

Snowpark Python DB-API は、パーティション内の並列レベルを調整することもできます。

Fetch_size

Within a partition, the API fetches rows in chunks defined by fetch_size. These rows are written to Snowflake in parallel as they are fetched, allowing reading and writing to overlap and maximize throughput.

上記のインジェスチョンの方法と並列処理を組み合わせることで、Snowflakeには4つのインジェスチョンの方法があります。

  • パーティション列を使用したローカルインジェスチョン

    df_local_par_column = session.read.dbapi(
        create_connection,
        table="target_table",
        fetch_size=100000,
        num_partitions=4,
        column="ID",  # Swap with the column you want your partition based on
        upper_bound=10000,
        lower_bound=0
    )
    
    Copy
  • 述語を使用したローカルインジェスチョン

    df_local_predicates = session.read.dbapi(
        create_connection,
        table="target_table",
        fetch_size=100000,
        predicates=[
            "ID < 3",
            "ID >= 3"
        ]
    )
    
    Copy
  • パーティション列を使用した UDTF インジェスチョン

    udtf_configs = {
        "external_access_integration": "<your external access integration>"
    }
    df_udtf_par_column = session.read.dbapi(
        create_connection,
        table="target_table",
        udtf_configs=udtf_configs,
        fetch_size=100000,
        num_partitions=4,
        column="ID",  # Swap with the column you want your partition based on
        upper_bound=10000,
        lower_bound=0
    )
    
    Copy
  • 述語を使用した UDTF インジェスチョン

    udtf_configs = {
        "external_access_integration": "<your external access integration>"
    }
    
    df_udtf_predicates = session.read.dbapi(
        create_dbx_connection,
        table="target_table",
        udtf_configs=udtf_configs,
        fetch_size=100000,
        predicates=[
            "ID < 3",
            "ID >= 3"
        ]
    )
    
    Copy

SQL server

To connect to SQL Server from Snowpark, you need the following three packages:

The following code examples show how to connect to SQL Server from a Snowpark client and a stored procedure.

Use the DB-API to connect to SQL Server from a Snowpark client

  1. Install the Python SQL Driver:

    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
    brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release
    brew update
    HOMEBREW_ACCEPT_EULA=Y brew install msodbcsql18 mssql-tools18
    
    Copy
  2. Install snowflake-snowpark-python[pandas] and pyodbc:

    pip install snowflake-snowpark-python[pandas]
    pip install pyodbc
    
    Copy
  3. Define the factory method for creating a connection to SQL Server:

    def create_sql_server_connection():
        import pyodbc
        SERVER = "<your host name>"
        PORT = <your port>
        UID = "<your user name>"
        PWD = "<your password>"
        DATABASE = "<your database name>"
        connection_str = (
            f"DRIVER={{ODBC Driver 18 for SQL Server}};"
            f"SERVER={SERVER}:{PORT};"
            f"UID={UID};"
            f"PWD={PWD};"
            f"DATABASE={DATABASE};"
            "TrustServerCertificate=yes"
            "Encrypt=yes"
            # Optional to identify source of queries
            "APP=snowflake-snowpark-python;"
        )
        connection = pyodbc.connect(connection_str)
        return connection
    
    # Feel free to combine local/udtf ingestion and partition column/predicates as
    # stated in the understanding parallelism section
    
    # Call dbapi to pull data from target table
    
    df = session.read.dbapi(
        create_sql_server_connection,
        table="target_table"
    )
    
    # Call dbapi to pull data from target query
    
    df_query = session.read.dbapi(
        create_sql_server_connection,
        query="select * from target_table"
    )
    
    # Pull data from target table with parallelism using partition column
    
    df_local_par_column = session.read.dbapi(
        create_sql_server_connection,
        table="target_table",
        fetch_size=100000,
        num_partitions=4,
        column="ID",  # Swap with the column you want your partition based on
        upper_bound=10000,
        lower_bound=0
    )
    
    udtf_configs = {
        "external_access_integration": "<your external access integration>"
    }
    
    # Pull data from target table with udtf ingestion with parallelism using predicates
    
    df_udtf_predicates = session.read.dbapi(
        create_sql_server_connection,
        table="target_table",
        udtf_configs=udtf_configs,
        fetch_size=100000,
        predicates=[
            "ID < 3",
            "ID >= 3"
        ]
    )
    
    Copy

DB-API を使用して、ストアドプロシージャから SQL サーバーに接続します

  1. Configure an external access integration (EAI), which is required to allow Snowflake to connect to the source endpoint.

    注釈

    PrivateLink is recommended for secure data transfer, especially when you're dealing with sensitive information. Ensure that your Snowflake account has the necessary PrivateLink privileges enabled and that the PrivateLink feature is configured and active in your Snowflake Notebook environment.

  2. Configure the secret, a network rule to allow egress to the source endpoint, and EAI:

    -- Configure a secret to allow egress to the source endpoint
    
    CREATE OR REPLACE SECRET mssql_secret
    TYPE = PASSWORD
    USERNAME = 'mssql_username'
    PASSWORD = 'mssql_password';
    
    -- Configure a network rule to allow egress to the source endpoint
    
    CREATE OR REPLACE NETWORK RULE mssql_network_rule
    MODE = EGRESS
    TYPE = HOST_PORT
    VALUE_LIST = ('mssql_host:mssql_port');
    
    -- Configure an external access integration
    
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION mssql_access_integration
    ALLOWED_NETWORK_RULES = (mssql_network_rule)
    ALLOWED_AUTHENTICATION_SECRETS = (mssql_secret)
    ENABLED = true;
    
    Copy
  3. Use the DB-API to pull data from SQL Server in a Python stored procedure:

    CREATE OR REPLACE PROCEDURE sp_mssql_dbapi()
        RETURNS TABLE()
        LANGUAGE PYTHON
        RUNTIME_VERSION='3.11'
        HANDLER='run'
        PACKAGES=('snowflake-snowpark-python', 'pyodbc', 'msodbcsql')
        EXTERNAL_ACCESS_INTEGRATIONS = (mssql_access_integration)
        SECRETS = ('cred' = mssql_secret )
    AS $$
    
    # Get user name and password from mssql_secret
    
    import _snowflake
    username_password_object = _snowflake.get_username_password('cred')
    USER = username_password_object.username
    PASSWORD = username_password_object.password
    
    # Define a method to connect to SQL server_hostname
    from snowflake.snowpark import Session
    def create_sql_server_connection():
        import pyodbc
    
        host = "<your host>"
        port = <your port>
        username = USER
        password = PASSWORD
        database = "<your database name>"
        connection_str = (
            f"DRIVER={{ODBC Driver 18 for SQL Server}};"
            f"SERVER={host},{port};"
            f"DATABASE={database};"
            f"UID={username};"
            f"PWD={password};"
            "TrustServerCertificate=yes"
            "Encrypt=yes"
            # Optional to identify source of queries
            "APP=snowflake-snowpark-python;"
        )
    
        connection = pyodbc.connect(connection_str)
        return connection
    
    def run(session: Session):
        # Feel free to combine local/udtf ingestion and partition column/predicates
        # as stated in the understanding parallelism section
    
        # Call dbapi to pull data from target table
    
        df = session.read.dbapi(
            create_sql_server_connection,
            table="target_table"
        )
    
        # Call dbapi to pull data from target query
    
        df_query = session.read.dbapi(
            create_sql_server_connection,
            query="select * from target_table"
        )
    
        # Pull data from target table with parallelism using partition column
    
        df_local_par_column = session.read.dbapi(
            create_sql_server_connection,
            table="target_table",
            fetch_size=100000,
            num_partitions=4,
            column="ID",  # swap with the column you want your partition based on
            upper_bound=10000,
            lower_bound=0
        )
    
        udtf_configs = {
            "external_access_integration": "<your external access integration>"
        }
    
        # Pull data from target table with udtf ingestion with parallelism using predicates
    
        df_udtf_predicates = session.read.dbapi(
            create_sql_server_connection,
            table="target_table",
            udtf_configs=udtf_configs,
            fetch_size=100000,
            predicates=[
                "ID < 3",
                "ID >= 3"
            ]
        )
    
        return df
    $$;
    
    CALL sp_mssql_dbapi();
    
    Copy

Using DB-API to connect to SQL server from a Snowflake notebook

  1. From Snowflake Notebook packages, select snowflake-snowpark-python and pyodbc.

  2. 左側のファイルタブで、 environment.yml ファイルを開き、依存関係の下にある他のエントリの後に次のコード行を追加します。

    - msodbcsql18
    
    Copy
  3. Configure the secret, a network rule to allow egress to the source endpoint, and external access integration:

    -- Configure a secret to allow egress to the source endpoint
    
    CREATE OR REPLACE SECRET mssql_secret
    TYPE = PASSWORD
    USERNAME = 'mssql_username'
    PASSWORD = 'mssql_password';
    
    ALTER NOTEBOOK mynotebook SET SECRETS = ('snowflake-secret-object' = mssql_secret);
    
    -- Configure a network rule to allow egress to the source endpoint
    
    CREATE OR REPLACE NETWORK RULE mssql_network_rule
    MODE = EGRESS
    TYPE = HOST_PORT
    VALUE_LIST = ('mssql_host:mssql_port');
    
    -- Configure an external access integration
    
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION mssql_access_integration
    ALLOWED_NETWORK_RULES = (mssql_network_rule)
    ALLOWED_AUTHENTICATION_SECRETS = (mssql_secret)
    ENABLED = true;
    
    Copy
  4. Snowflake Notebooks の外部アクセスの設定, and then restart the notebook session.

  5. Use the DB-API to pull data from SQL Server in a Python cell of a Snowflake notebook:

    # Get user name and password from mssql_secret
    
    import _snowflake
    username_password_object = _snowflake.get_username_password('snowflake-secret-object')
    USER = username_password_object.username
    PASSWORD = username_password_object.password
    
    import snowflake.snowpark.context
    session = snowflake.snowpark.context.get_active_session()
    
    def create_sql_server_connection():
        import pyodbc
        SERVER = SQL_SERVER_CONNECTION_PARAMETERS["SERVER"]
        UID = SQL_SERVER_CONNECTION_PARAMETERS["UID"]
        PWD = SQL_SERVER_CONNECTION_PARAMETERS["PWD"]
        DATABASE = "test_query_history"
        connection_str = (
            f"DRIVER={{ODBC Driver 18 for SQL Server}};"
            f"SERVER={SERVER};"
            f"UID={UID};"
            f"PWD={PWD};"
            f"DATABASE={DATABASE};"
            "TrustServerCertificate=yes;"
            "Encrypt=yes;"
            # Optional to identify source of queries
            "APP=snowflake-snowpark-python;"
        )
        connection = pyodbc.connect(connection_str)
        return connection
    
    # Feel free to combine local/udtf ingestion and partition column/predicates as
    # stated in the understanding parallelism section
    
    # Call dbapi to pull data from target table
    
    df = session.read.dbapi(
        create_sql_server_connection,
        table="target_table"
    )
    
    # Call dbapi to pull data from target query
    
    df_query = session.read.dbapi(
        create_sql_server_connection,
        query="select * from target_table"
    )
    
    # Pull data from target table with parallelism using partition column
    
    df_local_par_column = session.read.dbapi(
        create_sql_server_connection,
        table="target_table",
        fetch_size=100000,
        num_partitions=4,
        column="ID",  # swap with the column you want your partition based on
        upper_bound=10000,
        lower_bound=0
    )
    
    udtf_configs = {
        "external_access_integration": "<your external access integration>"
    }
    
    # Pull data from target table with udtf ingestion with parallelism using predicates
    
    df_udtf_predicates = session.read.dbapi(
        create_sql_server_connection,
        table="target_table",
        udtf_configs=udtf_configs,
        fetch_size=100000,
        predicates=[
            "ID < 3",
            "ID >= 3"
        ]
    )
    
    # Save data into sf_table
    df.write.mode("overwrite").save_as_table('sf_table')
    
    Copy

Source tracing when using DB-API to connect to SQL server

  1. Create接続関数にSnowparkのタグを含めます。

    def create_sql_server_connection():
        import pyodbc
        SERVER = "<your host name>"
        PORT = <your port>
        UID = "<your user name>"
        PWD = "<your password>"
        DATABASE = "<your database name>"
        connection_str = (
            f"DRIVER={{ODBC Driver 18 for SQL Server}};"
            f"SERVER={SERVER}:{PORT};"
            f"UID={UID};"
            f"PWD={PWD};"
            f"DATABASE={DATABASE};"
            "TrustServerCertificate=yes"
            "Encrypt=yes"
            # include this parameter for source tracing
            "APP=snowflake-snowpark-python;"
        )
        connection = pyodbc.connect(connection_str)
        return connection
    
    Copy
  2. データソースで以下の SQL を実行し、まだ存続しているSnowparkからのクエリをキャプチャします。

    SELECT
        s.session_id,
        s.program_name,
        r.status,
        t.text AS sql_text
    FROM sys.dm_exec_sessions s
    JOIN sys.dm_exec_requests r ON s.session_id = r.session_id
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE s.program_name = 'snowflake-snowpark-python';
    
    Copy

Oracle

To connect to Oracle from Snowpark, you need the following two packages:

The following code examples show how to connect to Oracle from a Snowpark client, stored procedures, and a Snowflake notebook.

Use the DB-API to connect to Oracle from a Snowpark client

  1. Install snowflake-snowpark-python[pandas] and oracledb:

    pip install snowflake-snowpark-python[pandas]
    pip install oracledb
    
    Copy
  2. Use the DB-API to pull data from Oracle and define the factory method for creating a connection to Oracle:

    def create_oracle_db_connection():
        import oracledb
        HOST = "<your host>"
        PORT = <your port>
        SERVICE_NAME = "<your service name>"
        USER = "<your user name>"
        PASSWORD = "your password"
        DSN = f"{HOST}:{PORT}/{SERVICE_NAME}"
        connection = oracledb.connect(
            user=USER,
            password=PASSWORD,
            dsn=DSN
        )
        # Optional: include this parameter for source tracing
        connection.clientinfo = "snowflake-snowpark-python"
        return connection
    
    # Feel free to combine local/udtf ingestion and partition column/predicates as
    # stated in the understanding parallelism section
    
    # Call dbapi to pull data from target table
    
    df = session.read.dbapi(
        create_oracle_db_connection,
        table="target_table"
    )
    
    # Call dbapi to pull data from target query
    
    df_query = session.read.dbapi(
        create_oracle_db_connection,
        query="select * from target_table"
    )
    
    # Pull data from target table with parallelism using partition column
    
    df_local_par_column = session.read.dbapi(
        create_oracle_db_connection,
        table="target_table",
        fetch_size=100000,
        num_partitions=4,
        column="ID",  # swap with the column you want your partition based on
        upper_bound=10000,
        lower_bound=0
    )
    
    udtf_configs = {
        "external_access_integration": "<your external access integration>"
    }
    
    # Pull data from target table with udtf ingestion with parallelism using predicates
    
    df_udtf_predicates = session.read.dbapi(
        create_oracle_db_connection,
        table="target_table",
        udtf_configs=udtf_configs,
        fetch_size=100000,
        predicates=[
            "ID < 3",
            "ID >= 3"
        ]
    )
    
    Copy

DB-API を使用しようして、ストアドプロシージャからOracleに接続します

  1. Configure an external access integration (EAI), which is required to allow Snowflake to connect to the source endpoint.

    注釈

    PrivateLink is recommended for secure data transfer, especially when you're dealing with sensitive information. Ensure that your Snowflake account has the necessary PrivateLink privileges enabled and that the PrivateLink feature is configured and active in your Snowflake Notebook environment.

  2. Configure the secret, a network rule to allow egress to the source endpoint, and EAI:

    -- Configure the secret, a network rule to allow egress to the source endpoint, and EAI:
    
    CREATE OR REPLACE SECRET ora_secret
    TYPE = PASSWORD
    USERNAME = 'ora_username'
    PASSWORD = 'ora_password';
    
    -- configure a network rule to allow egress to the source endpoint
    
    CREATE OR REPLACE NETWORK RULE ora_network_rule
    MODE = EGRESS
    TYPE = HOST_PORT
    VALUE_LIST = ('ora_host:ora_port');
    
    -- configure an external access integration
    
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION ora_access_integration
    ALLOWED_NETWORK_RULES = (ora_network_rule)
    ALLOWED_AUTHENTICATION_SECRETS = (ora_secret)
    ENABLED = true;
    
    Copy
  3. Use Snowpark Python DB-API to pull data from Oracle in a Python stored procedure:

    CREATE OR REPLACE PROCEDURE sp_ora_dbapi()
        RETURNS TABLE()
        LANGUAGE PYTHON
        RUNTIME_VERSION='3.11'
        HANDLER='run'
        PACKAGES=('snowflake-snowpark-python', 'oracledb')
        EXTERNAL_ACCESS_INTEGRATIONS = (ora_access_integration)
        SECRETS = ('cred' = ora_secret )
    AS $$
    
    # Get user name and password from ora_secret
    import _snowflake
    username_password_object = _snowflake.get_username_password('cred')
    USER = username_password_object.username
    PASSWORD = username_password_object.password
    
    # Define the factory method for creating a connection to Oracle
    
    from snowflake.snowpark import Session
    
    def create_oracle_db_connection():
        import oracledb
        host = "ora_host"
        port = "ora_port"
        service_name = "ora_service"
        user = USER
        password = PASSWORD
        DSN = f"{host}:{port}/{service_name}"
        connection = oracledb.connect(
            user=USER,
            password=PASSWORD,
            dsn=DSN
        )
        # Optional: include this parameter for source tracing
        connection.clientinfo = "snowflake-snowpark-python"
        return connection
    
    def run(session: Session):
        # Feel free to combine local/udtf ingestion and partition column/predicates
        # as stated in the understanding parallelism section
    
        # Call dbapi to pull data from target table
    
        df = session.read.dbapi(
            create_oracle_db_connection,
            table="target_table"
        )
    
        # Call dbapi to pull data from target query
    
        df_query = session.read.dbapi(
            create_oracle_db_connection,
            query="select * from target_table"
        )
    
        # Pull data from target table with parallelism using partition column
    
        df_local_par_column = session.read.dbapi(
            create_oracle_db_connection,
            table="target_table",
            fetch_size=100000,
            num_partitions=4,
            column="ID",  # swap with the column you want your partition based on
            upper_bound=10000,
            lower_bound=0
        )
    
        udtf_configs = {
            "external_access_integration": "<your external access integration>"
        }
    
        # Pull data from target table with udtf ingestion with parallelism using predicates
    
        df_udtf_predicates = session.read.dbapi(
            create_oracle_db_connection,
            table="target_table",
            udtf_configs=udtf_configs,
            fetch_size=100000,
            predicates=[
                "ID < 3",
                "ID >= 3"
            ]
        )
        return df
    $$;
    
    CALL sp_ora_dbapi();
    
    Copy

Using DB-API to connect to Oracle from a Snowflake notebook

  1. From Snowflake Notebook packages, select snowflake-snowpark-python and oracledb.

  2. Configure an external access integration (EAI), which is required to allow Snowflake to connect to the source endpoint.

    注釈

    PrivateLink is recommended for secure data transfer, especially when you're dealing with sensitive information. Ensure that your Snowflake account has the necessary PrivateLink privileges enabled and that the PrivateLink feature is configured and active in your Snowflake Notebook environment.

  3. Configure the secret, a network rule, and EAI to allow egress to the source endpoint:

    -- Configure the secret, a network rule to allow egress to the source endpoint, and EAI:
    CREATE OR REPLACE SECRET mysql_secret
        TYPE = PASSWORD
        USERNAME = 'mysql_username'
        PASSWORD = 'mysql_password';
    ALTER NOTEBOOK mynotebook SET SECRETS = ('snowflake-secret-object' = mysql_secret);
    
    -- configure a network rule to allow egress to the source endpoint
    
    CREATE OR REPLACE NETWORK RULE mysql_network_rule
        MODE = EGRESS
        TYPE = HOST_PORT
        VALUE_LIST = ('mysql_host:mysql_port');
    
    -- configure an external access integration
    
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION mysql_access_integration
        ALLOWED_NETWORK_RULES = (mysql_network_rule)
        ALLOWED_AUTHENTICATION_SECRETS = (mysql_secret)
        ENABLED = true;
    
    Copy
  4. Snowflake Notebooks の外部アクセスの設定, and then restart the notebook session.

  5. Use the DB-API to pull data from Oracle in a Python cell of a Snowflake notebook:

    # Get user name and password from ora_secret
    
    import _snowflake
    username_password_object = _snowflake.get_username_password('snowflake-secret-object')
    USER = username_password_object.username
    PASSWORD = username_password_object.password
    
    import snowflake.snowpark.context
    session = snowflake.snowpark.context.get_active_session()
    
    # Define the factory method for creating a connection to Oracle
    
    def create_oracle_db_connection():
        import oracledb
        host = "ora_host"
        port = "ora_port"
        service_name = "ora_service"
        user = USER
        password = PASSWORD
        DSN = f"{host}:{port}/{service_name}"
        connection = oracledb.connect(
            user=USER,
            password=PASSWORD,
            dsn=DSN,
        )
        # Optional: include this parameter for source tracing
        connection.clientinfo = "snowflake-snowpark-python"
        return connection
    
    # Feel free to combine local/udtf ingestion and partition column/predicates as
    # stated in the understanding parallelism section
    
    # Call dbapi to pull data from target table
    
    df = session.read.dbapi(
        create_oracle_db_connection,
        table="target_table"
    )
    
    # Call dbapi to pull data from target query
    
    df_query = session.read.dbapi(
        create_oracle_db_connection,
        query="select * from target_table"
    )
    
    # Pull data from target table with parallelism using partition column
    
    df_local_par_column = session.read.dbapi(
        create_oracle_db_connection,
        table="target_table",
        fetch_size=100000,
        num_partitions=4,
        column="ID",  # swap with the column you want your partition based on
        upper_bound=10000,
        lower_bound=0
    )
    
    udtf_configs = {
        "external_access_integration": "<your external access integration>"
    }
    
    # Pull data from target table with udtf ingestion with parallelism using predicates
    
    df_udtf_predicates = session.read.dbapi(
        create_oracle_db_connection,
        table="target_table",
        udtf_configs=udtf_configs,
        fetch_size=100000,
        predicates=[
            "ID < 3",
            "ID >= 3"
        ]
    )
    
    # Save data into sf_table
    
    df_ora.write.mode("overwrite").save_as_table('sf_table')
    
    Copy

Source tracing when using DB-API to connect to Oracle

  1. Create接続関数にSnowparkのタグを含めます。

    def create_oracle_db_connection():
        import oracledb
        HOST = "myhost"
        PORT = "myport"
        SERVICE_NAME = "myservice"
        USER = "myuser"
        PASSWORD = "mypassword"
        DSN = f"{HOST}:{PORT}/{SERVICE_NAME}"
        connection = oracledb.connect(
            user=USER,
            password=PASSWORD,
            dsn=DSN,
        )
        # include this parameter for source tracing
        connection.clientinfo = "snowflake-snowpark-python"
        return connection
    
    Copy
  2. データソースで以下の SQL を実行し、まだ存続しているSnowparkからのクエリをキャプチャします。

    SELECT
        s.sid,
        s.serial#,
        s.username,
        s.module,
        q.sql_id,
        q.sql_text,
        q.last_active_time
    FROM
        v$session s
        JOIN v$sql q ON s.sql_id = q.sql_id
    WHERE
        s.client_info = 'snowflake-snowpark-python'
    
    Copy

PostgreSQL

To connect to PostgreSQL from Snowpark, you need the following two packages:

The following code examples show how to connect to PostgreSQL from a Snowpark client, stored procedures, and a Snowflake notebook.

Use the DB-API to connect to PostgreSQL from a Snowpark client

  1. Install psycopg2:

    pip install psycopg2
    
    Copy
  2. Define the factory method for creating a connection to PostgreSQL:

    def create_pg_connection():
        import psycopg2
        connection = psycopg2.connect(
            host="pg_host",
            port=pg_port,
            dbname="pg_dbname",
            user="pg_user",
            password="pg_password",
            # Optional: include this parameter for source tracing
            application_name="snowflake-snowpark-python"
        )
        return connection
    
    # Feel free to combine local/udtf ingestion and partition column/predicates as
    # stated in the understanding parallelism section
    
    # Call dbapi to pull data from target table
    
    df = session.read.dbapi(
        create_pg_connection,
        table="target_table"
    )
    
    # Call dbapi to pull data from target query
    
    df_query = session.read.dbapi(
        create_pg_connection,
        query="select * from target_table"
    )
    
    # Pull data from target table with parallelism using partition column
    
    df_local_par_column = session.read.dbapi(
        create_pg_connection,
        table="target_table",
        fetch_size=100000,
        num_partitions=4,
        column="ID",  # Swap with the column you want your partition based on
        upper_bound=10000,
        lower_bound=0
    )
    
    udtf_configs = {
        "external_access_integration": "<your external access integration>"
    }
    
    # Pull data from target table with udtf ingestion with parallelism using predicates
    
    df_udtf_predicates = session.read.dbapi(
        create_pg_connection,
        table="target_table",
        udtf_configs=udtf_configs,
        fetch_size=100000,
        predicates=[
            "ID < 3",
            "ID >= 3"
        ]
    )
    
    Copy

DB-API を使用して、ストアドプロシージャから PostgreSQL に接続します

  1. Configure an external access integration (EAI), which is required to allow Snowflake to connect to the source endpoint.

    注釈

    PrivateLink is recommended for secure data transfer, especially when you're dealing with sensitive information. Ensure that your Snowflake account has the necessary PrivateLink privileges enabled and that the PrivateLink feature is configured and active in your Snowflake Notebook environment.

  2. Configure the secret, a network rule to allow egress to the source endpoint, and EAI:

    -- configure a secret
    
    CREATE OR REPLACE SECRET pg_secret
        TYPE = PASSWORD
        USERNAME = 'pg_username'
        PASSWORD = 'pg_password';
    
    -- configure a network rule.
    
    CREATE OR REPLACE NETWORK RULE pg_network_rule
        MODE = EGRESS
        TYPE = HOST_PORT
        VALUE_LIST = ('pg_host:pg_port');
    
    -- configure an external access integration.
    
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION pg_access_integration
        ALLOWED_NETWORK_RULES = (pg_network_rule)
        ALLOWED_AUTHENTICATION_SECRETS = (pg_secret)
        ENABLED = true;
    
    Copy
  3. Use Snowpark Python DB-API to pull data from PostgreSQL in a Python stored procedure:

    CREATE OR REPLACE PROCEDURE sp_pg_dbapi()
        RETURNS TABLE()
        LANGUAGE PYTHON
        RUNTIME_VERSION='3.11'
        HANDLER='run'
        PACKAGES=('snowflake-snowpark-python', 'psycopg2')
        EXTERNAL_ACCESS_INTEGRATIONS = (pg_access_integration)
        SECRETS = ('cred' = pg_secret )
    AS $$
    
    # Get user name and password from pg_secret
    
    import _snowflake
    username_password_object = _snowflake.get_username_password('cred')
    USER = username_password_object.username
    PASSWORD = username_password_object.password
    
    # Define the factory method for creating a connection to PostgreSQL
    
    from snowflake.snowpark import Session
    
    def create_pg_connection():
        import psycopg2
        connection = psycopg2.connect(
            host="pg_host",
            port=pg_port,
            dbname="pg_dbname",
            user=USER,
            password=PASSWORD,
            # Optional: include this parameter for source tracing
            application_name="snowflake-snowpark-python"
        )
        return connection
    
    def run(session: Session):
    
        # Feel free to combine local/udtf ingestion and partition column/predicates
        # as stated in the understanding parallelism section
    
        # Call dbapi to pull data from target table
    
        df = session.read.dbapi(
            create_pg_connection,
            table="target_table"
        )
    
        # Call dbapi to pull data from target query
    
        df_query = session.read.dbapi(
            create_pg_connection,
            query="select * from target_table"
        )
    
        # Pull data from target table with parallelism using partition column
    
        df_local_par_column = session.read.dbapi(
            create_pg_connection,
            table="target_table",
            fetch_size=100000,
            num_partitions=4,
            column="ID",  # swap with the column you want your partition based on
            upper_bound=10000,
            lower_bound=0
        )
    
        udtf_configs = {
            "external_access_integration": "<your external access integration>"
        }
    
        # Pull data from target table with udtf ingestion with parallelism using predicates
    
        df_udtf_predicates = session.read.dbapi(
            create_pg_connection,
            table="target_table",
            udtf_configs=udtf_configs,
            fetch_size=100000,
            predicates=[
                "ID < 3",
                "ID >= 3"
            ]
        )
        return df
    
    $$;
    CALL sp_pg_dbapi();
    
    Copy

Using DB-API to connect to PostgreSQL from a Snowflake notebook

  1. From Snowflake Notebook packages, select snowflake-snowpark-python and psycopg2.

  2. Configure an external access integration (EAI), which is required to allow Snowflake to connect to the source endpoint.

    注釈

    PrivateLink is recommended for secure data transfer, especially when you're dealing with sensitive information. Ensure that your Snowflake account has the necessary PrivateLink privileges enabled and that the PrivateLink feature is configured and active in your Snowflake Notebook environment.

  3. Configure the secret, a network rule to allow egress to the source endpoint, and EAI:

    -- Configure the secret
    
    CREATE OR REPLACE SECRET pg_secret
        TYPE = PASSWORD
        USERNAME = 'pg_username'
        PASSWORD = 'pg_password';
    
    ALTER NOTEBOOK pg_notebook SET SECRETS = ('snowflake-secret-object' = pg_secret);
    
    -- Configure the network rule to allow egress to the source endpoint
    
    CREATE OR REPLACE NETWORK RULE pg_network_rule
        MODE = EGRESS
        TYPE = HOST_PORT
        VALUE_LIST = ('pg_host:pg_port');
    
    -- Configure external access integration
    
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION pg_access_integration
        ALLOWED_NETWORK_RULES = (pg_network_rule)
        ALLOWED_AUTHENTICATION_SECRETS = (pg_secret)
        ENABLED = true;
    
    Copy
  4. Snowflake Notebooks の外部アクセスの設定, and then restart the notebook session.

  5. Use the DB-API to pull data from PostgreSQL in a Python cell of a Snowflake notebook:

    # Get the user name and password from :code:`pg_secret`
    
    import _snowflake
    username_password_object = _snowflake.get_username_password('snowflake-secret-object')
    USER = username_password_object.username
    PASSWORD = username_password_object.password
    
    import snowflake.snowpark.context
    session = snowflake.snowpark.context.get_active_session()
    
    # Define the factory method for creating a connection to PostgreSQL
    
    def create_pg_connection():
        import psycopg2
        connection = psycopg2.connect(
            host="pg_host",
            port=pg_port,
            dbname="pg_dbname",
            user=USER,
            password=PASSWORD,
            # Optional: include this parameter for source tracing
            application_name="snowflake-snowpark-python"
        )
        return connection
    
    # Feel free to combine local/udtf ingestion and partition column/predicates as
    # stated in the understanding parallelism section
    
    # Call dbapi to pull data from target table
    
    df = session.read.dbapi(
        create_pg_connection,
        table="target_table"
    )
    
    # Call dbapi to pull data from target query
    
    df_query = session.read.dbapi(
        create_pg_connection,
        query="select * from target_table"
    )
    
    # Pull data from target table with parallelism using partition column
    
    df_local_par_column = session.read.dbapi(
        create_pg_connection,
        table="target_table",
        fetch_size=100000,
        num_partitions=4,
        column="ID",  # swap with the column you want your partition based on
        upper_bound=10000,
        lower_bound=0
    )
    
    udtf_configs = {
        "external_access_integration": "<your external access integration>"
    }
    
    # Pull data from target table with udtf ingestion with parallelism using predicates
    
    df_udtf_predicates = session.read.dbapi(
        create_pg_connection,
        table="target_table",
        udtf_configs=udtf_configs,
        fetch_size=100000,
        predicates=[
            "ID < 3",
            "ID >= 3"
        ]
    )
    
    # Save data into sf_table
    
    df.write.mode("overwrite").save_as_table('sf_table')
    # Get the user name and password from :code:`pg_secret`
    
    Copy

Source tracing when using DB-API to connect to PostgreSQL

  1. Create接続関数にSnowparkのタグを含めます。

    def create_pg_connection():
        import psycopg2
        connection = psycopg2.connect(
            host="pg_host",
            port=pg_port,
            dbname="pg_dbname",
            user="pg_user",
            password="pg_password",
            # Include this parameter for source tracing
            application_name="snowflake-snowpark-python"
        )
        return connection
    
    Copy
  2. データソースで以下の SQL を実行し、まだ存続しているSnowparkからのクエリをキャプチャします。

    SELECT
        pid,
        usename AS username,
        datname AS database,
        application_name,
        client_addr,
        state,
        query_start,
        query
    FROM
        pg_stat_activity
    WHERE
        application_name = 'snowflake-snowpark-python';
    
    Copy

MySQL

To connect to MySQL from Snowpark, you need the following two packages:

The following code examples show how to connect to MySQL from a Snowpark client, stored procedures, and a Snowflake notebook.

Use the DB-API to connect to MySQL from a Snowpark client

  1. Install pymysql:

    pip install snowflake-snowpark-python[pandas]
    pip install pymysql
    
    Copy
  2. Define the factory method for creating a connection to MySQL:

    def create_mysql_connection():
        import pymysql
        connection = pymysql.connect(
            host="mysql_host",
            port=mysql_port,
            database="mysql_db",
            user="mysql_user",
            password="mysql_password",
            # Optional: include this parameter for source tracing
            init_command="SET @program_name='snowflake-snowpark-python';"
        )
        return connection
    
    # Feel free to combine local/udtf ingestion and partition column/predicates as
    # stated in the understanding parallelism section
    
    # Call dbapi to pull data from target table
    
    df = session.read.dbapi(
        create_mysql_connection,
        table="target_table"
    )
    
    # Call dbapi to pull data from target query
    
    df_query = session.read.dbapi(
        create_mysql_connection,
        query="select * from target_table"
    )
    
    # Pull data from target table with parallelism using partition column
    
    df_local_par_column = session.read.dbapi(
        create_mysql_connection,
        table="target_table",
        fetch_size=100000,
        num_partitions=4,
        column="ID",  # swap with the column you want your partition based on
        upper_bound=10000,
        lower_bound=0
    )
    
    udtf_configs = {
        "external_access_integration": "<your external access integration>"
    }
    
    # Pull data from target table with udtf ingestion with parallelism using predicates
    
    df_udtf_predicates = session.read.dbapi(
        create_mysql_connection,
        table="target_table",
        udtf_configs=udtf_configs,
        fetch_size=100000,
        predicates=[
            "ID < 3",
            "ID >= 3"
        ]
    )
    
    Copy

DB-API を使用して、ストアドプロシージャから MySQL に接続します

  1. Configure an external access integration (EAI), which is required to allow Snowflake to connect to the source endpoint.

    注釈

    PrivateLink is recommended for secure data transfer, especially when you're dealing with sensitive information. Ensure that your Snowflake account has the necessary PrivateLink privileges enabled and that the PrivateLink feature is configured and active in your Snowflake Notebook environment.

  2. Configure the secret, a network rule to allow egress to the source endpoint, and EAI:

    CREATE OR REPLACE SECRET mysql_secret
        TYPE = PASSWORD
        USERNAME = 'mysql_username'
        PASSWORD = 'mysql_password';
    
    -- configure a network rule.
    
    CREATE OR REPLACE NETWORK RULE mysql_network_rule
        MODE = EGRESS
        TYPE = HOST_PORT
        VALUE_LIST = ('mysql_host:mysql_port');
    
    -- configure an external access integration
    
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION mysql_access_integration
        ALLOWED_NETWORK_RULES = (mysql_network_rule)
        ALLOWED_AUTHENTICATION_SECRETS = (mysql_secret)
            ENABLED = true;
    
    Copy
  3. Use the Snowpark Python DB-API to pull data from MySQL in a Python stored procedure:

    CREATE OR REPLACE PROCEDURE sp_mysql_dbapi()
        RETURNS TABLE()
        LANGUAGE PYTHON
        RUNTIME_VERSION='3.11'
        HANDLER='run'
        PACKAGES=('snowflake-snowpark-python', 'pymysql')
        EXTERNAL_ACCESS_INTEGRATIONS = (mysql_access_integration)
        SECRETS = ('cred' = mysql_secret )
    AS $$
    
    # Get user name and password from mysql_secret
    
    import _snowflake
        username_password_object = _snowflake.get_username_password('cred')
        USER = username_password_object.username
        PASSWORD = username_password_object.password
    
    # Define the factory method for creating a connection to MySQL
    
    from snowflake.snowpark import session
    
    def create_mysql_connection():
        import pymysql
        connection = pymysql.connect(
            host="mysql_host",
            port=mysql_port,
            dbname="mysql_dbname",
            user=USER,
            password=PASSWORD,
            # Optional: include this parameter for source tracing
            init_command="SET @program_name='snowflake-snowpark-python';"
        )
        return connection
    
    # Using Snowpark Python DB-API to pull data from MySQL in a Python stored procedure.
    
    def run(session: Session):
        # Feel free to combine local/udtf ingestion and partition column/predicates
        # as stated in the understanding parallelism section
    
        # Call dbapi to pull data from target table
    
        df = session.read.dbapi(
            create_mysql_connection,
            table="target_table"
        )
    
        # Call dbapi to pull data from target query
    
        df_query = session.read.dbapi(
            create_mysql_connection,
            query="select * from target_table"
        )
    
        # Pull data from target table with parallelism using partition column
    
        df_local_par_column = session.read.dbapi(
            create_mysql_connection,
            table="target_table",
            fetch_size=100000,
            num_partitions=4,
            column="ID",  # swap with the column you want your partition based on
            upper_bound=10000,
            lower_bound=0
        )
    
        udtf_configs = {
            "external_access_integration": "<your external access integration>"
        }
    
        # Pull data from target table with udtf ingestion with parallelism using predicates
    
        df_udtf_predicates = session.read.dbapi(
            create_mysql_connection,
            table="target_table",
            udtf_configs=udtf_configs,
            fetch_size=100000,
            predicates=[
                "ID < 3",
                "ID >= 3"
            ]
        )
        return df
    $$;
    
    CALL sp_mysql_dbapi();
    
    Copy

Using DB-API to connect to MySQL from a Snowflake notebook

  1. From Snowflake Notebook packages, select snowflake-snowpark-python and pymysql.

  2. Configure an external access integration (EAI), which is required to allow Snowflake to connect to the source endpoint.

    注釈

    PrivateLink is recommended for secure data transfer, especially when you're dealing with sensitive information. Ensure that your Snowflake account has the necessary PrivateLink privileges enabled and that the PrivateLink feature is configured and active in your Snowflake Notebook environment.

  3. Configure the secret, a network rule to allow egress to the source endpoint, and EAI:

    CREATE OR REPLACE SECRET mysql_secret
        TYPE = PASSWORD
        USERNAME = 'mysql_username'
        PASSWORD = 'mysql_password';
    
    ALTER NOTEBOOK mynotebook SET SECRETS = ('snowflake-secret-object' = mysql_secret);
    
    -- configure a network rule.
    CREATE OR REPLACE NETWORK RULE mysql_network_rule
        MODE = EGRESS
        TYPE = HOST_PORT
        VALUE_LIST = ('mysql_host:mysql_port');
    
    -- configure an EAI
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION mysql_access_integration
        ALLOWED_NETWORK_RULES = (mysql_network_rule)
        ALLOWED_AUTHENTICATION_SECRETS = (mysql_secret)
        ENABLED = true;
    
    Copy
  4. Snowflake Notebooks の外部アクセスの設定, and then restart the notebook session.

  5. Use the DB-API to pull data from MySQL in a Python cell of a Snowflake notebook:

    # Get user name and password from mysql_secret
    import _snowflake
    username_password_object = _snowflake.get_username_password('snowflake-secret-object')
    USER = username_password_object.username
    PASSWORD = username_password_object.password
    
    import snowflake.snowpark.context
    session = snowflake.snowpark.context.get_active_session()
    
    # Define the factory method for creating a connection to MySQL
    
    def create_mysql_connection():
        import pymysql
        connection = pymysql.connect(
            host="mysql_host",
            port=mysql_port,
            dbname="mysql_dbname",
            user=USER,
            password=PASSWORD,
            # Optional: include this parameter for source tracing
            init_command="SET @program_name='snowflake-snowpark-python';"
        )
        return connection
    
    # Feel free to combine local/udtf ingestion and partition column/predicates as
    # stated in the understanding parallelism section
    
    # Call dbapi to pull data from target table
    
    df = session.read.dbapi(
        create_mysql_connection,
        table="target_table"
    )
    
    # Call dbapi to pull data from target query
    
    df_query = session.read.dbapi(
        create_mysql_connection,
        query="select * from target_table"
    )
    
    # Pull data from target table with parallelism using partition column
    
    df_local_par_column = session.read.dbapi(
        create_mysql_connection,
        table="target_table",
        fetch_size=100000,
        num_partitions=4,
        column="ID",  # swap with the column you want your partition based on
        upper_bound=10000,
        lower_bound=0
    )
    
    udtf_configs = {
        "external_access_integration": "<your external access integration>"
    }
    
    # Pull data from target table with udtf ingestion with parallelism using predicates
    
    df_udtf_predicates = session.read.dbapi(
        create_mysql_connection,
        table="target_table",
        udtf_configs=udtf_configs,
        fetch_size=100000,
        predicates=[
            "ID < 3",
            "ID >= 3"
        ]
    )
    
    # Save data into sf_table
    
    df.write.mode("overwrite").save_as_table('sf_table')
    
    Copy

DB-API を使用して MySQL に接続する場合のソーストレース

  1. Create接続関数にSnowparkのタグを含めます。

    def create_mysql_connection():
        import pymysql
        connection = pymysql.connect(
            host="mysql_host",
            port=mysql_port,
            database="mysql_db",
            user="mysql_user",
            password="mysql_password",
            # include this parameter for source tracing
            init_command="SET @program_name='snowflake-snowpark-python';"
        )
        return connection
    
    Copy
  2. データソースで以下の SQL を実行し、Snowparkからのクエリをキャプチャします。

    SELECT *
    FROM performance_schema.events_statements_history_long
    WHERE THREAD_ID = (
        SELECT THREAD_ID
        FROM performance_schema.events_statements_history_long
        WHERE SQL_TEXT = "SET @program_name='snowflake-snowpark-python'"
        ORDER BY EVENT_ID DESC
        LIMIT 1
    )
    
    Copy

Databricks

To connect to Databricks from Snowpark, you need the following two packages:

The following code examples show how to connect to Databricks from a Snowpark client, stored procedures, and a Snowflake notebook.

DB-APIを使用して、SnowparkクライアントからDatabricksに接続します

  1. Install databricks-sql-connector:

    pip install snowflake-snowpark-python[pandas]
    pip install databricks-sql-connector
    
    Copy
  2. Define the factory method for creating a connection to Databricks:

    def create_dbx_connection():
        import databricks.sql
        connection = databricks.sql.connect(
            server_hostname=HOST,
            http_path=PATH,
            access_token=ACCESS_TOKEN
        )
        return connection
    
    # Feel free to combine local/udtf ingestion and partition column/predicates as
    # stated in the understanding parallelism section
    
    # Call dbapi to pull data from target table
    
    df = session.read.dbapi(
        create_dbx_connection,
        table="target_table"
    )
    
    # Call dbapi to pull data from target query
    
    df_query = session.read.dbapi(
        create_dbx_connection,
        query="select * from target_table"
    )
    
    # Pull data from target table with parallelism using partition column
    
    df_local_par_column = session.read.dbapi(
        create_dbx_connection,
        table="target_table",
        fetch_size=100000,
        num_partitions=4,
        column="ID",  # swap with the column you want your partition based on
        upper_bound=10000,
        lower_bound=0
    )
    
    udtf_configs = {
        "external_access_integration": "<your external access integration>"
    }
    
    # Pull data from target table with udtf ingestion with parallelism using predicates
    
    df_udtf_predicates = session.read.dbapi(
        create_dbx_connection,
        table="target_table",
        udtf_configs=udtf_configs,
        fetch_size=100000,
        predicates=[
            "ID < 3",
            "ID >= 3"
        ]
    )
    
    Copy

DB-API を使用して、ストアドプロシージャからDatabricksに接続します

  1. Configure an external access integration (EAI), which is required to allow Snowflake to connect to the source endpoint.

    注釈

    PrivateLink is recommended for secure data transfer, especially when you're dealing with sensitive information. Ensure that your Snowflake account has the necessary PrivateLink privileges enabled and that the PrivateLink feature is configured and active in your Snowflake Notebook environment.

  2. Configure the secret, a network rule to allow egress to the source endpoint, and EAI:

    CREATE OR REPLACE SECRET dbx_secret
        TYPE = GENERIC_STRING
        SECRET_STRING = 'dbx_access_token';
    
    CREATE OR REPLACE NETWORK RULE dbx_network_rule
        MODE = EGRESS
        TYPE = HOST_PORT
        VALUE_LIST = ('dbx_host:dbx_port');
    
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION dbx_access_integration
        ALLOWED_NETWORK_RULES = (dbx_network_rule)
        ALLOWED_AUTHENTICATION_SECRETS = (dbx_secret)
        ENABLED = true;
    
    Copy
  3. Use the Snowpark Python DB-API to pull data from Databricks in a Python stored procedure:

    CREATE OR REPLACE PROCEDURE sp_dbx_dbapi()
        RETURNS TABLE()
        LANGUAGE PYTHON
        RUNTIME_VERSION='3.11'
        HANDLER='run'
        PACKAGES=('snowflake-snowpark-python', 'databricks-sql-connector')
        EXTERNAL_ACCESS_INTEGRATIONS = (dbx_access_integration)
        SECRETS = ('cred' = dbx_secret )
    AS $$
    
    # Get user name and password from dbx_secret
    
    import _snowflake
    ACCESS_TOKEN = _snowflake.get_generic_secret_string('cred')
    
    from snowflake.snowpark import Session
    
    # Define the method for creating a connection to Databricks
    def create_dbx_connection():
        import databricks.sql
        connection = databricks.sql.connect(
            server_hostname="dbx_host",
            http_path="dbx_path",
            access_token=ACCESS_TOKEN,
        )
        return connection
    
    # Using Snowpark Python DB-API to pull data from DataBricks in a Python stored procedure.
    
    def run(session: Session):
        # Feel free to combine local/udtf ingestion and partition column/predicates
        # as stated in the understanding parallelism section
    
        # Call dbapi to pull data from target table
    
        df = session.read.dbapi(
            create_dbx_connection,
            table="target_table"
        )
    
        # Call dbapi to pull data from target query
    
        df_query = session.read.dbapi(
            create_dbx_connection,
            query="select * from target_table"
        )
    
        # Pull data from target table with parallelism using partition column
    
        df_local_par_column = session.read.dbapi(
            create_dbx_connection,
            table="target_table",
            fetch_size=100000,
            num_partitions=4,
            column="ID",  # swap with the column you want your partition based on
            upper_bound=10000,
            lower_bound=0
        )
    
        udtf_configs = {
            "external_access_integration": "<your external access integration>"
        }
    
        # Pull data from target table with udtf ingestion with parallelism using predicates
    
        df_udtf_predicates = session.read.dbapi(
            create_dbx_connection,
            table="target_table",
            udtf_configs=udtf_configs,
            fetch_size=100000,
            predicates=[
                "ID < 3",
                "ID >= 3"
            ]
        )
        return df
    
    $$;
    
    CALL sp_dbx_dbapi();
    
    Copy

Using DB-API to connect to Databricks from a Snowflake notebook

  1. Snowflake Notebookパッケージsnowflake-snowpark-python および databricks-sql-connector を選択します。

  2. Configure an external access integration (EAI), which is required to allow Snowflake to connect to the source endpoint.

    注釈

    PrivateLink is recommended for secure data transfer, especially when you're dealing with sensitive information. Ensure that your Snowflake account has the necessary PrivateLink privileges enabled and that the PrivateLink feature is configured and active in your Snowflake Notebook environment.

  3. Configure the secret, a network rule to allow egress to the source endpoint, and EAI:

    CREATE OR REPLACE SECRET dbx_secret
    TYPE = GENERIC_STRING
    SECRET_STRING = 'dbx_access_token';
    
    ALTER NOTEBOOK mynotebook SET SECRETS = ('snowflake-secret-object' = dbx_secret);
    
    CREATE OR REPLACE NETWORK RULE dbx_network_rule
    MODE = EGRESS
    TYPE = HOST_PORT
    VALUE_LIST = ('dbx_host:dbx_port');
    
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION dbx_access_integration
    ALLOWED_NETWORK_RULES = (dbx_network_rule)
    ALLOWED_AUTHENTICATION_SECRETS = (dbx_secret)
    ENABLED = true;
    
    Copy
  4. Snowflake Notebooks の外部アクセスの設定, and then restart the notebook session.

  5. Use the DB-API to pull data from Databricks in a Python cell of a Snowflake notebook:

    # Get user name and password from dbx_secret
    
    import _snowflake
    ACCESS_TOKEN = _snowflake.get_generic_secret_string('cred')
    
    import snowflake.snowpark.context
    session = snowflake.snowpark.context.get_active_session()
    
    # Define the factory method for creating a connection to Databricks
    
    def create_dbx_connection():
        import databricks.sql
        connection = databricks.sql.connect(
            server_hostname="dbx_host",
            http_path="dbx_path",
            access_token=ACCESS_TOKEN,
        )
        return connection
    
    # Feel free to combine local/udtf ingestion and partition column/predicates as
    # stated in the understanding parallelism section
    
    # Call dbapi to pull data from target table
    
    df = session.read.dbapi(
        create_dbx_connection,
        table="target_table"
    )
    
    # Call dbapi to pull data from target query
    
    df_query = session.read.dbapi(
        create_dbx_connection,
        query="select * from target_table"
    )
    
    # Pull data from target table with parallelism using partition column
    
    df_local_par_column = session.read.dbapi(
        create_dbx_connection,
        table="target_table",
        fetch_size=100000,
        num_partitions=4,
        column="ID",  # swap with the column you want your partition based on
        upper_bound=10000,
        lower_bound=0
    )
    
    udtf_configs = {
        "external_access_integration": "<your external access integration>"
    }
    
    # Pull data from target table with udtf ingestion with parallelism using predicates
    
    df_udtf_predicates = session.read.dbapi(
        create_dbx_connection,
        table="target_table",
        udtf_configs=udtf_configs,
        fetch_size=100000,
        predicates=[
            "ID < 3",
            "ID >= 3"
        ]
    )
    
    # Save data into sf_table
    
    df.write.mode("overwrite").save_as_table('sf_table')
    
    Copy

Source tracing when using DB-API to connect to Databricks

  1. Create接続関数にSnowparkのタグを含めます。

    def create_dbx_connection():
        import databricks.sql
        connection = databricks.sql.connect(
            server_hostname=HOST,
            http_path=PATH,
            access_token=ACCESS_TOKEN,
            # include this parameter for source tracing
            user_agent_entry="snowflake-snowpark-python"
        )
        return connection
    
    Copy
  2. DataBricks コンソールでクエリ履歴に移動し、ソースが snowflake-snowpark-python であるクエリを検索します。

制限事項

The Snowpark Python DB-API supports only Python DB-API 2.0--compliant drivers (for example, pyodbc or oracledb). JDBC drivers are not supported in this release.