Snowpark ML Data Connector

The DataConnector module provides a unified interface for reading Snowflake data and converting it into formats compatible with popular machine learning frameworks like PyTorch and TensorFlow.

DataConnector automatically uses optimized data ingestion when running in the Container Runtime for ML.

Note

This topic assumes that the Snowpark ML module is installed. If it isn’t, see Using Snowflake ML Locally.

Creating a DataConnector

You can create a DataConnector instance in several ways:

from snowflake.ml import dataset
from snowflake.ml.data import DataConnector

# Create a DataConnector from a SQL query
connector = DataConnector.from_sql("SELECT * FROM my_table", session=session)

# Create a DataConnector from a Snowpark DataFrame
df = session.table("my_table")
connector = DataConnector.from_dataframe(df)

# Create a DataConnector from a Snowflake Dataset
ds = dataset.load_dataset(session, "my_dataset", "v1")
connector = DataConnector.from_dataset(ds)
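
A connector can also materialize its data eagerly; for example, its to_pandas() method returns the results as a pandas DataFrame, which is convenient for quick inspection. A minimal sketch:

# Materialize the connector's data as a pandas DataFrame.
pandas_df = connector.to_pandas()
print(pandas_df.head())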

Using with PyTorch

For usage with PyTorch, use the to_torch_dataset() method to get an IterableDataset, which can then be passed to a PyTorch DataLoader. The DataLoader iterates over the data and yields batched PyTorch tensors. Data is loaded in a streaming fashion, so the full result set is never materialized in memory.

from torch.utils.data import DataLoader

torch_ds = connector.to_torch_dataset(
    batch_size=4,
    shuffle=True,
    drop_last_batch=True
)

for batch in DataLoader(torch_ds, batch_size=None, num_workers=0):
    print(batch)
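
Each batch yielded by the DataLoader is a dictionary mapping column names to tensors. The following sketch shows one way to assemble feature columns and run a training step; the column names (FEATURE_1, FEATURE_2, LABEL) and the model are hypothetical and should be adapted to your schema.

import torch
from torch.utils.data import DataLoader

# A minimal, hypothetical model over two numeric feature columns.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for batch in DataLoader(torch_ds, batch_size=None, num_workers=0):
    # Flatten each column tensor, then stack into a (batch_size, 2) matrix.
    f1 = batch["FEATURE_1"].float().reshape(-1)
    f2 = batch["FEATURE_2"].float().reshape(-1)
    features = torch.stack([f1, f2], dim=1)
    labels = batch["LABEL"].float().reshape(-1, 1)

    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()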

Using with TensorFlow

For usage with TensorFlow, use the to_tf_dataset() method to get a TensorFlow Dataset. Iterating over the Dataset yields batched TensorFlow tensors. Data is loaded in a streaming fashion, so the full result set is never materialized in memory.

tf_ds = connector.to_tf_dataset(
    batch_size=4,
    shuffle=True,
    drop_last_batch=True
)

for batch in tf_ds:
    print(batch)
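
Each batch is likewise a dictionary mapping column names to tensors. A minimal sketch, again with hypothetical column names, of mapping batches into (features, label) tuples for Keras:

import tensorflow as tf

def to_xy(batch):
    # Cast and flatten each column, then stack into a (batch_size, 2) matrix.
    f1 = tf.reshape(tf.cast(batch["FEATURE_1"], tf.float32), [-1])
    f2 = tf.reshape(tf.cast(batch["FEATURE_2"], tf.float32), [-1])
    features = tf.stack([f1, f2], axis=1)
    label = tf.reshape(tf.cast(batch["LABEL"], tf.float32), [-1, 1])
    return features, label

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")
model.fit(tf_ds.map(to_xy), epochs=1)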

Data Processing Options

Shuffling

Pass shuffle=True to randomly shuffle the data during ingestion. This can help prevent overfitting during model training. For a discussion of the value of shuffling, see Why should the data be shuffled for machine learning tasks?

Batching

Use the batch_size parameter to control the size of data batches. Batching is handled efficiently at the data ingestion level. When using a PyTorch DataLoader, you must explicitly pass batch_size=None when instantiating the DataLoader to prevent double batching. See Using with PyTorch for an example of usage with DataLoader.

You can also drop the last batch if it is incomplete by passing drop_last_batch=True to to_torch_dataset or to_tf_dataset.
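
For example, with a hypothetical ten-row table and batch_size=4, the batches contain 4, 4, and 2 rows; passing drop_last_batch=True discards the final 2-row batch:

from torch.utils.data import DataLoader

# Hypothetical: 10 rows batched by 4 yields batches of 4, 4, and 2 rows.
# drop_last_batch=True discards the final partial batch.
torch_ds = connector.to_torch_dataset(batch_size=4, drop_last_batch=True)
num_batches = sum(1 for _ in DataLoader(torch_ds, batch_size=None))  # 2, not 3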