Snowpark ML Data Connector¶
The DataConnector module provides a unified interface for reading Snowflake data and converting it into formats compatible with popular machine learning frameworks like PyTorch and TensorFlow.
DataConnector automatically uses optimized data ingestion when running in the Container Runtime for ML.
Note
This topic assumes that the Snowpark ML module is installed. If it isn’t, see Using Snowflake ML Locally.
Creating a DataConnector¶
You can create a DataConnector instance in several ways:
From a SQL query, using DataConnector.from_sql()
From a Snowpark DataFrame, using DataConnector.from_dataframe()
From a Snowflake Dataset, using DataConnector.from_dataset()
from snowflake.ml import dataset
from snowflake.ml.data import DataConnector
# Create a DataConnector from a SQL query
connector = DataConnector.from_sql("SELECT * FROM my_table", session=session)
# Create a DataConnector from a Snowpark DataFrame
df = session.table("my_table")
connector = DataConnector.from_dataframe(df)
# Create a DataConnector from a Snowflake Dataset
ds = dataset.load_dataset(session, "my_dataset", "v1")
connector = DataConnector.from_dataset(ds)
Using with PyTorch¶
For use with PyTorch, use the to_torch_dataset() method to get an IterableDataset, which can then be passed to a PyTorch DataLoader. The DataLoader iterates over the data and yields batched PyTorch tensors. Data is loaded in a streaming fashion for maximum efficiency.
from torch.utils.data import DataLoader
torch_ds = connector.to_torch_dataset(
    batch_size=4,
    shuffle=True,
    drop_last_batch=True,
)

for batch in DataLoader(torch_ds, batch_size=None, num_workers=0):
    print(batch)
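The yielded batches can be consumed directly in a training loop. The following is a minimal sketch, assuming the data contains numeric columns named FEATURE_1, FEATURE_2, and LABEL (hypothetical names; substitute your own schema):

import torch

# Hypothetical single-layer model for two numeric features.
model = torch.nn.Linear(in_features=2, out_features=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for batch in DataLoader(torch_ds, batch_size=None, num_workers=0):
    # Each batch maps column names to tensors of length batch_size.
    features = torch.stack([batch["FEATURE_1"], batch["FEATURE_2"]], dim=1).float()
    labels = batch["LABEL"].float().unsqueeze(1)

    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()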
Using with TensorFlow¶
For use with TensorFlow, use the to_tf_dataset() method to get a TensorFlow Dataset. Iterating over the Dataset yields batched TensorFlow tensors. Data is loaded in a streaming fashion for maximum efficiency.
tf_ds = connector.to_tf_dataset(
    batch_size=4,
    shuffle=True,
    drop_last_batch=True,
)

for batch in tf_ds:
    print(batch)
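The resulting Dataset can also be fed to Keras. The following is a minimal sketch, assuming numeric columns named FEATURE_1, FEATURE_2, and LABEL (hypothetical names); because each batch is a mapping of column names to tensors, the sketch first maps it into a (features, label) tuple:

import tensorflow as tf

# Hypothetical column names; substitute your own schema.
def split_features_and_label(batch):
    features = tf.stack([batch["FEATURE_1"], batch["FEATURE_2"]], axis=1)
    return tf.cast(features, tf.float32), tf.cast(batch["LABEL"], tf.float32)

train_ds = tf_ds.map(split_features_and_label)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")
model.fit(train_ds, epochs=1)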
Data Processing Options¶
Shuffling¶
Pass shuffle=True to randomly shuffle the data during ingestion. This can help prevent overfitting during model training. For a discussion of the value of shuffling, see Why should the data be shuffled for machine learning tasks?
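Shuffling is enabled the same way for both frameworks. A minimal sketch using the connector created earlier (the batch size shown is illustrative):

# Rows are shuffled randomly as they are ingested.
torch_ds = connector.to_torch_dataset(batch_size=32, shuffle=True)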
Batching¶
Use the batch_size parameter to control the size of data batches. Batching is handled efficiently at the data ingestion level. When using PyTorch DataLoaders, you must explicitly pass batch_size=None when instantiating the DataLoader to prevent double batching. See Using with PyTorch for an example of usage with a DataLoader.
You can also drop the last batch if it is incomplete by passing drop_last_batch=True to to_torch_dataset or to_tf_dataset.
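The sketch below illustrates these batching options together, using the connector created earlier (the batch size shown is illustrative):

from torch.utils.data import DataLoader

# Batches of 16 rows are formed during ingestion; a final partial
# batch, if any, is dropped because drop_last_batch=True.
torch_ds = connector.to_torch_dataset(
    batch_size=16,
    drop_last_batch=True,
)

# batch_size=None disables the DataLoader's own batching, since
# batches are already formed by the connector.
loader = DataLoader(torch_ds, batch_size=None)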