Snowpark ML Framework Connectors¶
Snowpark ML includes support for secure, scalable data provisioning for the PyTorch and Tensorflow frameworks, both of which expect data in their own specific formats. To simplify this workflow, the Snowpark ML library provides convenient methods built on top of the FileSet API to provide data from a FileSet as PyTorch Tensors or TensorFlow Records. (A FileSet represents an immutable snapshot of the result of a SQL query in the form of files in an internal server-side encrypted stage.)
This topic assumes that the Snowpark ML module is installed. If it isn’t, see Installing Snowpark ML.
Creating a FileSet from a Query¶
Refer to Creating and Using a FileSet for information on creating a FileSet from the data you want to use with PyTorch or TensorFlow. Then continue to one of the following sections:
Feeding a FileSet to PyTorch¶
From a Snowflake FileSet, you can get a PyTorch DataPipe, which can be passed to a PyTorch DataLoader. The DataLoader
iterates over the FileSet data and yields batched PyTorch tensors. Create the DataPipe using the FileSet’s
to_torch_datapipe method, and then pass the DataPipe to PyTorch’s
from torch.utils.data import DataLoader # See later sections about shuffling and batching pipe = fileset_df.to_torch_datapipe( batch_size=4, shuffle=True, drop_last_batch=True) for batch in DataLoader(pipe, batch_size=None, num_workers=0): print(batch) break
Feeding a FileSet to TensorFlow¶
You can get a TensorFlow Dataset from a Snowflake FileSet using the FileSet’s
import tensorflow as tf # See following sections about shuffling and batching ds = fileset_df.to_tf_dataset( batch_size=4, shuffle=True, drop_last_batch=True) for batch in ds: print(batch) break
Iterating over the Dataset yields batched tensors.
Shuffling Data in FileSets¶
It is often valuable to shuffle the training data to avoid overfitting and other issues. For a discussion of the value of shuffling, see Why should the data be shuffled for machine learning tasks?
If your query does not already shuffle your data sufficiently, a FileSet can shuffle data at two points:
When the FileSet is created by using
All rows in your query are shuffled before they are written to the FileSet. This is a high-quality global shuffle and can be expensive with large datasets. Therefore, it is performed only once, when materializing the FileSet. Pass
shuffle=Trueas a keyword argument to
When you create a PyTorch DataPipe or a TensorFlow Dataset from a FileSet.
At this point, the order of the files in the FileSet is randomized, as is the order of the rows in each file. This can be considered an “approximate” global shuffle. It is of lower quality than a true global shuffle, but it is much less expensive. To shuffle at this stage, pass
shuffle=Trueas a keyword argument to the FileSet’s
For best results, shuffle twice: when creating the FileSet and when feeding the data to PyTorch or TensorFlow.
Batching Data in FileSets¶
FileSets have a batching feature that works the same as the batching functionality in PyTorch and TensorFlow but is
more efficient. Snowflake recommends that you use the
batch_size parameter in the FileSet’s
to_tf_dataset methods instead of having PyTorch or TensorFlow do the batching. With PyTorch, to disable its batching functionality, you must
batch_size=None when instantiating
You can also drop the last batch if it is incomplete by passing
to_torch_datapipe or to