Snowpark ML FileSystem and FileSet¶
Note
Snowpark ML 1.5.0 introduced Dataset, an immutable, versioned snapshot designed for use in machine learning applications. For most use cases, it is superior to the FileSet API described in this topic. The FileSet API is still supported at this time, although it is a Preview feature and will not be made Generally Available.
The Snowpark ML library includes FileSystem, an abstraction that is similar to a file system for an internal, server-side encrypted Snowflake stage. Specifically, it is an fsspec AbstractFileSystem implementation. The library also includes FileSet, a related class that allows you to move machine learning data from a Snowflake table to the stage, and from there feed the data to PyTorch or TensorFlow (see Snowpark ML Framework Connectors).
Tip
Most users should use the newer Dataset API for creating immutable, governed data snapshots in Snowflake and using them in end-to-end machine learning workflows.
Installation¶
The FileSystem and FileSet APIs are part of the Snowpark ML Python package, snowflake-ml-python. See Using Snowflake ML Locally for installation instructions.
Creating and Using a File System¶
Creating a Snowpark ML file system requires either a Snowflake Connector for Python Connection object or a Snowpark Python Session. See Connecting to Snowflake for instructions.
After you have either a connection or a session, you can create a Snowpark ML SFFileSystem instance through which you can access data in your internal stage.
If you have a Snowflake Connector for Python connection, pass it as the sf_connection argument:
import fsspec
from snowflake.ml.fileset import sfcfs
sf_fs1 = sfcfs.SFFileSystem(sf_connection=sf_connection)
If you have a Snowpark Python session, pass it as the snowpark_session argument:
import fsspec
from snowflake.ml.fileset import sfcfs
sf_fs2 = sfcfs.SFFileSystem(snowpark_session=sp_session)
SFFileSystem inherits many features from fsspec.FileSystem, such as local caching of files. You can enable this and other features by instantiating a Snowflake file system through the fsspec.filesystem factory function, passing target_protocol="sfc" to use the Snowflake FileSystem implementation:
local_cache_path = "/tmp/sf_files/"
cached_fs = fsspec.filesystem("cached", target_protocol="sfc",
                              target_options={"sf_connection": sf_connection,
                                              "cache_types": "bytes",
                                              "block_size": 32 * 2**20},
                              cache_storage=local_cache_path)
The Snowflake file system supports most read-only methods defined for an fsspec FileSystem, including find, info, isdir, isfile, and exists.
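As a minimal sketch of how these read-only methods can be combined, the helper below summarizes a path using only exists, isdir, isfile, and info. The name describe_path is hypothetical and not part of Snowpark ML. It should work with any fsspec-style file system object; which keys the info result contains for stage files is an assumption, so the size is read defensively.

```python
def describe_path(fs, path):
    """Summarize a path using read-only fsspec-style methods.

    Hypothetical helper, not part of Snowpark ML. `fs` is any object that
    provides exists, isdir, isfile, and info (for example, an SFFileSystem).
    """
    if not fs.exists(path):
        return {"path": path, "exists": False}
    summary = {
        "path": path,
        "exists": True,
        "is_dir": fs.isdir(path),
        "is_file": fs.isfile(path),
    }
    if summary["is_file"]:
        # info() returns a dict in fsspec; "size" is conventional, but it is
        # read with .get() in case an implementation omits it.
        summary["size"] = fs.info(path).get("size")
    return summary
```

With a file system created as described earlier, a call such as describe_path(sf_fs1, "@ML_DATASETS.public.my_models/test/") would report whether that stage directory exists without reading any file contents.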
Specifying Files¶
To specify files in a stage, use a path in the form @database.schema.stage/file_path.
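To make the path form concrete, here is a small sketch that splits such a path into its four parts. The StagePath type and parse_stage_path function are hypothetical names used for illustration only, and the sketch assumes unquoted identifiers that contain no dots.

```python
from typing import NamedTuple


class StagePath(NamedTuple):
    database: str
    schema: str
    stage: str
    file_path: str


def parse_stage_path(path: str) -> StagePath:
    """Split '@database.schema.stage/file_path' into its four parts.

    Hypothetical helper for illustration; assumes unquoted identifiers
    that contain no dots.
    """
    if not path.startswith("@"):
        raise ValueError(f"stage path must start with '@': {path!r}")
    # Everything before the first "/" is the stage location; the rest is
    # the path of the file or directory inside the stage.
    location, _, file_path = path[1:].partition("/")
    parts = location.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected @database.schema.stage/..., got {path!r}")
    return StagePath(*parts, file_path)


print(parse_stage_path("@ML_DATASETS.public.my_models/sales_predict/"))
```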
Listing Files¶
Use the file system’s ls method to get a list of the files in the stage:
print(*cached_fs.ls("@ML_DATASETS.public.my_models/sales_predict/"), sep='\n')
Opening and Reading Files¶
You can open files in the stage by using the file system’s open method. You can then read the files by using the same methods you use with ordinary Python files. The file object is also a context manager that can be used with Python’s with statement, so it is automatically closed when it’s no longer needed.
path = '@ML_DATASETS.public.my_models/test/data_7_7_3.snappy.parquet'
with sf_fs1.open(path, mode='rb') as f:
    print(f.read(16))
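Because the opened object behaves like an ordinary binary file, generic file-handling code can be reused unchanged. The sketch below computes a checksum by reading in fixed-size chunks; sha256_of is a hypothetical helper, and an in-memory io.BytesIO stands in for a stage file so that the example runs without a Snowflake connection.

```python
import hashlib
import io


def sha256_of(fileobj, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a binary file-like object.

    Reads in fixed-size chunks so a large file is never held in memory
    all at once.
    """
    digest = hashlib.sha256()
    # iter() with a sentinel stops when read() returns b"" at end of file.
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Local stand-in for a stage file. With Snowpark ML, the file object would
# instead come from sf_fs1.open(path, mode='rb').
with io.BytesIO(b"example bytes") as f:
    print(sha256_of(f))
```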
You can also use the SFFileSystem instance with other components that accept fsspec file systems. Here, the Parquet data file mentioned in the previous code block is passed to PyArrow’s read_table function:
import pyarrow.parquet as pq
table = pq.read_table(path, filesystem=sf_fs1)
table.take([1, 3])
Python components that accept files (or file-like objects) can be passed a file object opened from the Snowflake file system. For example, if you have a gzip-compressed file in your stage, you can use it with Python’s gzip module by passing it to gzip.GzipFile as the fileobj parameter:
import gzip

path = "sfc://@ML_DATASETS.public.my_models/dataset.csv.gz"
with cached_fs.open(path, mode='rb', sf_connection=sf_connection) as f:
    g = gzip.GzipFile(fileobj=f)
    # Print the first three decompressed lines
    for i in range(3):
        print(g.readline())
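The same pattern can be tried locally without a stage. In the sketch below, an in-memory io.BytesIO stands in for the file object returned by open (an assumption made only so the example is self-contained), and io.TextIOWrapper decodes the decompressed bytes into text lines.

```python
import gzip
import io

# Build a small gzip-compressed CSV in memory. This stands in for the file
# object that open() would return for a .csv.gz file in a stage.
raw = io.BytesIO()
with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
    gz.write(b"id,amount\n1,10.5\n2,7.25\n")
raw.seek(0)

# Wrap the binary stream for decompression, then decode bytes to text.
with gzip.GzipFile(fileobj=raw) as g, io.TextIOWrapper(g, encoding="utf-8") as text:
    lines = [line.rstrip("\n") for line in text]

print(lines)
```

Wrapping the stream in io.TextIOWrapper is one way to get str lines rather than bytes, which is usually what CSV parsing code expects.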
Creating and Using a FileSet¶
A Snowflake FileSet represents an immutable snapshot of the result of a SQL query in the form of files in an internal
server-side encrypted stage. These files can be accessed through a FileSystem to feed data to tools such as PyTorch and
TensorFlow so that you can train models at scale and within your existing data governance model. To create a FileSet, use the FileSet.make method.
You need a Snowflake Connector for Python connection or a Snowpark session to create a FileSet. See Connecting to Snowflake for instructions. You must also provide the path to an existing internal server-side encrypted stage, or a subdirectory under such a stage, where the FileSet will be stored.
To create a FileSet from a Snowpark DataFrame, construct a DataFrame and pass it to FileSet.make as snowpark_dataframe; do not call the DataFrame’s collect method:
from snowflake.ml.fileset import fileset

# Snowpark Python equivalent of "SELECT * FROM MYDATA LIMIT 5000000"
df = snowpark_session.table('mydata').limit(5000000)
fileset_df = fileset.FileSet.make(
    target_stage_loc="@ML_DATASETS.public.my_models/",
    name="from_dataframe",
    snowpark_dataframe=df,
    shuffle=True,
)
To create a FileSet using a Snowflake Connector for Python connection, pass the connection to FileSet.make as sf_connection, and pass the SQL query as query:
fileset_sf = fileset.FileSet.make(
    target_stage_loc="@ML_DATASETS.public.my_models/",
    name="from_connector",
    sf_connection=sf_connection,
    query="SELECT * FROM MYDATA LIMIT 5000000",
    shuffle=True,  # see later section about shuffling
)
Note
See Shuffling Data in FileSets for information about shuffling your data by using the shuffle parameter.
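The two ways of calling FileSet.make shown above differ only in which data source arguments are passed. The sketch below makes that explicit with a hypothetical helper, fileset_make_kwargs, which is not part of Snowpark ML. It uses only the parameter names that appear in the examples above, and the rule that exactly one source must be given is this sketch's own policy rather than a documented guarantee of the library.

```python
def fileset_make_kwargs(target_stage_loc, name, *, snowpark_dataframe=None,
                        sf_connection=None, query=None, shuffle=False):
    """Build keyword arguments for FileSet.make from exactly one data source.

    Hypothetical helper, not part of Snowpark ML. The exactly-one check is
    this sketch's own policy.
    """
    from_dataframe = snowpark_dataframe is not None
    from_query = sf_connection is not None
    if from_dataframe == from_query:
        raise ValueError("pass either snowpark_dataframe, or sf_connection with query")
    kwargs = {"target_stage_loc": target_stage_loc, "name": name, "shuffle": shuffle}
    if from_dataframe:
        if query is not None:
            raise ValueError("query is only used together with sf_connection")
        kwargs["snowpark_dataframe"] = snowpark_dataframe
    else:
        if not query:
            raise ValueError("sf_connection requires a SQL query")
        kwargs.update(sf_connection=sf_connection, query=query)
    return kwargs
```

The result could then be unpacked into the real call, for example fileset.FileSet.make(**fileset_make_kwargs("@ML_DATASETS.public.my_models/", "from_dataframe", snowpark_dataframe=df, shuffle=True)).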
Use the files method to get a list of the files in the FileSet:
print(*fileset_df.files())
For information about feeding the data in the FileSet to PyTorch or TensorFlow, see Snowpark ML Framework Connectors.