snowflake.ml.fileset.fileset.FileSet

class snowflake.ml.fileset.fileset.FileSet(*, target_stage_loc: str, name: str, sf_connection: Optional[SnowflakeConnection] = None, snowpark_session: Optional[Session] = None)

Bases: object

A FileSet represents an immutable snapshot of the result of a query in the form of files.

Create a FileSet based on an existing stage directory.

It can be used to restore an existing FileSet that was not deleted before.

Parameters:
  • sf_connection – A Snowflake python connection object. Mutually exclusive to snowpark_session.

  • snowpark_session – A Snowpark Session object. Mutually exclusive to sf_connection.

  • target_stage_loc – A string of the Snowflake stage path where the FileSet will be stored. It needs to be an absolute path with the form of “@{database}.{schema}.{stage}/{optional directory}/”.

  • name – The name of the FileSet. It is the name of the directory which holds result stage files.

Raises:

SnowflakeMLException – An error occurred when not exactly one of sf_connection and snowpark_session is given.
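The required form of target_stage_loc can be checked up front with a small helper. This is a hypothetical sketch, not part of the snowflake-ml API; the regex merely mirrors the documented "@{database}.{schema}.{stage}/{optional directory}/" form:

```python
import re

# Hypothetical validator mirroring the documented stage path form;
# not part of snowflake-ml. Identifier rules here are simplified.
_STAGE_LOC_RE = re.compile(r"^@[\w$]+\.[\w$]+\.[\w$]+(/.*)?$")

def is_valid_stage_loc(loc: str) -> bool:
    """Return True if loc looks like '@{database}.{schema}.{stage}/{optional dir}/'."""
    return _STAGE_LOC_RE.match(loc) is not None
```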

Example:

>>> # Create a new FileSet using Snowflake Python connection
>>> conn = snowflake.connector.connect(**connection_parameters)
>>> my_fileset = snowflake.ml.fileset.FileSet.make(
>>>     target_stage_loc="@mydb.myschema.mystage/mydir",
>>>     name="helloworld",
>>>     sf_connection=conn,
>>>     query="SELECT * FROM Mytable limit 1000000",
>>> )
>>> my_fileset.files()
['sfc://@mydb.myschema.mystage/mydir/helloworld/data_0_0_0.snappy.parquet']
>>> # Now we can restore the FileSet in another program as long as the FileSet is not deleted
>>> conn = snowflake.connector.connect(**connection_parameters)
>>> my_fileset_pointer = FileSet(sf_connection=conn,
                                 target_stage_loc="@mydb.myschema.mystage/mydir",
                                 name="helloworld")
>>> my_fileset_pointer.files()
['sfc://@mydb.myschema.mystage/mydir/helloworld/data_0_0_0.snappy.parquet']

Methods

delete() None

Delete the FileSet directory and all the stage files in it.

If not called, the FileSet and all its stage files will stay in Snowflake stage.

Raises:

SnowflakeMLException – An error occurred when the FileSet cannot get deleted.

This function or method is in private preview since 0.2.0.

files() list[str]

Get the list of stage file paths in the current FileSet.

The stage file paths follow the sfc protocol.

Returns:

A list of stage file paths

Example:

>>> my_fileset = FileSet(sf_connection=conn, target_stage_loc="@mydb.myschema.mystage", name="test")
>>> my_fileset.files()
["sfc://@mydb.myschema.mystage/test/hello_world_0_0_0.snappy.parquet",
 "sfc://@mydb.myschema.mystage/test/hello_world_0_0_1.snappy.parquet"]

This function or method is in private preview since 0.2.0.

fileset_stage_location() str

Get the stage path to the current FileSet in sfc protocol.

Returns:

A string representing the stage path

Example:

>>> my_fileset = FileSet(sf_connection=conn, target_stage_loc="@mydb.myschema.mystage", name="test")
>>> my_fileset.fileset_stage_location()
'sfc://@mydb.myschema.mystage/test/'

This function or method is in private preview since 0.2.0.

classmethod make(*, target_stage_loc: str, name: str, snowpark_dataframe: Optional[DataFrame] = None, sf_connection: Optional[SnowflakeConnection] = None, query: str = '', shuffle: bool = False) FileSet

Creates a FileSet object given a SQL query.

The result FileSet object captures the query result deterministically as stage files.

Parameters:
  • target_stage_loc – A string of the Snowflake stage path where the FileSet will be stored. It needs to be an absolute path with the form of “@{database}.{schema}.{stage}/{optional directory}/”.

  • name – The name of the FileSet. It will become the name of the directory which holds result stage files. If there is already a FileSet with the same name in the given stage location, an exception will be raised.

  • snowpark_dataframe – A Snowpark Dataframe. Mutually exclusive to (sf_connection, query).

  • sf_connection – A Snowflake python connection object. Must be provided if query is provided.

  • query – A string of Snowflake SQL query to be executed. Mutually exclusive to snowpark_dataframe. Must also specify sf_connection.

  • shuffle – A boolean indicating whether the data should be shuffled globally. Defaults to False.

Returns:

A FileSet object.

Raises:
  • ValueError – An error occurred when not exactly one of snowpark_dataframe and (sf_connection, query) is given.

  • FileSetExistError – An error occurred when a FileSet with the same name exists in the given path.

  • FileSetError – An error occurred when the SQL query/dataframe is not able to get materialized.

Note: During the generation of stage files, data casting will occur. The casting rules are as follows:
  • Data casting:
    • DecimalType(NUMBER):
      • If its scale is zero, cast to BIGINT

      • If its scale is non-zero, cast to FLOAT

    • DoubleType(DOUBLE): Cast to FLOAT.

    • ByteType(TINYINT): Cast to SMALLINT.

    • ShortType(SMALLINT): Cast to SMALLINT.

    • IntegerType(INT): Cast to INT.

    • LongType(BIGINT): Cast to BIGINT.

  • No action:
    • FloatType(FLOAT): No action.

    • StringType(String): No action.

    • BinaryType(BINARY): No action.

    • BooleanType(BOOLEAN): No action.

  • Not supported:
    • ArrayType(ARRAY): Not supported. A warning will be logged.

    • MapType(OBJECT): Not supported. A warning will be logged.

    • TimestampType(TIMESTAMP): Not supported. A warning will be logged.

    • TimeType(TIME): Not supported. A warning will be logged.

    • DateType(DATE): Not supported. A warning will be logged.

    • VariantType(VARIANT): Not supported. A warning will be logged.
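The casting table above can be summarized as a plain mapping. This is an illustrative restatement of the documented rules, not code from the library; the keys are the Snowpark type names used above:

```python
from typing import Optional

# Illustrative summary of the documented casting rules; not part of snowflake-ml.
CAST_RULES = {
    "DoubleType": "FLOAT",
    "ByteType": "SMALLINT",
    "ShortType": "SMALLINT",
    "IntegerType": "INT",
    "LongType": "BIGINT",
    # Types written to the stage files unchanged ("no action").
    "FloatType": None,
    "StringType": None,
    "BinaryType": None,
    "BooleanType": None,
}

UNSUPPORTED = {"ArrayType", "MapType", "TimestampType", "TimeType", "DateType", "VariantType"}

def cast_target(type_name: str, scale: int = 0) -> Optional[str]:
    """Return the cast target per the rules above, or None when no action is taken."""
    if type_name in UNSUPPORTED:
        raise ValueError(f"{type_name} is not supported; the library logs a warning")
    if type_name == "DecimalType":
        # NUMBER: zero scale -> BIGINT, non-zero scale -> FLOAT.
        return "BIGINT" if scale == 0 else "FLOAT"
    return CAST_RULES[type_name]
```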

Example 1: Create a FileSet with Snowflake Python connection

>>> conn = snowflake.connector.connect(**connection_parameters)
>>> my_fileset = snowflake.ml.fileset.FileSet.make(
>>>     target_stage_loc="@mydb.myschema.mystage/mydir",
>>>     name="helloworld",
>>>     sf_connection=conn,
>>>     query="SELECT * FROM mytable limit 1000000",
>>> )
>>> my_fileset.files()
['sfc://@mydb.myschema.mystage/mydir/helloworld/data_0_0_0.snappy.parquet']

Example 2: Create a FileSet with a Snowpark dataframe

>>> new_session = snowflake.snowpark.Session.builder.configs(connection_parameters).create()
>>> df = new_session.sql("SELECT * FROM Mytable limit 1000000")
>>> my_fileset = snowflake.ml.fileset.FileSet.make(
>>>     target_stage_loc="@mydb.myschema.mystage/mydir",
>>>     name="helloworld",
>>>     snowpark_dataframe=df,
>>> )
>>> my_fileset.files()
['sfc://@mydb.myschema.mystage/mydir/helloworld/data_0_0_0.snappy.parquet']

to_snowpark_dataframe() DataFrame

Convert the FileSet to a Snowpark dataframe.

Only parquet files owned by the FileSet will be read and converted. The parquet files materialized by a FileSet have the name pattern “data_<query_id>_<some_sharding_order>.snappy.parquet”.
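The ownership check implied by that name pattern can be sketched with a regex. This is a hypothetical illustration; the library's actual filtering logic may be stricter:

```python
import re

# Hypothetical sketch of the documented name pattern
# "data_<query_id>_<some_sharding_order>.snappy.parquet";
# not the library's real ownership check.
_OWNED_PARQUET_RE = re.compile(r"^data_.+\.snappy\.parquet$")

def is_fileset_parquet(filename: str) -> bool:
    """Return True if filename matches the FileSet-materialized name pattern."""
    return _OWNED_PARQUET_RE.match(filename) is not None
```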

Returns:

A Snowpark dataframe that contains the data of this FileSet.

Note: The dataframe generated by this method might not have the same schema as the original one. Specifically,
  • NUMBER type with scale != 0 will become float.

  • Unsupported types (see comments of make()) will not have any guarantee.

    For example, an OBJECT column may be scanned back as a STRING column.

This function or method is in private preview since 0.2.0.

to_tf_dataset(*, batch_size: int, shuffle: bool = False, drop_last_batch: bool = True) Any

Transform the Snowflake data into a ready-to-use TensorFlow tf.data.Dataset.

Parameters:
  • batch_size – Specifies the size of each data batch yielded by the resulting dataset.

  • shuffle – It specifies whether the data will be shuffled. If True, files will be shuffled, and rows in each file will also be shuffled.

  • drop_last_batch – Whether the last batch of data should be dropped. If True, the last batch is dropped when its size is smaller than the given batch_size.
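The drop_last_batch behavior can be illustrated with a plain-Python batching sketch (illustrative only, not the library's implementation):

```python
def batch(rows, batch_size, drop_last_batch=True):
    """Group rows into fixed-size batches, optionally dropping a short final batch."""
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    # Drop the final batch when it is smaller than batch_size, mirroring
    # the documented drop_last_batch=True behavior.
    if drop_last_batch and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches
```

For 10 rows and batch_size=3, this yields three full batches when drop_last_batch is True and a fourth, single-row batch when it is False.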

Returns:

A tf.data.Dataset that yields batched tf.Tensors.

Examples:

>>> conn = snowflake.connector.connect(**connection_parameters)
>>> fileset = FileSet.make(
>>>     sf_connection=conn, name="helloworld", target_stage_loc="@mydb.myschema.mystage",
>>>     query="SELECT * FROM Mytable"
>>> )
>>> dp = fileset.to_tf_dataset(batch_size=1)
>>> for data in dp:
>>>     print(data)
----
{'_COL_1': <tf.Tensor: shape=(1,), dtype=int64, numpy=[10]>}

This function or method is in private preview since 0.2.0.

to_torch_datapipe(*, batch_size: int, shuffle: bool = False, drop_last_batch: bool = True) Any

Transform the Snowflake data into a ready-to-use Pytorch datapipe.

Return a Pytorch datapipe which iterates on rows of data.

Parameters:
  • batch_size – Specifies the size of each data batch yielded by the resulting datapipe.

  • shuffle – It specifies whether the data will be shuffled. If True, files will be shuffled, and rows in each file will also be shuffled.

  • drop_last_batch – Whether the last batch of data should be dropped. If True, the last batch is dropped when its size is smaller than the given batch_size.

Returns:

A Pytorch iterable datapipe that yields data.

Examples:

>>> conn = snowflake.connector.connect(**connection_parameters)
>>> fileset = FileSet.make(
>>>     sf_connection=conn, name="helloworld", target_stage_loc="@mydb.myschema.mystage",
>>>     query="SELECT * FROM Mytable"
>>> )
>>> dp = fileset.to_torch_datapipe(batch_size=1)
>>> for data in dp:
>>>     print(data)
----
{'_COL_1':[10]}

This function or method is in private preview since 0.2.0.

Attributes

name

Get the name of the FileSet.