snowflake.ml.dataset.DatasetReader

class snowflake.ml.dataset.DatasetReader(ingestor: DataIngestor, *, snowpark_session: Session)

Bases: DataConnector

Snowflake Dataset abstraction that provides application integration connectors.

Methods

files() List[str]

Get the list of remote file paths for the current DatasetVersion.

The file paths follow the snow protocol.

Returns:

A list of remote file paths

Example:

>>> dsv.files()
["snow://dataset/mydb.myschema.mydataset/versions/test/data_0_0_0.snappy.parquet",
 "snow://dataset/mydb.myschema.mydataset/versions/test/data_0_0_1.snappy.parquet"]
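The structure of these paths can be illustrated with a small parser. This is a minimal sketch: the layout `snow://dataset/<dataset name>/versions/<version>/<file>` is inferred only from the example above, and `SnowPath` and `parse_snow_path` are hypothetical helpers, not part of snowflake.ml.

```python
from typing import NamedTuple


class SnowPath(NamedTuple):
    """Hypothetical container for the parts of a snow:// dataset path."""
    dataset: str
    version: str
    filename: str


def parse_snow_path(path: str) -> SnowPath:
    """Split a path shaped like the files() example into its components.

    Assumes the layout snow://dataset/<dataset>/versions/<version>/<file>,
    which is inferred from the documented example only.
    """
    prefix = "snow://dataset/"
    if not path.startswith(prefix):
        raise ValueError(f"not a snow dataset path: {path!r}")
    dataset, marker, rest = path[len(prefix):].partition("/versions/")
    if not marker:
        raise ValueError(f"missing /versions/ segment: {path!r}")
    version, _, filename = rest.partition("/")
    return SnowPath(dataset, version, filename)


example = "snow://dataset/mydb.myschema.mydataset/versions/test/data_0_0_0.snappy.parquet"
parsed = parse_snow_path(example)
print(parsed.dataset, parsed.version, parsed.filename)
```

To read the file contents rather than inspect the names, pass the paths to the fsspec file system returned by filesystem().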

filesystem() SnowFileSystem

Return an fsspec FileSystem that can be used to load the DatasetVersion’s files().

classmethod from_dataframe(df: DataFrame, ingestor_class: Optional[Type[DataIngestor]] = None, **kwargs: Any) DatasetReader

This function or method has been in private preview since 1.6.0.

classmethod from_dataset(ds: Dataset, ingestor_class: Optional[Type[DataIngestor]] = None, **kwargs: Any) DataConnectorType

classmethod from_sources(session: Session, sources: List[Union[DataFrameInfo, DatasetInfo, str]], ingestor_class: Optional[Type[DataIngestor]] = None, **kwargs: Any) DataConnectorType

to_pandas(limit: Optional[int] = None) pd.DataFrame

Retrieve the Snowflake data as a Pandas DataFrame.

Parameters:

limit – If specified, the maximum number of rows to load into the DataFrame.

Returns:

A Pandas DataFrame.

to_snowpark_dataframe(only_feature_cols: bool = False) DataFrame

Convert the DatasetVersion to a Snowpark DataFrame.

Parameters:

only_feature_cols – If True, drops exclude_cols and label_cols from the returned DataFrame. The original DatasetVersion is unaffected.

Returns:

A Snowpark dataframe that contains the data of this DatasetVersion.

Note: The DataFrame generated by this method might not have the same schema as the original one. Specifically:
  • NUMBER columns with scale != 0 become float.

  • Unsupported types (see the comments of Dataset.create_version()) carry no guarantee.

    For example, an OBJECT column may be scanned back as a STRING column.
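The NUMBER-to-float change matters when exact decimal arithmetic is expected downstream. The following sketch uses only Python's standard library to show the general difference between fixed-point decimals and binary floats; it illustrates the consequence of the type change and does not reproduce how Snowflake performs the conversion.

```python
from decimal import Decimal

# Fixed-point values, comparable to a NUMBER column with scale 2.
a = Decimal("0.10")
b = Decimal("0.20")

# Decimal arithmetic is exact for these values.
exact_total = a + b
print(exact_total == Decimal("0.30"))  # True

# After conversion to float, the same sum carries binary rounding error.
float_total = float(a) + float(b)
print(float_total == 0.3)  # False
```

Code that compares such columns for equality after a round trip is safer using a tolerance (for example, math.isclose) or casting back to a decimal type.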

to_tf_dataset(*, batch_size: int, shuffle: bool = False, drop_last_batch: bool = True) tf.data.Dataset

Transform the Snowflake data into a ready-to-use TensorFlow tf.data.Dataset.

Parameters:
  • batch_size – The number of rows in each batch yielded by the resulting dataset.

  • shuffle – Whether to shuffle the data. If True, the files are shuffled, and the rows within each file are also shuffled.

  • drop_last_batch – Whether to drop the last batch. If True, the last batch is dropped when it contains fewer than batch_size rows.

Returns:

A tf.data.Dataset that yields batched tf.Tensors.
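The interaction between batch_size and drop_last_batch can be sketched in plain Python. This is an illustration of the documented semantics only, not the library's implementation; `batch_rows` is a hypothetical helper that works on ordinary Python iterables rather than tensors.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_rows(
    rows: Iterable[T], batch_size: int, drop_last_batch: bool = True
) -> Iterator[List[T]]:
    """Group rows into lists of batch_size; optionally drop a short final batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # A leftover batch is smaller than batch_size by construction.
    if batch and not drop_last_batch:
        yield batch


rows = list(range(10))
kept = list(batch_rows(rows, batch_size=4, drop_last_batch=False))
dropped = list(batch_rows(rows, batch_size=4, drop_last_batch=True))
print(kept)     # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(dropped)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With the default drop_last_batch=True, every batch has the same size, which is convenient for models that expect a fixed batch dimension. The cost is that up to batch_size - 1 rows are skipped.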

to_torch_datapipe(*, batch_size: int, shuffle: bool = False, drop_last_batch: bool = True) torch_data.IterDataPipe

Transform the Snowflake data into a ready-to-use PyTorch datapipe.

Return a PyTorch datapipe which iterates on rows of data.

Parameters:
  • batch_size – The number of rows in each batch yielded by the resulting datapipe.

  • shuffle – Whether to shuffle the data. If True, the files are shuffled, and the rows within each file are also shuffled.

  • drop_last_batch – Whether to drop the last batch. If True, the last batch is dropped when it contains fewer than batch_size rows.

Returns:

A PyTorch iterable datapipe that yields data.
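The two-level shuffle described for the shuffle parameter (file order first, then rows within each file) can be sketched without PyTorch. This is a plain-Python illustration under the assumption that files are read one at a time; `shuffled_rows` and the in-memory `files` mapping are hypothetical stand-ins, and the actual implementation may differ.

```python
import random
from typing import Dict, Iterator, List, Optional


def shuffled_rows(
    files: Dict[str, List[int]], seed: Optional[int] = None
) -> Iterator[int]:
    """Yield rows after shuffling the file order and the rows inside each file."""
    rng = random.Random(seed)
    names = list(files)
    rng.shuffle(names)            # level 1: shuffle the order of the files
    for name in names:
        rows = list(files[name])  # copy so the input mapping is not mutated
        rng.shuffle(rows)         # level 2: shuffle the rows within this file
        yield from rows


# Three stand-in "files" of three rows each.
files = {
    "data_0_0_0": [0, 1, 2],
    "data_0_0_1": [3, 4, 5],
    "data_0_0_2": [6, 7, 8],
}
result = list(shuffled_rows(files, seed=0))
print(result)
```

In this sketch, rows from the same file stay adjacent in the output, so the result is not a uniform shuffle over all rows. If the library streams files the same way, datasets written in a meaningful order (for example, sorted by label) may benefit from being shuffled before they are saved.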

to_torch_dataset(*, shuffle: bool = False) torch_data.IterableDataset

Transform the Snowflake data into a PyTorch Iterable Dataset to be used with a DataLoader.

Return a PyTorch Dataset which iterates on rows of data.

Parameters:

shuffle – Whether to shuffle the data. If True, the files are shuffled, and the rows within each file are also shuffled.

Returns:

A PyTorch Iterable Dataset that yields data.

Attributes

data_sources