snowflake.ml.dataset.DatasetReader¶
- class snowflake.ml.dataset.DatasetReader(ingestor: DataIngestor, *, snowpark_session: Session)¶
Bases: DataConnector
Snowflake Dataset abstraction that provides application integration connectors.
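Example (a minimal usage sketch; assumes an active Snowpark session and an already-existing Dataset version, with "mydb.myschema.mydataset" and "v1" as hypothetical names):
>>> from snowflake.ml import dataset
>>> # Load an existing Dataset version; Dataset.read returns its DatasetReader.
>>> ds = dataset.load_dataset(session, "mydb.myschema.mydataset", "v1")
>>> reader = ds.read
>>> # Materialize the version's data, e.g. as a Pandas DataFrame.
>>> pdf = reader.to_pandas()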
Methods
- files() List[str] ¶
Get the list of remote file paths for the current DatasetVersion.
The file paths follow the snow:// protocol.
- Returns:
A list of remote file paths
Example:
>>> dsv.files()
["snow://dataset/mydb.myschema.mydataset/versions/test/data_0_0_0.snappy.parquet",
"snow://dataset/mydb.myschema.mydataset/versions/test/data_0_0_1.snappy.parquet"]
- filesystem() SnowFileSystem ¶
Return an fsspec FileSystem that can be used to load the DatasetVersion's files().
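Example (a sketch of reading one Parquet file through the returned filesystem; pyarrow is an assumed choice here, any Parquet reader that accepts a file object works):
>>> import pyarrow.parquet as pq
>>> fs = dsv.filesystem()
>>> # Open the first remote file listed by files() and parse it as Parquet.
>>> with fs.open(dsv.files()[0], "rb") as f:
...     table = pq.read_table(f)
>>> table.num_rows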
- classmethod from_dataframe(df: DataFrame, ingestor_class: Optional[Type[DataIngestor]] = None, **kwargs: Any) DatasetReader ¶
This method has been in private preview since version 1.6.0.
- classmethod from_dataset(ds: Dataset, ingestor_class: Optional[Type[DataIngestor]] = None, **kwargs: Any) DataConnectorType ¶
- classmethod from_sources(session: Session, sources: List[Union[DataFrameInfo, DatasetInfo, str]], ingestor_class: Optional[Type[DataIngestor]] = None, **kwargs: Any) DataConnectorType ¶
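Example (a hedged sketch of the from_dataset factory; that snowpark_session is forwarded through **kwargs to the constructor is an assumption, and the dataset names are hypothetical):
>>> from snowflake.ml import dataset
>>> from snowflake.ml.dataset import DatasetReader
>>> ds = dataset.load_dataset(session, "mydb.myschema.mydataset", "v1")
>>> reader = DatasetReader.from_dataset(ds, snowpark_session=session)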
- to_pandas(limit: Optional[int] = None) pd.DataFrame ¶
Retrieve the Snowflake data as a Pandas DataFrame.
- Parameters:
limit – If specified, the maximum number of rows to load into the DataFrame.
- Returns:
A Pandas DataFrame.
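Example (limit is useful for sampling large datasets without loading everything into memory):
>>> # Load only the first 1000 rows for quick inspection.
>>> pdf = reader.to_pandas(limit=1000)
>>> pdf.shape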
- to_snowpark_dataframe(only_feature_cols: bool = False) DataFrame ¶
Convert the DatasetVersion to a Snowpark DataFrame.
- Parameters:
only_feature_cols – If True, drops exclude_cols and label_cols from the returned DataFrame. The original DatasetVersion is unaffected.
- Returns:
A Snowpark dataframe that contains the data of this DatasetVersion.
- Note: The dataframe generated by this method might not have the same schema as the original one. Specifically:
- NUMBER type with scale != 0 will become float.
- Unsupported types (see comments of Dataset.create_version()) carry no round-trip guarantee. For example, an OBJECT column may be scanned back as a STRING column.
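Example (a sketch; only_feature_cols only has an effect if the DatasetVersion recorded label_cols or exclude_cols when it was created):
>>> sp_df = reader.to_snowpark_dataframe()
>>> sp_df.show()
>>> # Drop the recorded label/excluded columns from the result.
>>> features_df = reader.to_snowpark_dataframe(only_feature_cols=True)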
- to_tf_dataset(*, batch_size: int, shuffle: bool = False, drop_last_batch: bool = True) tf.data.Dataset ¶
Transform the Snowflake data into a ready-to-use TensorFlow tf.data.Dataset.
- Parameters:
batch_size – The size of each batch yielded by the resulting dataset.
shuffle – Whether the data will be shuffled. If True, the order of files is shuffled, and rows within each file are also shuffled.
drop_last_batch – Whether to drop the last batch if it contains fewer rows than batch_size.
- Returns:
A tf.data.Dataset that yields batched tf.Tensors.
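Example (a minimal sketch; that each element maps column names to batched tf.Tensors is an assumption consistent with the return description):
>>> tf_ds = reader.to_tf_dataset(batch_size=32, shuffle=True)
>>> for batch in tf_ds:
...     print({name: t.shape for name, t in batch.items()})
...     break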
- to_torch_datapipe(*, batch_size: int, shuffle: bool = False, drop_last_batch: bool = True) torch_data.IterDataPipe ¶
Transform the Snowflake data into a ready-to-use PyTorch datapipe.
Return a PyTorch datapipe that iterates over rows of data.
- Parameters:
batch_size – The size of each batch yielded by the resulting datapipe.
shuffle – Whether the data will be shuffled. If True, the order of files is shuffled, and rows within each file are also shuffled.
drop_last_batch – Whether to drop the last batch if it contains fewer rows than batch_size.
- Returns:
A PyTorch iterable datapipe that yields batched data.
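Example (a sketch; batch_size=None is passed to the DataLoader because batching already happens inside the datapipe):
>>> from torch.utils.data import DataLoader
>>> dp = reader.to_torch_datapipe(batch_size=32, shuffle=True)
>>> loader = DataLoader(dp, batch_size=None)
>>> for batch in loader:
...     break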
- to_torch_dataset(*, shuffle: bool = False) torch_data.IterableDataset ¶
Transform the Snowflake data into a PyTorch Iterable Dataset to be used with a DataLoader.
Return a PyTorch Dataset that iterates over rows of data.
- Parameters:
shuffle – Whether the data will be shuffled. If True, the order of files is shuffled, and rows within each file are also shuffled.
- Returns:
A PyTorch Iterable Dataset that yields data.
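Example (a sketch; shuffling is requested on the dataset itself because an IterableDataset does not support DataLoader-side shuffling):
>>> from torch.utils.data import DataLoader
>>> torch_ds = reader.to_torch_dataset(shuffle=True)
>>> # Batching is done by the DataLoader; the dataset yields individual rows.
>>> loader = DataLoader(torch_ds, batch_size=32)
>>> for batch in loader:
...     break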
Attributes
- data_sources¶