snowflake.ml.dataset.DatasetReader

class snowflake.ml.dataset.DatasetReader(ingestor: DataIngestor, *, snowpark_session: Optional[Session] = None)

Bases: DataConnector, SerializableSessionMixin

Snowflake Dataset abstraction that provides application integration connectors.

Methods

files() → list[str]

Get the list of remote file paths for the current DatasetVersion.

The file paths follow the snow protocol.

Returns: A list of remote file paths.

Example:

>>> dsv.files()
["snow://dataset/mydb.myschema.mydataset/versions/test/data_0_0_0.snappy.parquet",
 "snow://dataset/mydb.myschema.mydataset/versions/test/data_0_0_1.snappy.parquet"]

filesystem() → SnowFileSystem

Return an fsspec FileSystem which can be used to load the DatasetVersion’s files().
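A minimal sketch of how files() and filesystem() might be used together, assuming `reader` is a DatasetReader obtained elsewhere. The helper name `read_file_header` is hypothetical. It opens the first listed file through the fsspec-style filesystem and reads its leading bytes, which for a Parquet file are the magic bytes `PAR1`.

```python
def read_file_header(reader, num_bytes=4):
    """Return the first `num_bytes` bytes of the first file in the dataset version.

    `reader` is assumed to expose files() and filesystem() as documented above.
    """
    paths = reader.files()        # remote snow:// paths
    if not paths:
        raise ValueError("dataset version contains no files")
    fs = reader.filesystem()      # fsspec-compatible filesystem
    with fs.open(paths[0], "rb") as f:
        return f.read(num_bytes)
```

The same filesystem object can be handed to any library that accepts an fsspec filesystem, which avoids downloading files manually.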

classmethod from_dataframe(df: DataFrame, ingestor_class: Optional[type[DataIngestor]] = None, **kwargs: Any) → DatasetReader

classmethod from_dataset(ds: Dataset, ingestor_class: Optional[type[DataIngestor]] = None, **kwargs: Any) → DataConnectorType

classmethod from_ray_dataset(ray_ds: ray.data.Dataset, ingestor_class: Optional[type[DataIngestor]] = None, **kwargs: Any) → DataConnectorType

classmethod from_sources(session: Session, sources: Sequence[Union[DataFrameInfo, DatasetInfo, str]], ingestor_class: Optional[type[DataIngestor]] = None, **kwargs: Any) → DataConnectorType

classmethod from_sql(query: str, session: Optional[Session] = None, ingestor_class: Optional[type[DataIngestor]] = None, **kwargs: Any) → DataConnectorType

to_huggingface_dataset(*, streaming: bool = False, limit: Optional[int] = None, batch_size: int = 1024, shuffle: bool = False, drop_last_batch: bool = False) → Union[hf_datasets.Dataset, hf_datasets.IterableDataset]

Retrieve the Snowflake data as a HuggingFace Dataset.

Parameters:

streaming – If True, returns an IterableDataset that streams data in batches. If False (default), returns an in-memory Dataset.
limit – Maximum number of rows to load. If None, loads all rows.
batch_size – Size of batches for internal data retrieval.
shuffle – Whether to shuffle the data. If True, files will be shuffled and rows in each file will also be shuffled.
drop_last_batch – Whether to drop the last batch if it’s smaller than batch_size.

Returns:

A HuggingFace Dataset (in-memory) or IterableDataset (streaming).
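The two-level shuffle described for shuffle=True (files first, then rows within each file) can be pictured with the sketch below. It is an illustration in plain Python, not the library's implementation; `two_level_shuffle` and its list-of-lists input are hypothetical stand-ins for files and their rows.

```python
import random

def two_level_shuffle(files, seed=None):
    """Shuffle file order, then shuffle rows inside each file.

    `files` is a list of lists; each inner list stands in for one file's rows.
    """
    rng = random.Random(seed)
    shuffled_files = [list(rows) for rows in files]  # copy so inputs stay untouched
    rng.shuffle(shuffled_files)                      # level 1: file order
    for rows in shuffled_files:
        rng.shuffle(rows)                            # level 2: rows within a file
    # Flatten. Rows from one file stay adjacent in this sketch,
    # so the result is not a uniform shuffle over all rows.
    return [row for rows in shuffled_files for row in rows]
```

In this sketch rows from the same file remain neighbours after shuffling. If the real ingestion behaves similarly, data that is sorted before being written to files may still arrive partially ordered.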

to_pandas(limit: Optional[int] = None) → pd.DataFrame

Retrieve the Snowflake data as a Pandas DataFrame.

Parameters: limit – If specified, the maximum number of rows to load into the DataFrame.
Returns: A Pandas DataFrame.
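A small sketch of using the limit parameter to inspect a dataset before loading all of it. The helper name `preview` is hypothetical, and `reader` is assumed to be a DatasetReader.

```python
def preview(reader, n=5):
    """Load at most `n` rows via to_pandas(limit=n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return reader.to_pandas(limit=n)
```

Calling `preview(reader)` returns a DataFrame with at most five rows, which is usually enough to check column names and dtypes.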

to_ray_dataset() → ray.data.Dataset

Retrieve the Snowflake data as a Ray Dataset.

Returns: A Ray Dataset.
Raises: ImportError – If Ray is not installed in the local environment.
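Because to_ray_dataset() raises ImportError when Ray is missing, callers that want to run in both kinds of environment can guard the call. The sketch below is one possible pattern; the helper name `to_ray_or_pandas` is hypothetical.

```python
def to_ray_or_pandas(reader):
    """Prefer a Ray Dataset; fall back to pandas when Ray is not installed.

    Returns a (kind, data) tuple so the caller knows which type it received.
    """
    try:
        return "ray", reader.to_ray_dataset()
    except ImportError:
        # Documented failure mode when Ray is absent from the local environment.
        return "pandas", reader.to_pandas()
```

The pandas fallback loads everything into memory, so it is only a reasonable substitute for datasets that fit on one machine.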

to_snowpark_dataframe(only_feature_cols: bool = False) → DataFrame

Convert the DatasetVersion to a Snowpark DataFrame.

Parameters: only_feature_cols – If True, drops exclude_cols and label_cols from the returned DataFrame. The original DatasetVersion is unaffected.
Returns: A Snowpark DataFrame that contains the data of this DatasetVersion.

Note: The DataFrame generated by this method might not have the same schema as the original one. Specifically:

NUMBER columns with scale != 0 become float.
Unsupported types (see the comments of Dataset.create_version()) carry no guarantees. For example, an OBJECT column may be scanned back as a STRING column.
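One reason the NUMBER-to-float change can matter: fixed-point values compare exactly as decimals, but the same values may not compare equal once they are binary floats. The snippet below uses only the Python standard library and makes no Snowflake call; it illustrates the general floating-point effect, not any specific column.

```python
from decimal import Decimal

# Fixed-point arithmetic, as with a NUMBER column that has scale 2, is exact.
exact = Decimal("0.10") + Decimal("0.20") == Decimal("0.30")

# The same values as binary floats pick up rounding error.
approx = 0.10 + 0.20 == 0.30

print(exact, approx)  # True False
```

When downstream logic depends on exact values, comparing with a tolerance (for example `math.isclose`) is a safer choice than `==`.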

to_tf_dataset(*, batch_size: int, shuffle: bool = False, drop_last_batch: bool = True) → tf.data.Dataset

Transform the Snowflake data into a ready-to-use TensorFlow tf.data.Dataset.

Parameters:

batch_size – The size of each data batch yielded by the resulting dataset.
shuffle – Whether to shuffle the data. If True, files are shuffled, and rows within each file are also shuffled.
drop_last_batch – Whether to drop the last batch. If True, the last batch is dropped when it is smaller than batch_size.

Returns:

A tf.data.Dataset that yields batched tf.Tensors.
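The effect of drop_last_batch on the number of batches is simple arithmetic, sketched below in plain Python. The helper `num_batches` is hypothetical and not part of the library.

```python
import math

def num_batches(num_rows, batch_size, drop_last_batch=True):
    """Number of batches produced for `num_rows` rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last_batch:
        # A trailing partial batch is discarded.
        return num_rows // batch_size
    # A trailing partial batch is kept as a smaller final batch.
    return math.ceil(num_rows / batch_size)

print(num_batches(1000, 64, drop_last_batch=True))   # 15 (the last 40 rows are dropped)
print(num_batches(1000, 64, drop_last_batch=False))  # 16 (the final batch has 40 rows)
```

Per the signatures on this page, drop_last_batch defaults to True for to_tf_dataset, to_torch_datapipe, and to_torch_dataset, but to False for to_huggingface_dataset, so the same data can yield a different number of batches depending on the connector.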

to_torch_datapipe(*, batch_size: int, shuffle: bool = False, drop_last_batch: bool = True) → torch_data.IterDataPipe

Transform the Snowflake data into a ready-to-use PyTorch datapipe.

Return a PyTorch datapipe which iterates over rows of data.

Parameters:

batch_size – The size of each data batch yielded by the resulting datapipe.
shuffle – Whether to shuffle the data. If True, files are shuffled, and rows within each file are also shuffled.
drop_last_batch – Whether to drop the last batch. If True, the last batch is dropped when it is smaller than batch_size.

Returns:

A PyTorch iterable datapipe that yields data.

to_torch_dataset(*, batch_size: Optional[int] = None, shuffle: bool = False, drop_last_batch: bool = True) → torch_data.IterableDataset

Transform the Snowflake data into a PyTorch Iterable Dataset to be used with a DataLoader.

Return a PyTorch Dataset which iterates over rows of data.

Parameters:

batch_size – The size of each data batch yielded by the resulting dataset. Batching is pushed down to the data ingestion level, which may be more performant than DataLoader batching.
shuffle – Whether to shuffle the data. If True, files are shuffled, and rows within each file are also shuffled.
drop_last_batch – Whether to drop the last batch. If True, the last batch is dropped when it is smaller than batch_size.

Returns:

A PyTorch Iterable Dataset that yields data.
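Since the returned object is iterable, a few batches can be inspected without consuming the whole dataset. The sketch below assumes `reader` is a DatasetReader; the helper name `take_batches` is hypothetical, and the structure of each batch is not specified here.

```python
from itertools import islice

def take_batches(reader, batch_size, k=2):
    """Return the first `k` batches from to_torch_dataset() as a list."""
    dataset = reader.to_torch_dataset(batch_size=batch_size)
    # islice stops after k items, so the rest of the data is never pulled.
    return list(islice(dataset, k))
```

This can be handy for checking batch shapes and dtypes before wiring the dataset into a training loop.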

Attributes

data_sources