snowflake.ml.dataset.DatasetReader¶
- class snowflake.ml.dataset.DatasetReader(ingestor: DataIngestor, *, snowpark_session: Session)¶
Bases: DataConnector
Snowflake Dataset abstraction that provides application integration connectors.
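Example (a minimal usage sketch; assumes an active Snowpark session and an already-existing Dataset version, with "mydb.myschema.mydataset" and "v1" as hypothetical names):
>>> from snowflake.ml import dataset
>>> # Load an existing Dataset version; Dataset.read returns its DatasetReader.
>>> ds = dataset.load_dataset(session, "mydb.myschema.mydataset", "v1")
>>> reader = ds.read
>>> # Materialize the version's data, e.g. as a Pandas DataFrame.
>>> pdf = reader.to_pandas()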
Methods
- files() List[str] ¶
Get the list of remote file paths for the current DatasetVersion.
The file paths follow the snow:// protocol.
- Returns:
A list of remote file paths
Example:
>>> dsv.files()
["snow://dataset/mydb.myschema.mydataset/versions/test/data_0_0_0.snappy.parquet",
"snow://dataset/mydb.myschema.mydataset/versions/test/data_0_0_1.snappy.parquet"]
- filesystem() SnowFileSystem ¶
Return an fsspec FileSystem that can be used to load the DatasetVersion's files().
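Example (a sketch of reading one Parquet file through the returned filesystem; pyarrow is an assumed choice here, any Parquet reader that accepts a file object works):
>>> import pyarrow.parquet as pq
>>> fs = dsv.filesystem()
>>> # Open the first remote file listed by files() and parse it as Parquet.
>>> with fs.open(dsv.files()[0], "rb") as f:
...     table = pq.read_table(f)
>>> table.num_rows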
- classmethod from_dataframe(df: DataFrame, ingestor_class: Optional[Type[DataIngestor]] = None, **kwargs: Any) DatasetReader ¶
This method has been in private preview since version 1.6.0.
- classmethod from_dataset(ds: Dataset, ingestor_class: Optional[Type[DataIngestor]] = None, **kwargs: Any) DataConnectorType ¶
- classmethod from_sources(session: Session, sources: List[Union[DataFrameInfo, DatasetInfo, str]], ingestor_class: Optional[Type[DataIngestor]] = None, **kwargs: Any) DataConnectorType ¶
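Example (a hedged sketch of the from_dataset factory; that snowpark_session is forwarded through **kwargs to the constructor is an assumption, and the dataset names are hypothetical):
>>> from snowflake.ml import dataset
>>> from snowflake.ml.dataset import DatasetReader
>>> ds = dataset.load_dataset(session, "mydb.myschema.mydataset", "v1")
>>> reader = DatasetReader.from_dataset(ds, snowpark_session=session)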
- to_pandas(limit: Optional[int] = None) pd.DataFrame ¶
Retrieve the Snowflake data as a Pandas DataFrame.
- Parameters:
limit – If specified, the maximum number of rows to load into the DataFrame.
- Returns:
A Pandas DataFrame.
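Example (limit is useful for sampling large datasets without loading everything into memory):
>>> # Load only the first 1000 rows for quick inspection.
>>> pdf = reader.to_pandas(limit=1000)
>>> pdf.shape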
- to_snowpark_dataframe(only_feature_cols: bool = False) DataFrame ¶
Convert the DatasetVersion to a Snowpark DataFrame.
- Parameters:
only_feature_cols – If True, drops exclude_cols and label_cols from the returned DataFrame. The original DatasetVersion is unaffected.
- Returns:
A Snowpark dataframe that contains the data of this DatasetVersion.
- Note: The dataframe generated by this method might not have the same schema as the original one. Specifically:
- NUMBER type with scale != 0 will become float.
- Unsupported types (see comments of Dataset.create_version()) carry no round-trip guarantee. For example, an OBJECT column may be scanned back as a STRING column.
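Example (a sketch; only_feature_cols only has an effect if the DatasetVersion recorded label_cols or exclude_cols when it was created):
>>> sp_df = reader.to_snowpark_dataframe()
>>> sp_df.show()
>>> # Drop the recorded label/excluded columns from the result.
>>> features_df = reader.to_snowpark_dataframe(only_feature_cols=True)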
- to_tf_dataset(*, batch_size: int, shuffle: bool = False, drop_last_batch: bool = True) tf.data.Dataset ¶
Transform the Snowflake data into a ready-to-use TensorFlow tf.data.Dataset.
- Parameters:
batch_size – The size of each batch yielded by the resulting dataset.
shuffle – Whether the data will be shuffled. If True, the order of files is shuffled, and rows within each file are also shuffled.
drop_last_batch – Whether to drop the last batch if it contains fewer rows than batch_size.
- Returns:
A tf.data.Dataset that yields batched tf.Tensors.
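Example (a minimal sketch; that each element maps column names to batched tf.Tensors is an assumption consistent with the return description):
>>> tf_ds = reader.to_tf_dataset(batch_size=32, shuffle=True)
>>> for batch in tf_ds:
...     print({name: t.shape for name, t in batch.items()})
...     break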
- to_torch_datapipe(*, batch_size: int, shuffle: bool = False, drop_last_batch: bool = True) torch_data.IterDataPipe ¶
Transform the Snowflake data into a ready-to-use PyTorch datapipe.
Return a PyTorch datapipe that iterates over rows of data.
- Parameters:
batch_size – The size of each batch yielded by the resulting datapipe.
shuffle – Whether the data will be shuffled. If True, the order of files is shuffled, and rows within each file are also shuffled.
drop_last_batch – Whether to drop the last batch if it contains fewer rows than batch_size.
- Returns:
A PyTorch iterable datapipe that yields batched data.
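Example (a sketch; batch_size=None is passed to the DataLoader because batching already happens inside the datapipe):
>>> from torch.utils.data import DataLoader
>>> dp = reader.to_torch_datapipe(batch_size=32, shuffle=True)
>>> loader = DataLoader(dp, batch_size=None)
>>> for batch in loader:
...     break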
- to_torch_dataset(*, shuffle: bool = False) torch_data.IterableDataset ¶
Transform the Snowflake data into a PyTorch Iterable Dataset to be used with a DataLoader.
Return a PyTorch Dataset that iterates over rows of data.
- Parameters:
shuffle – Whether the data will be shuffled. If True, the order of files is shuffled, and rows within each file are also shuffled.
- Returns:
A PyTorch Iterable Dataset that yields data.
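Example (a sketch; shuffling is requested on the dataset itself because an IterableDataset does not support DataLoader-side shuffling):
>>> from torch.utils.data import DataLoader
>>> torch_ds = reader.to_torch_dataset(shuffle=True)
>>> # Batching is done by the DataLoader; the dataset yields individual rows.
>>> loader = DataLoader(torch_ds, batch_size=32)
>>> for batch in loader:
...     break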
Attributes
- data_sources¶