snowflake.ml.dataset.Dataset
- class snowflake.ml.dataset.Dataset(session: Session, database: str, schema: str, name: str, selected_version: Optional[str] = None)
Bases: LineageNode
Represents a Snowflake Dataset which is organized into versions.
Initialize a lazily evaluated Dataset object.
Methods
- static create(session: Session, name: str, exist_ok: bool = False) -> Dataset
Create a new Snowflake Dataset. DatasetVersions can be created from the Dataset object using Dataset.create_version() and loaded with Dataset.version().
- Parameters:
session – Snowpark Session to interact with Snowflake backend.
name – Name of dataset to create. May optionally be a schema-level identifier.
exist_ok – If False, raises an exception if specified Dataset already exists
- Returns:
Dataset object representing created dataset
- Raises:
ValueError – name is not a valid Snowflake identifier
DatasetExistError – Specified Dataset already exists
DatasetError – Dataset creation failed
- create_version(version: str, input_dataframe: DataFrame, shuffle: bool = False, exclude_cols: Optional[List[str]] = None, label_cols: Optional[List[str]] = None, properties: Optional[FeatureStoreMetadata] = None, partition_by: Optional[str] = None, comment: Optional[str] = None) -> Dataset
Create a new version of the current Dataset.
The resulting Dataset object captures the query result deterministically as stage files.
- Parameters:
version – Dataset version name. Data contents are materialized to the Dataset entity.
input_dataframe – A Snowpark DataFrame which yields the Dataset contents.
shuffle – Whether the data should be shuffled globally. Defaults to False.
exclude_cols – Name of column(s) in dataset to be excluded during training/testing (e.g. timestamp).
label_cols – Name of column(s) in dataset that contain labels.
properties – Custom metadata properties, saved under DatasetMetadata.properties
partition_by – Optional SQL expression to use as the partitioning scheme within the new Dataset version.
comment – A descriptive comment about this dataset.
- Returns:
A Dataset object with the newly created version selected.
- Raises:
SnowflakeMLException – The Dataset no longer exists.
SnowflakeMLException – The specified Dataset version already exists.
snowpark_exceptions.SnowparkSQLException – An error occurred during Dataset creation.
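Because create_version() raises when the version already exists, a pipeline that may be re-run can check for the version first. The following is a minimal sketch under stated assumptions, not part of the library: the helper name ensure_version is hypothetical, dataset is assumed to be a Dataset object, input_dataframe a Snowpark DataFrame, and list_versions() with the default detailed=False is assumed to return version names as strings (as its signature suggests).

```python
def ensure_version(dataset, version, input_dataframe, label_cols=None):
    """Return `dataset` with `version` selected, creating the version if needed.

    Hypothetical helper built only from methods documented on this page:
    list_versions(), select_version(), and create_version().
    """
    # Assumption: list_versions() yields plain version-name strings by default.
    if version in dataset.list_versions():
        return dataset.select_version(version)
    # The version is not listed yet, so materialize the DataFrame into it.
    return dataset.create_version(
        version=version,
        input_dataframe=input_dataframe,
        label_cols=label_cols,
    )
```

The check and the creation are two separate calls, so a concurrent writer could still create the same version in between; callers that need stronger guarantees should also handle the "already exists" exception.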
- Note: During the generation of stage files, data casting will occur. The casting rules are as follows:
- Data casting:
DecimalType(NUMBER): If its scale is zero, cast to BIGINT. If its scale is non-zero, cast to FLOAT.
DoubleType(DOUBLE): Cast to FLOAT.
ByteType(TINYINT): Cast to SMALLINT.
ShortType(SMALLINT): Cast to SMALLINT.
IntegerType(INT): Cast to INT.
LongType(BIGINT): Cast to BIGINT.
- No action:
FloatType(FLOAT): No action.
StringType(STRING): No action.
BinaryType(BINARY): No action.
BooleanType(BOOLEAN): No action.
- Not supported:
ArrayType(ARRAY): Not supported. A warning will be logged.
MapType(OBJECT): Not supported. A warning will be logged.
TimestampType(TIMESTAMP): Not supported. A warning will be logged.
TimeType(TIME): Not supported. A warning will be logged.
DateType(DATE): Not supported. A warning will be logged.
VariantType(VARIANT): Not supported. A warning will be logged.
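The casting rules above can be transcribed into a small lookup, which is handy for checking a schema before calling create_version(). This is an illustrative sketch only: the names CAST_RULES, UNCHANGED, UNSUPPORTED, and expected_stage_type are hypothetical and not part of snowflake.ml, and the function works on type names as plain strings rather than on Snowpark type objects.

```python
# Transcribed from the casting note above: Snowpark type name -> stored type.
CAST_RULES = {
    "DoubleType": "FLOAT",
    "ByteType": "SMALLINT",
    "ShortType": "SMALLINT",
    "IntegerType": "INT",
    "LongType": "BIGINT",
}

# Types listed under "No action" keep their own type.
UNCHANGED = {
    "FloatType": "FLOAT",
    "StringType": "STRING",
    "BinaryType": "BINARY",
    "BooleanType": "BOOLEAN",
}

# Types listed under "Not supported" (a warning is logged for these).
UNSUPPORTED = {
    "ArrayType", "MapType", "TimestampType",
    "TimeType", "DateType", "VariantType",
}


def expected_stage_type(type_name, scale=0):
    """Return the stored type name, or None if the type is not supported.

    `scale` is only consulted for DecimalType.
    """
    if type_name == "DecimalType":
        return "BIGINT" if scale == 0 else "FLOAT"
    if type_name in CAST_RULES:
        return CAST_RULES[type_name]
    if type_name in UNCHANGED:
        return UNCHANGED[type_name]
    if type_name in UNSUPPORTED:
        return None
    raise KeyError(f"type not covered by the casting note: {type_name}")
```

For example, expected_stage_type("DecimalType", scale=2) follows the non-zero-scale rule, while a DateType column is reported as unsupported.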
- delete() -> None
Delete the Dataset and all versions it contains.
- delete_version(version_name: str) -> None
Delete the specified Dataset version.
- Parameters:
version_name – Name of version to delete from Dataset
- Raises:
SnowflakeMLException – The DatasetVersion could not be deleted.
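A common use of delete_version() is pruning versions that are no longer needed. The sketch below is an assumption-laden illustration: prune_versions is a hypothetical helper, dataset is assumed to be a Dataset object, and list_versions() with the default detailed=False is assumed to return version names as strings.

```python
def prune_versions(dataset, keep):
    """Delete every version whose name is not in `keep`.

    Returns the names of the deleted versions, in the order they were listed.
    Hypothetical helper using only list_versions() and delete_version().
    """
    keep = set(keep)
    deleted = []
    for name in dataset.list_versions():
        if name not in keep:
            dataset.delete_version(name)
            deleted.append(name)
    return deleted
```

Deletion is not reversible here, so it can be worth logging the returned names or running the loop in a dry-run mode first.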
- lineage(direction: Literal['upstream', 'downstream'] = 'downstream', domain_filter: Optional[Set[Literal['feature_view', 'dataset', 'model', 'table', 'view']]] = None) -> List[Union[FeatureView, Dataset, ModelVersion, LineageNode]]
Retrieves the lineage nodes connected to this node.
- Parameters:
direction – The direction to trace lineage. Defaults to “downstream”.
domain_filter – Set of domains to filter nodes. Defaults to None.
- Returns:
A list of connected lineage nodes.
- Return type:
List[LineageNode]
This method has been in private preview since 1.5.3.
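As an illustration of the two parameters, the sketch below asks for upstream nodes and restricts them to tables and views. The helper name upstream_sources is hypothetical, and dataset is assumed to be a Dataset object (or any LineageNode); because the method is in private preview, its behavior may change.

```python
def upstream_sources(dataset):
    """Return upstream lineage nodes, restricted to the table and view domains."""
    # Both values come from the Literal options in the lineage() signature.
    return dataset.lineage(
        direction="upstream",
        domain_filter={"table", "view"},
    )
```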
- list_versions(detailed: bool = False) -> Union[List[str], List[Row]]
Return the list of versions in this Dataset.
- static load(session: Session, name: str) -> Dataset
Load an existing Snowflake Dataset. DatasetVersions can be created from the Dataset object using Dataset.create_version() and loaded with Dataset.version().
- Parameters:
session – Snowpark Session to interact with Snowflake backend.
name – Name of dataset to load. May optionally be a schema-level identifier.
- Returns:
Dataset object representing loaded dataset
- Raises:
ValueError – name is not a valid Snowflake identifier
DatasetNotExistError – Specified Dataset does not exist
- select_version(version: str) -> Dataset
Return a new Dataset instance with the specified version selected.
- Parameters:
version – Dataset version name.
- Returns:
Dataset object.
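Dataset.load() and select_version() are typically chained to obtain a Dataset pinned to one version. A minimal sketch, assuming session is an existing Snowpark Session and that both the dataset and the version already exist; the helper name load_dataset_version is hypothetical.

```python
def load_dataset_version(session, name, version):
    """Load an existing Dataset and return it with `version` selected."""
    # Imported inside the function so that defining this helper does not
    # require snowflake-ml-python to be installed.
    from snowflake.ml.dataset import Dataset

    # load() raises if the Dataset does not exist; select_version() returns
    # a new Dataset instance rather than modifying the loaded one.
    return Dataset.load(session, name).select_version(version)
```

Since select_version() returns a new instance, the object returned by load() is left without a selected version and can be reused to select other versions.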
Attributes
- fully_qualified_name
- read
- selected_version