snowflake.ml.dataset.Dataset
- class snowflake.ml.dataset.Dataset(session: Session, database: str, schema: str, name: str, selected_version: Optional[str] = None)
- Bases: LineageNode
- Represents a Snowflake Dataset, which is organized into versions.
- Initialize a lazily evaluated Dataset object.
- Methods
- static create(session: Session, name: str, exist_ok: bool = False) → Dataset
- Create a new Snowflake Dataset. DatasetVersions can be created from the Dataset object using Dataset.create_version() and loaded with Dataset.select_version().
- Parameters:
- session – Snowpark Session to interact with Snowflake backend. 
- name – Name of dataset to create. May optionally be a schema-level identifier. 
- exist_ok – If False, raise an exception when the specified Dataset already exists.
 
- Returns:
- Dataset object representing created dataset 
- Raises:
- ValueError – name is not a valid Snowflake identifier 
- DatasetExistError – Specified Dataset already exists 
- DatasetError – Dataset creation failed 
 
 
- create_version(version: str, input_dataframe: DataFrame, shuffle: bool = False, exclude_cols: Optional[list[str]] = None, label_cols: Optional[list[str]] = None, properties: Optional[FeatureStoreMetadata] = None, partition_by: Optional[str] = None, comment: Optional[str] = None) → Dataset
- Create a new version of the current Dataset.
- The resulting Dataset object captures the query result deterministically as stage files.
- Parameters:
- version – Dataset version name. Data contents are materialized to the Dataset entity. 
- input_dataframe – A Snowpark DataFrame which yields the Dataset contents. 
- shuffle – Whether the data should be shuffled globally. Defaults to False.
- exclude_cols – Name of column(s) in dataset to be excluded during training/testing (e.g. timestamp).
- label_cols – Name of column(s) in dataset that contain labels.
- properties – Custom metadata properties, saved under DatasetMetadata.properties 
- partition_by – Optional SQL expression to use as the partitioning scheme within the new Dataset version. 
- comment – A descriptive comment about this dataset. 
 
- Returns:
- A Dataset object with the newly created version selected. 
- Raises:
- SnowflakeMLException – The Dataset no longer exists. 
- SnowflakeMLException – The specified Dataset version already exists. 
- snowpark_exceptions.SnowparkSQLException – An error occurred during Dataset creation. 
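Because create_version() raises when the version already exists, a pipeline that may be re-run can check list_versions() first. The following is a minimal sketch under stated assumptions, not part of the library: `ensure_version` is a hypothetical helper that uses only the list_versions(), select_version() and create_version() methods documented on this page, and `_FakeDataset` is a hypothetical in-memory stand-in so the sketch runs without a Snowflake connection.

```python
from typing import Any, Optional


def ensure_version(dataset: Any, version: str, input_dataframe: Any,
                   label_cols: Optional[list] = None) -> Any:
    """Return `dataset` with `version` selected, creating the version if needed.

    `dataset` is expected to behave like snowflake.ml.dataset.Dataset.
    """
    if version in dataset.list_versions():
        # The version is already materialized: select it instead of re-creating it.
        return dataset.select_version(version)
    return dataset.create_version(
        version=version,
        input_dataframe=input_dataframe,
        label_cols=label_cols,
    )


class _FakeDataset:
    """Hypothetical stand-in for Dataset, for demonstration only."""

    def __init__(self, versions=None, selected_version=None):
        self._versions = versions if versions is not None else {}
        self.selected_version = selected_version

    def list_versions(self, detailed=False):
        return list(self._versions)

    def select_version(self, version):
        return _FakeDataset(self._versions, version)

    def create_version(self, version, input_dataframe, **kwargs):
        if version in self._versions:
            raise RuntimeError(f"version {version!r} already exists")
        self._versions[version] = input_dataframe
        return _FakeDataset(self._versions, version)


ds = _FakeDataset()
first = ensure_version(ds, "v1", input_dataframe="df-placeholder", label_cols=["LABEL"])
second = ensure_version(ds, "v1", input_dataframe="df-placeholder")  # no error on re-run
print(first.selected_version, second.selected_version)
```

With a real Dataset, `input_dataframe` would be a Snowpark DataFrame rather than the placeholder string used here.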
 
- Note: During the generation of stage files, data casting will occur. The casting rules are as follows:
- Data casting:
  - DecimalType(NUMBER):
    - If its scale is zero, cast to BIGINT.
    - If its scale is non-zero, cast to FLOAT.
  - DoubleType(DOUBLE): Cast to FLOAT.
  - ByteType(TINYINT): Cast to SMALLINT.
  - ShortType(SMALLINT): Cast to SMALLINT.
  - IntegerType(INT): Cast to INT.
  - LongType(BIGINT): Cast to BIGINT.
- No action:
  - FloatType(FLOAT): No action.
  - StringType(String): No action.
  - BinaryType(BINARY): No action.
  - BooleanType(BOOLEAN): No action.
- Not supported:
  - ArrayType(ARRAY): Not supported. A warning will be logged.
  - MapType(OBJECT): Not supported. A warning will be logged.
  - TimestampType(TIMESTAMP): Not supported. A warning will be logged.
  - TimeType(TIME): Not supported. A warning will be logged.
  - DateType(DATE): Not supported. A warning will be logged.
  - VariantType(VARIANT): Not supported. A warning will be logged.
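The casting rules above can be summarized as a lookup. This is an illustrative sketch only, not the library's implementation: `stage_type` and the three tables are hypothetical names, and types are identified by their class-name strings so the sketch runs without Snowpark installed.

```python
from typing import Optional

# Types that the note says are cast to another SQL type.
_CAST_TARGETS = {
    "DoubleType": "FLOAT",
    "ByteType": "SMALLINT",
    "ShortType": "SMALLINT",
    "IntegerType": "INT",
    "LongType": "BIGINT",
}

# Types listed under "No action": written with their own SQL type.
_UNCHANGED = {
    "FloatType": "FLOAT",
    "StringType": "STRING",
    "BinaryType": "BINARY",
    "BooleanType": "BOOLEAN",
}

# Types listed under "Not supported" (a warning is logged for these).
_UNSUPPORTED = {
    "ArrayType", "MapType", "TimestampType",
    "TimeType", "DateType", "VariantType",
}


def stage_type(type_name: str, scale: int = 0) -> Optional[str]:
    """Return the SQL type a column ends up with, or None if not supported."""
    if type_name == "DecimalType":
        # NUMBER columns depend on scale: integers become BIGINT, others FLOAT.
        return "BIGINT" if scale == 0 else "FLOAT"
    if type_name in _UNSUPPORTED:
        return None
    if type_name in _CAST_TARGETS:
        return _CAST_TARGETS[type_name]
    if type_name in _UNCHANGED:
        return _UNCHANGED[type_name]
    raise ValueError(f"type not covered by the casting note: {type_name}")


# Hypothetical schema: (column name, Snowpark type name, decimal scale).
schema = [
    ("ID", "DecimalType", 0),
    ("PRICE", "DecimalType", 2),
    ("TS", "TimestampType", 0),
]
unsupported = [name for name, type_name, scale in schema
               if stage_type(type_name, scale) is None]
print(unsupported)
```

A check like this could be run against a DataFrame schema before calling create_version() to find columns that may need an explicit cast first.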
 
 
 
 
- delete() → None
- Delete the Dataset and all contained versions.
- delete_version(version_name: str) → None
- Delete the specified Dataset version.
- Parameters:
- version_name – Name of version to delete from Dataset 
- Raises:
- SnowflakeMLException – The DatasetVersion could not be deleted. 
 
- lineage(direction: Literal['upstream', 'downstream'] = 'downstream', domain_filter: Optional[set[Literal['feature_view', 'dataset', 'model', 'table', 'view']]] = None) → list[Union[FeatureView, Dataset, ModelVersion, LineageNode]]
- Retrieves the lineage nodes connected to this node.
- Parameters:
- direction – The direction to trace lineage. Defaults to “downstream”. 
- domain_filter – Set of domains to filter nodes. Defaults to None. 
 
- Returns:
- A list of connected lineage nodes. 
- Return type:
- List[LineageNode] 
 
- list_versions(detailed: bool = False) → Union[list[str], list[snowflake.snowpark.row.Row]]
- Return the list of versions in this Dataset: version names as strings by default, or Row objects when detailed is True.
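list_versions() and delete_version() can be combined to clean up versions that are no longer needed. The following is a minimal sketch, assuming list_versions() is called with its default detailed=False so that it yields version names; `prune_versions` is a hypothetical helper and `_FakeDataset` is a hypothetical stand-in so the sketch runs without a Snowflake connection.

```python
from typing import Any, Iterable


def prune_versions(dataset: Any, keep: Iterable[str]) -> list:
    """Delete every version whose name is not in `keep`; return the deleted names."""
    keep_set = set(keep)
    deleted = []
    # list_versions() returns a list, so deleting while looping over it is safe.
    for name in dataset.list_versions():
        if name not in keep_set:
            dataset.delete_version(name)
            deleted.append(name)
    return deleted


class _FakeDataset:
    """Hypothetical stand-in for Dataset, for demonstration only."""

    def __init__(self, versions):
        self._versions = list(versions)

    def list_versions(self, detailed=False):
        return list(self._versions)

    def delete_version(self, version_name):
        self._versions.remove(version_name)


ds = _FakeDataset(["v1", "v2", "v3"])
deleted = prune_versions(ds, keep=["v3"])
print(deleted, ds.list_versions())
```

Since delete_version() can raise SnowflakeMLException, real code might wrap the call and decide whether to continue or stop on the first failure.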
- static load(session: Session, name: str) → Dataset
- Load an existing Snowflake Dataset. DatasetVersions can be created from the Dataset object using Dataset.create_version() and loaded with Dataset.select_version().
- Parameters:
- session – Snowpark Session to interact with Snowflake backend. 
- name – Name of dataset to load. May optionally be a schema-level identifier. 
 
- Returns:
- Dataset object representing loaded dataset 
- Raises:
- ValueError – name is not a valid Snowflake identifier 
- DatasetNotExistError – Specified Dataset does not exist 
 
 
- select_version(version: str) → Dataset
- Return a new Dataset instance with the specified version selected.
- Parameters:
- version – Dataset version name. 
- Returns:
- Dataset object. 
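A common pattern is to chain load() and select_version() to get a handle on one specific version. The following is a hedged sketch: `open_version` and its `dataset_cls` parameter are hypothetical, introduced so the flow can be exercised with a stand-in class; when `dataset_cls` is omitted, the real class is imported, which requires the Snowflake ML package and a live Snowpark Session.

```python
from typing import Any, Optional


def open_version(session: Any, name: str, version: str,
                 dataset_cls: Optional[Any] = None) -> Any:
    """Load the Dataset `name` and return a new instance with `version` selected."""
    if dataset_cls is None:
        # Imported lazily so this module can be used without the package installed.
        from snowflake.ml.dataset import Dataset as dataset_cls
    # load() is static; select_version() returns a new instance.
    return dataset_cls.load(session, name).select_version(version)


class _FakeDataset:
    """Hypothetical stand-in for Dataset, for demonstration only."""

    def __init__(self, name, selected_version=None):
        self.fully_qualified_name = name
        self.selected_version = selected_version

    @staticmethod
    def load(session, name):
        return _FakeDataset(name)

    def select_version(self, version):
        return _FakeDataset(self.fully_qualified_name, version)


selected = open_version(
    session=None,  # placeholder; a real call needs a Snowpark Session
    name="MY_DB.MY_SCHEMA.MY_DATASET",  # hypothetical identifier
    version="v1",
    dataset_cls=_FakeDataset,
)
print(selected.fully_qualified_name, selected.selected_version)
```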
 
- Attributes
- fully_qualified_name
- read
- selected_version