You are viewing documentation about an older version (1.7.0).

snowflake.ml.dataset.Dataset

class snowflake.ml.dataset.Dataset(session: Session, database: str, schema: str, name: str, selected_version: Optional[str] = None)

Bases: LineageNode

Represents a Snowflake Dataset which is organized into versions.

Initialize a lazily evaluated Dataset object.

Methods

static create(session: Session, name: str, exist_ok: bool = False) Dataset

Create a new Snowflake Dataset. DatasetVersions can be created from the Dataset object using Dataset.create_version() and loaded with Dataset.version().

Parameters:
  • session – Snowpark Session to interact with Snowflake backend.

  • name – Name of dataset to create. May optionally be a schema-level identifier.

  • exist_ok – If False, raises an exception if the specified Dataset already exists.

Returns:

Dataset object representing created dataset

Raises:
  • ValueError – name is not a valid Snowflake identifier

  • DatasetExistError – Specified Dataset already exists

  • DatasetError – Dataset creation failed
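As an illustration of the parameters above, the following is a minimal sketch of an idempotent "get or create" helper. The helper name get_or_create_dataset and its dataset_cls argument are hypothetical (the argument exists only so the sketch can be exercised without a live connection). It assumes that passing exist_ok=True suppresses DatasetExistError when the Dataset already exists.

```python
from typing import Any, Optional


def get_or_create_dataset(session: Any, name: str, dataset_cls: Optional[Any] = None) -> Any:
    """Return a Dataset handle, creating the Dataset if it is missing.

    `dataset_cls` is a hypothetical injection point for this sketch;
    when omitted, the real Dataset class is imported.
    """
    if dataset_cls is None:
        # Deferred import: requires the snowflake-ml-python package.
        from snowflake.ml.dataset import Dataset as dataset_cls
    # Assumption: exist_ok=True avoids the "already exists" exception.
    return dataset_cls.create(session, name, exist_ok=True)
```

With a real Snowpark Session, a call might look like get_or_create_dataset(session, "MY_DB.MY_SCHEMA.MY_DATASET"), where the identifier is a placeholder.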

create_version(version: str, input_dataframe: DataFrame, shuffle: bool = False, exclude_cols: Optional[List[str]] = None, label_cols: Optional[List[str]] = None, properties: Optional[FeatureStoreMetadata] = None, partition_by: Optional[str] = None, comment: Optional[str] = None) Dataset

Create a new version of the current Dataset.

The resulting Dataset object captures the query result deterministically as stage files.

Parameters:
  • version – Dataset version name. Data contents are materialized to the Dataset entity.

  • input_dataframe – A Snowpark DataFrame which yields the Dataset contents.

  • shuffle – A boolean indicating whether the data should be shuffled globally. Defaults to False.

  • exclude_cols – Name of column(s) in dataset to be excluded during training/testing (e.g. timestamp).

  • label_cols – Name of column(s) in dataset that contain labels.

  • properties – Custom metadata properties, saved under DatasetMetadata.properties

  • partition_by – Optional SQL expression to use as the partitioning scheme within the new Dataset version.

  • comment – A descriptive comment about this dataset.

Returns:

A Dataset object with the newly created version selected.

Raises:
  • SnowflakeMLException – The Dataset no longer exists.

  • SnowflakeMLException – The specified Dataset version already exists.

  • snowpark_exceptions.SnowparkSQLException – An error occurred during Dataset creation.
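A call to create_version() using the parameters listed above might look like the following sketch. The helper name snapshot_training_data, the column names EVENT_TS and LABEL, and the comment text are illustrative assumptions; ds is expected to be a Dataset and df a Snowpark DataFrame.

```python
from typing import Any


def snapshot_training_data(ds: Any, df: Any, version: str) -> Any:
    """Materialize `df` as a new version of `ds` and return the result.

    Column names below are placeholders for this sketch.
    """
    return ds.create_version(
        version=version,
        input_dataframe=df,
        shuffle=True,                # shuffle the data globally
        exclude_cols=["EVENT_TS"],   # kept in the data, excluded from training/testing
        label_cols=["LABEL"],        # column(s) that contain labels
        comment="Training snapshot",
    )
```

Per the Returns section, the value handed back is a Dataset with the newly created version selected.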

Note: During the generation of stage files, data casting will occur. The casting rules are as follows:
  • Data casting:
    • DecimalType(NUMBER):
      • If its scale is zero, cast to BIGINT

      • If its scale is non-zero, cast to FLOAT

    • DoubleType(DOUBLE): Cast to FLOAT.

    • ByteType(TINYINT): Cast to SMALLINT.

    • ShortType(SMALLINT): Cast to SMALLINT.

    • IntegerType(INT): Cast to INT.

    • LongType(BIGINT): Cast to BIGINT.

  • No action:
    • FloatType(FLOAT): No action.

    • StringType(STRING): No action.

    • BinaryType(BINARY): No action.

    • BooleanType(BOOLEAN): No action.

  • Not supported:
    • ArrayType(ARRAY): Not supported. A warning will be logged.

    • MapType(OBJECT): Not supported. A warning will be logged.

    • TimestampType(TIMESTAMP): Not supported. A warning will be logged.

    • TimeType(TIME): Not supported. A warning will be logged.

    • DateType(DATE): Not supported. A warning will be logged.

    • VariantType(VARIANT): Not supported. A warning will be logged.
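The casting rules above can be restated as a small lookup. This is only a sketch that mirrors the list; it is not the library's implementation. The function name stage_file_cast and the "NO_ACTION" / "NOT_SUPPORTED" result strings are hypothetical, and types are identified by their Snowpark class names as plain strings.

```python
# Types with a fixed cast target, per the rules listed above.
_FIXED_CASTS = {
    "DoubleType": "FLOAT",
    "ByteType": "SMALLINT",
    "ShortType": "SMALLINT",
    "IntegerType": "INT",
    "LongType": "BIGINT",
}
_NO_ACTION = frozenset({"FloatType", "StringType", "BinaryType", "BooleanType"})
_NOT_SUPPORTED = frozenset(
    {"ArrayType", "MapType", "TimestampType", "TimeType", "DateType", "VariantType"}
)


def stage_file_cast(type_name: str, scale: int = 0) -> str:
    """Return the documented cast outcome for a Snowpark type name.

    `scale` is only consulted for DecimalType.
    """
    if type_name == "DecimalType":
        # Zero scale is cast to BIGINT, non-zero scale to FLOAT.
        return "BIGINT" if scale == 0 else "FLOAT"
    if type_name in _FIXED_CASTS:
        return _FIXED_CASTS[type_name]
    if type_name in _NO_ACTION:
        return "NO_ACTION"
    if type_name in _NOT_SUPPORTED:
        # The documentation states that a warning is logged for these types.
        return "NOT_SUPPORTED"
    raise ValueError(f"Type not covered by the documented rules: {type_name}")
```

A lookup like this can be useful for checking a DataFrame schema before calling create_version(), for example to spot timestamp or variant columns that would only produce a warning.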

delete() None

Delete Dataset and all contained versions

delete_version(version_name: str) None

Delete the Dataset version

Parameters:

version_name – Name of version to delete from Dataset

Raises:

SnowflakeMLException – The DatasetVersion could not be deleted.

lineage(direction: Literal['upstream', 'downstream'] = 'downstream', domain_filter: Optional[Set[Literal['feature_view', 'dataset', 'model', 'table', 'view']]] = None) List[Union[FeatureView, Dataset, ModelVersion, LineageNode]]

Retrieves the lineage nodes connected to this node.

Parameters:
  • direction – The direction to trace lineage. Defaults to “downstream”.

  • domain_filter – Set of domains to filter nodes. Defaults to None.

Returns:

A list of connected lineage nodes.

Return type:

List[LineageNode]

This function or method is in private preview since 1.5.3.
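For illustration, a call that traces only upstream tables and views could be sketched as below. The helper name upstream_sources is hypothetical; the direction and domain_filter values are taken from the signature above. Because the method is in private preview, its behavior may change.

```python
from typing import Any, List


def upstream_sources(ds: Any) -> List[Any]:
    """Return lineage nodes upstream of `ds`, limited to tables and views."""
    return ds.lineage(direction="upstream", domain_filter={"table", "view"})
```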

list_versions(detailed: bool = False) Union[List[str], List[Row]]

Return list of versions
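list_versions() and delete_version() can be combined into a guarded delete, sketched below. The helper name delete_version_if_present is hypothetical, and the sketch assumes that list_versions() with the default detailed=False returns version names as plain strings that match the name passed in.

```python
from typing import Any


def delete_version_if_present(ds: Any, version_name: str) -> bool:
    """Delete `version_name` from `ds` if it is listed; report whether it was deleted."""
    # Assumption: the default (detailed=False) call yields a list of version name strings.
    if version_name in ds.list_versions():
        ds.delete_version(version_name)
        return True
    return False
```

Checking first avoids relying on the exception raised by delete_version() for a missing version, at the cost of one extra round trip.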

static load(session: Session, name: str) Dataset

Load an existing Snowflake Dataset. DatasetVersions can be created from the Dataset object using Dataset.create_version() and loaded with Dataset.version().

Parameters:
  • session – Snowpark Session to interact with Snowflake backend.

  • name – Name of dataset to load. May optionally be a schema-level identifier.

Returns:

Dataset object representing loaded dataset

Raises:
  • ValueError – name is not a valid Snowflake identifier

  • DatasetNotExistError – Specified Dataset does not exist

select_version(version: str) Dataset

Return a new Dataset instance with the specified version selected.

Parameters:

version – Dataset version name.

Returns:

Dataset object.
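Loading an existing Dataset and pinning it to a version chains load() and select_version(), as in this sketch. The helper name load_dataset_version and its dataset_cls argument are hypothetical (the argument exists only so the sketch can be exercised without a live connection).

```python
from typing import Any, Optional


def load_dataset_version(
    session: Any, name: str, version: str, dataset_cls: Optional[Any] = None
) -> Any:
    """Load the Dataset `name` and return a new instance with `version` selected."""
    if dataset_cls is None:
        # Deferred import: requires the snowflake-ml-python package.
        from snowflake.ml.dataset import Dataset as dataset_cls
    # select_version() returns a new Dataset instance rather than mutating the loaded one.
    return dataset_cls.load(session, name).select_version(version)
```

The selected version of the returned object is then available through the selected_version attribute listed under Attributes.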

Attributes

fully_qualified_name
read
selected_version