Data Ingestion API¶
Snowflake Datasinks¶
- class snowflake.ml.ray.datasink.table_data_sink.SnowflakeTableDatasink(*, table_name: str, database: str | None = None, schema: str | None = None, auto_create_table: bool = False, override: bool = False)¶
Bases:
Datasink
Snowflake Ray Datasink for writing Ray data blocks to a Snowflake table.
- Example::
>>> import pandas as pd
>>> import ray
>>> pandas_df = pd.DataFrame([(1, "Steve"), (2, "Bob")], columns=["id", "name"])
>>> ray_ds = ray.data.from_pandas(pandas_df)
>>> datasink = SnowflakeTableDatasink(table_name="TABLE", database="DB", schema="SCHEMA")
>>> ray_ds.write_datasink(datasink)
- Parameters:
table_name (str) – The name of the table.
database (str, optional) – The database name. Defaults to session’s current database.
schema (str, optional) – The schema name. Defaults to session's current schema.
auto_create_table (bool, optional) – Whether to automatically create the table if it does not exist. Defaults to False. If auto_create_table is False, the table must already exist in the database.
override (bool, optional) – Whether to overwrite the table if it already exists. Defaults to False. When override is False, data is appended to the table if it already exists.
Snowflake Stage Datasources¶
- class snowflake.ml.ray.datasource.stage_binary_file_datasource.SFStageBinaryFileDataSource(*, stage_location: str, database: str | None = None, schema: str | None = None, file_pattern: str | List[str] | None = None, local_path: str | None = None, include_paths: bool = False, **file_based_datasource_kwargs)¶
Bases:
BinaryDatasource
Snowflake stage binary file data source for reading binary files from a Snowflake stage.
- Parameters:
stage_location (str) – The location of the files in a stage. Example: “@mystage/path/to/files”
database (str, optional) – The database name. Defaults to session’s current database.
schema (str, optional) – The schema name. Defaults to session's current schema.
file_pattern (str | List[str], optional) – File patterns to filter in stage. Defaults to None.
local_path (str, optional) – The local path to save the files. Defaults to None. None means the file will not be saved to local disk.
include_paths (bool, optional) – Whether to include file paths in the output. If True, each row will be a dict with ‘bytes’ and ‘path’ keys. If False, each row will just be the file bytes.
**file_based_datasource_kwargs – Additional arguments for FileBasedDatasource
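As a minimal sketch (assuming an active Snowflake session, a stage named `@mystage`, and a hypothetical `*.bin` file pattern), a binary datasource might be read into a Ray dataset like this:

```python
import ray
from snowflake.ml.ray.datasource.stage_binary_file_datasource import (
    SFStageBinaryFileDataSource,
)

# Read raw bytes from matching files in the stage.
# include_paths=True yields rows as dicts with 'bytes' and 'path' keys,
# which is useful when downstream steps need to know the source file.
datasource = SFStageBinaryFileDataSource(
    stage_location="@mystage/path/to/files",
    file_pattern="*.bin",
    include_paths=True,
)
ds = ray.data.read_datasource(datasource)
```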
- class snowflake.ml.ray.datasource.stage_csv_file_datasource.SFStageCSVDataSource(*, stage_location: str, database: str | None = None, schema: str | None = None, file_pattern: str | List[str] | None = None, local_path: str | None = None, arrow_csv_args: Dict[str, Any] | None = None, **file_based_datasource_kwargs)¶
Bases:
CSVDatasource
Snowflake stage CSV file data source for reading CSV files from a Snowflake stage.
- Parameters:
stage_location (str) – The location of the files in a stage. Example: “@mystage/path/to/files”
database (str, optional) – The database name. Defaults to session’s current database.
schema (str, optional) – The schema name. Defaults to session’s current schema.
file_pattern (str | List[str], optional) – File patterns to filter in stage. Defaults to None.
local_path (str, optional) – The local path to save the files. Defaults to None. None means the file will not be saved to local disk.
arrow_csv_args (Dict[str, Any], optional) – The options to pass to the PyArrow CSV reader. For options, see https://arrow.apache.org/docs/python/generated/pyarrow.csv.read_csv.html
**file_based_datasource_kwargs – Additional arguments for FileBasedDatasource
- get_read_tasks(parallelism: int) List[ReadTask] ¶
Execute the read and return read tasks.
- Parameters:
parallelism – The requested read parallelism. The number of read tasks should equal this value if possible.
- Returns:
A list of read tasks that can be executed to read blocks from the datasource in parallel.
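As a sketch (assuming an active Snowflake session and CSV files in the stage), PyArrow reader options can be forwarded through `arrow_csv_args`; the pipe delimiter here is an illustrative assumption:

```python
import ray
import pyarrow.csv as pacsv
from snowflake.ml.ray.datasource.stage_csv_file_datasource import SFStageCSVDataSource

# arrow_csv_args is forwarded to pyarrow.csv.read_csv, so reader behavior
# such as the delimiter is configured via PyArrow option objects.
datasource = SFStageCSVDataSource(
    stage_location="@mystage/path/to/files",
    file_pattern="*.csv",
    arrow_csv_args={"parse_options": pacsv.ParseOptions(delimiter="|")},
)
ds = ray.data.read_datasource(datasource)
```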
- class snowflake.ml.ray.datasource.stage_parquet_file_datasource.SFStageParquetDataSource(*, stage_location: str, database: str | None = None, schema: str | None = None, file_pattern: str | List[str] | None = None, local_path: str | None = None)¶
Bases:
ParquetDatasource
Snowflake stage parquet file data source for reading parquet files from a Snowflake stage.
- Parameters:
stage_location (str) – The location of the files in a stage. Example: “@mystage/path/to/files”
database (str, optional) – The database name. Defaults to session’s current database.
schema (str, optional) – The schema name. Defaults to session's current schema.
file_pattern (str | List[str], optional) – File patterns to filter in stage. Defaults to None.
local_path (str, optional) – The local path to save the files. Defaults to None. None means the file will not be saved to local disk.
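A minimal usage sketch (assuming an active Snowflake session and Parquet files under the stage path):

```python
import ray
from snowflake.ml.ray.datasource.stage_parquet_file_datasource import (
    SFStageParquetDataSource,
)

# Read all Parquet files matching the pattern into a Ray dataset;
# the schema is inferred from the Parquet metadata.
datasource = SFStageParquetDataSource(
    stage_location="@mystage/path/to/files",
    file_pattern="*.parquet",
)
ds = ray.data.read_datasource(datasource)
```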
- class snowflake.ml.ray.datasource.stage_text_datasource.SFStageTextDataSource(*, stage_location: str, database: str | None = None, schema: str | None = None, file_pattern: str | List[str] | None = None, local_path: str | None = None, encoding: str = 'utf-8', drop_empty_lines: bool = True, **file_based_datasource_kwargs)¶
Bases:
TextDatasource
Snowflake stage text data source for reading text files from a Snowflake stage.
- Parameters:
stage_location (str) – The location of the files in a stage. Example: “@mystage/path/to/files”
database (str, optional) – The database name. Defaults to session’s current database.
schema (str, optional) – The schema name. Defaults to session's current schema.
file_pattern (str | List[str], optional) – File patterns to filter in stage. Defaults to None.
local_path (str, optional) – The local path to save the files. Defaults to None. None means the file will not be saved to local disk.
encoding (str, optional) – The encoding to read the text files. Defaults to “utf-8”.
drop_empty_lines (bool, optional) – Whether to drop empty lines in the text files. Defaults to True.
**file_based_datasource_kwargs – Additional arguments for FileBasedDatasource
- get_read_tasks(parallelism: int) List[ReadTask] ¶
Execute the read and return read tasks.
- Parameters:
parallelism – The requested read parallelism. The number of read tasks should equal this value if possible.
- Returns:
A list of read tasks that can be executed to read blocks from the datasource in parallel.
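As a sketch (assuming an active Snowflake session and text files in the stage), the encoding and empty-line handling can be set explicitly:

```python
import ray
from snowflake.ml.ray.datasource.stage_text_datasource import SFStageTextDataSource

# Each line of each matching file becomes one row of the dataset;
# drop_empty_lines=True filters out blank lines during the read.
datasource = SFStageTextDataSource(
    stage_location="@mystage/path/to/files",
    file_pattern="*.txt",
    encoding="utf-8",
    drop_empty_lines=True,
)
ds = ray.data.read_datasource(datasource)
```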
- class snowflake.ml.ray.datasource.stage_image_datasource.SFStageImageDataSource(*, stage_location: str, database: str | None = None, schema: str | None = None, file_pattern: str | List[str] | None = None, local_path: str | None = None, image_mode: str | None = None, image_size: Tuple[int, int] | None = None, include_paths: bool = False, **file_based_datasource_kwargs)¶
Bases:
ImageDatasource
Snowflake stage image data source for reading image files from a Snowflake stage.
- Parameters:
stage_location (str) – The location of the files in a stage. Example: “@mystage/path/to/files”
database (str, optional) – The database name. Defaults to session’s current database.
schema (str, optional) – The schema name. Defaults to session's current schema.
file_pattern (str | List[str], optional) – File patterns to filter in stage. Defaults to None.
local_path (str, optional) – The local path to save the files. Defaults to None. None means the file will not be saved to local disk.
image_size (Tuple[int, int], optional) – The size to resize the image files. Defaults to original size. Example: (224, 224)
image_mode (str, optional) – The Pillow image mode to convert images to. See https://pillow.readthedocs.io/en/stable/handbook/concepts.html Example: “RGB”, “RGBA”, “L” (grayscale)
include_paths (bool, optional) – Whether to include file paths in the output. If True, each row will be a dict with ‘image’ and ‘path’ keys. If False, each row will just be the image array.
**file_based_datasource_kwargs – Additional arguments for FileBasedDatasource
- get_read_tasks(parallelism: int) List[ReadTask] ¶
Execute the read and return read tasks.
- Parameters:
parallelism – The requested read parallelism. The number of read tasks should equal this value if possible.
- Returns:
A list of read tasks that can be executed to read blocks from the datasource in parallel.
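A usage sketch (assuming an active Snowflake session and JPEG files in the stage; the `(224, 224)` size and `"RGB"` mode are illustrative choices):

```python
import ray
from snowflake.ml.ray.datasource.stage_image_datasource import SFStageImageDataSource

# Decode images to RGB arrays resized to 224x224 (a common model input size);
# include_paths=True yields rows with 'image' and 'path' keys.
datasource = SFStageImageDataSource(
    stage_location="@mystage/path/to/files",
    file_pattern="*.jpg",
    image_mode="RGB",
    image_size=(224, 224),
    include_paths=True,
)
ds = ray.data.read_datasource(datasource)
```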