Engineer features

Snowflake ML lets you transform raw data into features that machine learning models can use efficiently. You can transform data using several approaches, each suited to different scales and requirements:

  • Open Source Software (OSS) preprocessors - For small to medium datasets and quick prototyping, use familiar Python ML libraries that run locally or on single nodes within Container Runtime.

  • Snowflake ML Preprocessors - For larger datasets, use Snowflake ML’s preprocessing APIs that execute natively on the Snowflake platform. These APIs distribute the processing across warehouse compute resources.

  • Ray map_batches - For highly customizable large-scale processing, especially with unstructured data, use parallel, resource-managed execution across single-node or multi-node Container Runtime environments.

Choose the approach that best matches your data size, performance requirements, and need for custom transformation logic.

The following table compares the three main approaches to feature engineering in Snowflake ML:

| Feature/Aspect | OSS (including scikit-learn) | Snowflake ML preprocessors | Ray map_batches |
| --- | --- | --- | --- |
| Scale | Small and medium datasets | Large/distributed data | Large/distributed data |
| Execution Environment | In memory | Pushed down to the default warehouse that you're using to run SQL queries | Across nodes in a compute pool |
| Compute Resources | Snowpark Container Services (Compute Pool) | Warehouse | Snowpark Container Services (Compute Pool) |
| Integration | Standard Python ML ecosystem | Integrates natively with Snowflake ML | Integrates with both the Python ML ecosystem and Snowflake |
| Performance | Fast for local, in-memory workloads; limited scale, not distributed | Designed for scalable, distributed feature engineering | Highly parallel and resource-managed; excels on large and unstructured data |
| Use Case Suitability | Quick prototyping and experimentation | Production workflows with large datasets | Large data workflows that require custom resource controls |

The following examples demonstrate how to implement feature transformations using each approach:

Use the following code to run scikit-learn preprocessing locally on a pandas DataFrame:

import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Load your data locally into a pandas DataFrame
df = pd.DataFrame({
    'age': [34, 23, 54, 31],
    'city': ['SF', 'NY', 'SF', 'LA'],
    'income': [120000, 95000, 135000, 99000]
})

# Define preprocessing steps
numeric_features = ['age', 'income']
numeric_transformer = StandardScaler()

categorical_features = ['city']
categorical_transformer = OneHotEncoder()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])

# Preprocess the data
X_processed = pipeline.fit_transform(df)
print(X_processed)
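
For larger datasets, the same transformations can run natively on Snowflake compute using Snowflake ML's preprocessing APIs. The following is a minimal sketch using StandardScaler and OneHotEncoder from snowflake.ml.modeling.preprocessing; the connection_parameters values and the CUSTOMER_FEATURES table are hypothetical placeholders for your own session configuration and source data:

from snowflake.snowpark import Session
from snowflake.ml.modeling.preprocessing import StandardScaler, OneHotEncoder

# Placeholder connection settings; replace with your own account details
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Hypothetical source table with AGE, INCOME, and CITY columns
df = session.table("CUSTOMER_FEATURES")

# Scale numeric columns; the computation is pushed down to the warehouse
scaler = StandardScaler(
    input_cols=["AGE", "INCOME"],
    output_cols=["AGE_SCALED", "INCOME_SCALED"],
)
df = scaler.fit(df).transform(df)

# One-hot encode the categorical column
encoder = OneHotEncoder(input_cols=["CITY"], output_cols=["CITY_OHE"])
df = encoder.fit(df).transform(df)

df.show()

Because these transformers operate on Snowpark DataFrames, the data is not pulled into client memory; the work is distributed across warehouse compute resources.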
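For custom large-scale processing in Container Runtime, Ray's map_batches applies a user-defined function to batches of a Ray dataset in parallel across the compute pool. The following is a minimal sketch; the in-memory dataset and the per-batch scaling statistics are illustrative only (a real pipeline would read from a Snowflake source and compute scaling statistics globally before transforming):

import pandas as pd
import ray

# In Container Runtime a Ray cluster is typically already running;
# this call is a safe no-op in that case
ray.init(ignore_reinit_error=True)

# Illustrative in-memory dataset standing in for a real data source
ds = ray.data.from_pandas(pd.DataFrame({
    "age": [34, 23, 54, 31],
    "city": ["SF", "NY", "SF", "LA"],
    "income": [120000, 95000, 135000, 99000],
}))

def scale_and_encode(batch: pd.DataFrame) -> pd.DataFrame:
    # Custom transformation logic applied to each batch in parallel;
    # per-batch statistics are for illustration, not true global scaling
    batch["income_scaled"] = (
        batch["income"] - batch["income"].mean()
    ) / batch["income"].std()
    return pd.get_dummies(batch, columns=["city"])

# Run the function over batches across the available workers
processed = ds.map_batches(scale_and_encode, batch_format="pandas")
print(processed.take_all())

Passing batch_format="pandas" delivers each batch as a pandas DataFrame, so existing OSS transformation code can be reused inside the mapped function.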