Advanced feature engineering¶
This page covers advanced feature patterns you’ll use when moving from basic feature sets to production-grade ML systems.
Overview of feature patterns¶
The following patterns describe the different ways you can define, compute, and serve features in Snowflake Feature Store. These aren’t mutually exclusive categories. A single feature view can combine multiple patterns. For example, a managed feature view can also be served online with time-windowed aggregations. Think of each pattern as a capability you can layer onto your feature views as your requirements evolve:
| Pattern | Description | Online retrieval | How you build it | Examples |
|---|---|---|---|---|
| External | Defined and refreshed outside Feature Store (often static or slow-changing). | Yes | Feature View with externally maintained table or view, no refresh_freq. | Account tier, signup channel |
| Managed | Feature Store computes and refreshes on a schedule. | Yes | Feature View with refresh_freq specified. | Daily engagement score, hourly KPIs |
| Online | Low-latency “latest values” lookup for inference. | Yes | Online feature store or table synced from Feature Views. | Real-time churn scoring, fraud scoring |
| Time-windowed | Trailing-window aggregates over recent history. | Yes | Feature View using the Aggregations API with tiling (feature_granularity, refresh_freq). | Spend 7d, orders 30d, last N items |
| Rollup | Aggregates features from a lower-level entity to a higher-level entity through a mapping. | No | Rollup Feature View from a source Feature View and a mapping DataFrame. | Visitor to subscriber, card to account |
| Iceberg | Open-format features stored as Dynamic Iceberg Tables for cross-engine interoperability. | No | Feature View with StorageConfig pointing to an external volume with StorageFormat.ICEBERG. | Features consumed by Spark/Trino, data lake integration |
| Stream (Public Preview) | Real-time event ingestion with near-zero latency feature updates. | Yes | Feature View with StreamSource and StreamConfig for continuous ingestion. | Live clickstream signals, real-time transaction features |
| Real-time (Public Preview) | On-demand features computed at read time from upstream feature views and per-request inputs. | Yes | Feature View with RealtimeConfig, compute_fn, and RequestSource. | Weighted balance, currency conversion, derived scores |
Online features¶
Online feature serving provides low-latency feature retrieval for real-time inference. It isn’t a separate pattern but a serving configuration you can layer on top of most other patterns. Enabling online serving synchronizes the latest feature values keyed by entity so applications can fetch features in milliseconds rather than running warehouse queries.
Enabling online retrieval doesn’t change how features are computed for offline datasets. It changes where and how feature values are stored for serving, and how often the online store is synchronized with the offline store.
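As a conceptual illustration only (plain Python, not the Feature Store API), the "latest values keyed by entity" idea can be sketched as a reduction from offline history to a key-value map. The row layout, the `sync_latest` helper, and the sample values are assumptions made for this example.

```python
from datetime import datetime

# Offline history: (entity_key, feature_timestamp, feature_value) rows.
# The data is made up for illustration.
offline_rows = [
    ("user_1", datetime(2024, 1, 1, 9), 0.20),
    ("user_1", datetime(2024, 1, 2, 9), 0.35),
    ("user_2", datetime(2024, 1, 1, 9), 0.80),
]

def sync_latest(rows):
    """Keep only the most recent value per entity key (hypothetical helper)."""
    latest = {}
    for key, ts, value in rows:
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    return {key: value for key, (ts, value) in latest.items()}

# The "online store" here is just a dict that supports point lookups by key.
online_store = sync_latest(offline_rows)
print(online_store["user_1"])  # 0.35
```

The offline rows are unchanged by the sync; only the serving copy is reduced to one value per entity.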
For end-to-end instructions on creating online feature tables using hybrid tables (GA), see Create and serve online features. For ultra-low-latency retrieval backed by Snowflake Postgres, see the Online Feature Store (Preview).
Time-windowed aggregation features¶
Note
Time-windowed aggregation requires snowflake-ml-python version 1.24.0 or later.
Time-windowed aggregation computes rolling metrics over recent history, such as “spend in the last 7 days” or “number of sessions in the last 30 days.” Use this pattern when your model needs features that summarize recent behavior within a trailing time horizon and must stay fresh as new events arrive.
With time-windowed aggregations you can:
- Define multiple windows (for example, 1h, 24h, 7d, 30d) over the same event stream once and reuse them across many models.
- Generate training datasets that are point-in-time correct, so each training row only uses data that would have been available as of the label or event timestamp.
- Reduce compute cost by incrementally maintaining partial aggregates (tiles) instead of repeatedly scanning raw events.
Define time-windowed features¶
Use the Feature class to define aggregate features in the FeatureView definition:
| Parameter | Description |
|---|---|
| features | List of Feature objects defining the aggregation logic. |
| feature_granularity | The tile size: how frequently aggregation tiles are computed (for example, "1h"). |
| timestamp_col | The column used for time-indexing. |
Supported aggregation functions:
- Feature.sum(column, window): Sum over a time window
- Feature.count(column, window): Count over a time window
- Feature.avg(column, window): Average over a time window
- Feature.last_n(column, window, n): Last N values in a time window
- Feature.approx_count_distinct(column, window): Approximate distinct count over a time window
The following example defines entities, aggregation features, and creates a tiled feature view:
To make computation scalable, the Feature Store maintains intermediate results at
a fixed feature_granularity interval (often hourly or daily). These intermediate
results are refreshed on the refresh_freq schedule, then stitched together at
query time to produce “last 7d”, “last 30d”, and similar windows.
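The tiling idea described above can be sketched in plain Python. This is a conceptual model, not the Feature Store implementation: the helper names (`tile_start`, `build_tiles`, `window_sum`), the half-open window boundary, and the sample events are all assumptions chosen for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

GRANULARITY = timedelta(hours=1)  # stands in for feature_granularity="1h"
EPOCH = datetime(1970, 1, 1)

def tile_start(ts):
    """Floor a timestamp to the start of the tile that contains it."""
    return ts - ((ts - EPOCH) % GRANULARITY)

def build_tiles(events):
    """Aggregate raw (timestamp, amount) events into per-tile partial sums."""
    tiles = defaultdict(float)
    for ts, amount in events:
        tiles[tile_start(ts)] += amount
    return dict(tiles)

def window_sum(tiles, as_of, window):
    """Stitch the tiles covering [as_of - window, as_of) into a single sum."""
    end = tile_start(as_of)
    start = end - window
    return sum(value for start_ts, value in tiles.items() if start <= start_ts < end)

events = [
    (datetime(2024, 1, 1, 10, 15), 5.0),
    (datetime(2024, 1, 1, 10, 45), 7.0),
    (datetime(2024, 1, 1, 11, 30), 3.0),
    (datetime(2024, 1, 2, 9, 5), 4.0),
]
tiles = build_tiles(events)  # three hourly tiles instead of four raw events
as_of = datetime(2024, 1, 2, 10, 0)
print(window_sum(tiles, as_of, timedelta(hours=24)))  # 19.0
print(window_sum(tiles, as_of, timedelta(hours=1)))   # 4.0
```

Both windows are answered from the same tiles, which is why several window lengths can share one set of intermediate results.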
Generate a training set with tiled features¶
When generating a training set that includes tiled feature views, you must pass
join_method="cte" to generate_training_set:
Using window offset¶
The offset parameter shifts the lookback window into the past, which is the
standard way to build comparative features such as week-over-week or month-over-month
trends. For example, a 7-day spend feature with offset="7d" returns the previous
7-day period relative to the current tile boundary. You can pair this with the current
window to capture momentum or change over time.
The offset must be a multiple of feature_granularity so the shifted window
aligns cleanly to tile boundaries.
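To make the offset behavior concrete, here is a minimal plain-Python sketch over daily tiles. It does not call the Feature Store; the `window_sum` helper, the half-open window convention, and the tile values are assumptions for illustration.

```python
from datetime import date, timedelta

# Daily partial sums, one per "1d" tile: Jan 1 -> 1.0, Jan 2 -> 2.0, ..., Jan 14 -> 14.0.
daily_tiles = {date(2024, 1, 1) + timedelta(days=i): float(i + 1) for i in range(14)}

def window_sum(tiles, as_of, window_days, offset_days=0):
    """Sum tiles in [as_of - offset - window, as_of - offset)."""
    end = as_of - timedelta(days=offset_days)
    start = end - timedelta(days=window_days)
    return sum(value for day, value in tiles.items() if start <= day < end)

as_of = date(2024, 1, 15)
current_7d = window_sum(daily_tiles, as_of, window_days=7)                  # Jan 8..14
previous_7d = window_sum(daily_tiles, as_of, window_days=7, offset_days=7)  # Jan 1..7
print(current_7d, previous_7d, current_7d - previous_7d)  # 77.0 28.0 49.0
```

Because both the window and the offset are whole numbers of days, each window covers complete tiles and no tile has to be split.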
Transformations alongside aggregation¶
In many pipelines, raw events need preparation before they can be aggregated. If you
provide both feature_df and features in a FeatureView, the Feature Store
applies them in a clear order: the feature_df transformation runs first to define
and prepare the base dataset, including any joins, filters, or derived columns. The
declarative Feature aggregations specified in features are then computed on top
of that resulting dataset.
For example, suppose you have raw event data where an EVENT_JSON column contains
nested attributes that must be parsed before aggregation. You can use SQL in
feature_df to extract structured fields, then apply time-windowed aggregations
using features:
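As a conceptual stand-in for that two-stage flow, the sketch below uses plain Python rather than SQL or the Feature Store API. The column names USER_ID and EVENT_TS, the JSON payload shape, and both helper functions are hypothetical.

```python
import json
from datetime import datetime, timedelta

raw_events = [
    {"USER_ID": "u1", "EVENT_TS": "2024-01-01T10:00:00",
     "EVENT_JSON": '{"type": "purchase", "amount": 20.0}'},
    {"USER_ID": "u1", "EVENT_TS": "2024-01-01T11:00:00",
     "EVENT_JSON": '{"type": "view"}'},
    {"USER_ID": "u1", "EVENT_TS": "2024-01-03T09:00:00",
     "EVENT_JSON": '{"type": "purchase", "amount": 5.5}'},
    {"USER_ID": "u2", "EVENT_TS": "2024-01-03T12:00:00",
     "EVENT_JSON": '{"type": "purchase", "amount": 9.0}'},
]

def prepare(rows):
    """Stage 1 (the feature_df role): parse JSON, filter, derive columns."""
    prepared = []
    for row in rows:
        payload = json.loads(row["EVENT_JSON"])
        if payload.get("type") != "purchase":
            continue  # keep only purchase events
        prepared.append({
            "USER_ID": row["USER_ID"],
            "EVENT_TS": datetime.fromisoformat(row["EVENT_TS"]),
            "AMOUNT": float(payload["amount"]),
        })
    return prepared

def sum_in_window(rows, user_id, as_of, window):
    """Stage 2 (the features role): windowed sum over the prepared rows."""
    start = as_of - window
    return sum(r["AMOUNT"] for r in rows
               if r["USER_ID"] == user_id and start <= r["EVENT_TS"] < as_of)

prepared = prepare(raw_events)
as_of = datetime(2024, 1, 4)
print(sum_in_window(prepared, "u1", as_of, timedelta(days=7)))  # 25.5
print(sum_in_window(prepared, "u1", as_of, timedelta(days=1)))  # 5.5
```

The aggregation stage never sees the raw JSON; it only works on the structured columns produced by the preparation stage.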
Best practices for granularity and refresh¶
Choosing feature_granularity and refresh_freq is a trade-off between time
precision, freshness, and operational cost:
- Match granularity to signal velocity. Hourly granularity is a good default for clickstream or transactional activity where recency matters. Daily granularity is often sufficient for slower-moving signals such as account-level properties.
- Align windows and offsets to the tile size. Window lengths should be a whole multiple of feature_granularity (for example, "24h" with "1h" tiles, or "28d" with "1d" tiles) so the approximation error margin stays consistent over time.
- Set refresh_freq to the slowest cadence that meets your freshness needs. Refreshing more frequently than new data arrives rarely improves feature quality but does increase compute. In production, it’s common to standardize on a small set of granularity and refresh combinations (for example, hourly and daily) to keep cost predictable.
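A quick way to check the alignment guidance before registering a feature view is a small helper like the one below. It is a hypothetical utility, not part of snowflake-ml-python, and it assumes simple duration strings of the form `<integer><s|m|h|d>`; the formats the library actually accepts may differ.

```python
import re
from datetime import timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}

def parse_duration(text):
    """Parse strings such as '1h' or '28d' into a timedelta (assumed format)."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})

def is_aligned(duration, granularity):
    """True when the window or offset is a whole multiple of the tile size."""
    return parse_duration(duration) % parse_duration(granularity) == timedelta(0)

print(is_aligned("24h", "1h"))  # True
print(is_aligned("28d", "1d"))  # True
print(is_aligned("90m", "1h"))  # False
```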
Rollup aggregation features¶
Note
Rollup aggregation requires snowflake-ml-python version 1.26.0 or later.
Rollup aggregation lets you derive higher-level features from existing lower-level feature views without reprocessing raw events. Use this pattern whenever your model operates at a coarser granularity than your source features, such as rolling product-level metrics up to categories, user-level signals up to cohorts, or transaction-level features up to merchants.
In Snowflake Feature Store, a rollup Feature View is defined from two inputs:
- A registered source Feature View at the lower-level entity.
- A mapping dataset that maps lower-level keys to higher-level keys.
The Feature Store applies the mapping and aggregates the source feature values to produce features keyed by the higher-level entity.
Example: Product to category rollup¶
Assume you already compute product-level features (one row per PRODUCT_ID), and
you want category-level features (one row per CATEGORY_ID) by rolling up all
products in the category.
Source Feature View output (PRODUCT_ID level):
The following shows example output from a registered source Feature View
PRODUCT_SALES_FV:
| PRODUCT_ID | UNITS_SOLD_30D | REVENUE_30D |
|---|---|---|
| P101 | 120 | 2400.00 |
| P102 | 35 | 700.00 |
| P201 | 80 | 1600.00 |
Mapping table (PRODUCT_ID to CATEGORY_ID):
| PRODUCT_ID | CATEGORY_ID |
|---|---|
| P101 | CAT10 |
| P102 | CAT10 |
| P201 | CAT20 |
To create the category-level rollup, provide the source Feature View and a mapping
DataFrame, then register a new Feature View keyed by CATEGORY_ID:
This gives you category-level features that are consistent with the product-level definitions and reusable for models that operate at the category level (for example, category demand forecasting).
Rolled-up result (CATEGORY_ID level):
| CATEGORY_ID | UNITS_SOLD_30D_SUM | REVENUE_30D_SUM | PRODUCT_COUNT |
|---|---|---|---|
| CAT10 | 155 | 3100.00 | 2 |
| CAT20 | 80 | 1600.00 | 1 |
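The rolled-up values in the table can be reproduced with a short plain-Python sketch. It models the mapping-then-aggregate logic only and does not use the Feature Store API; the `rollup` function and the choice to skip unmapped products are assumptions for this example.

```python
from collections import defaultdict

# Source features keyed by PRODUCT_ID (values from the example table).
product_features = {
    "P101": {"UNITS_SOLD_30D": 120, "REVENUE_30D": 2400.00},
    "P102": {"UNITS_SOLD_30D": 35, "REVENUE_30D": 700.00},
    "P201": {"UNITS_SOLD_30D": 80, "REVENUE_30D": 1600.00},
}

# Mapping from the lower-level key to the higher-level key.
product_to_category = {"P101": "CAT10", "P102": "CAT10", "P201": "CAT20"}

def rollup(features, mapping):
    """Sum product-level features per category and count contributing products."""
    out = defaultdict(lambda: {"UNITS_SOLD_30D_SUM": 0,
                               "REVENUE_30D_SUM": 0.0,
                               "PRODUCT_COUNT": 0})
    for product_id, values in features.items():
        category = mapping.get(product_id)
        if category is None:
            continue  # assumption: products without a mapping are skipped
        row = out[category]
        row["UNITS_SOLD_30D_SUM"] += values["UNITS_SOLD_30D"]
        row["REVENUE_30D_SUM"] += values["REVENUE_30D"]
        row["PRODUCT_COUNT"] += 1
    return dict(out)

category_features = rollup(product_features, product_to_category)
print(category_features["CAT10"])
# {'UNITS_SOLD_30D_SUM': 155, 'REVENUE_30D_SUM': 3100.0, 'PRODUCT_COUNT': 2}
```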
Once registered, a rollup Feature View is consumed like any other Feature View.
You join it to a spine using the target entity key (CATEGORY_ID in this example).
Downstream users don’t need to know whether features came from raw events or from a
rollup. They simply request features from the Feature View they need.
Feature column prefixing for disambiguation¶
When generating datasets from multiple feature views, column name collisions can
occur if different feature views contain features with identical names (for example,
COUNT_7D). Snowflake provides two ways to disambiguate column names.
Option 1: Auto-prefix
Use auto_prefix=True to automatically prefix all feature columns with
{FV_NAME}_{VERSION}_, which guarantees uniqueness when multiple Feature Views
contain the same feature names.
Option 2: Custom names
Use .with_name() to assign readable custom prefixes to specific feature views.
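The collision and the two naming strategies can be illustrated with plain Python dictionaries. This sketch does not call the Feature Store: `prefixed_columns` is a hypothetical helper, the feature view names and values are invented, and the real library's exact casing and separators may differ from what is shown beyond the {FV_NAME}_{VERSION}_ pattern described above.

```python
def prefixed_columns(name, version, columns, custom_name=None):
    """Prefix feature columns with a custom name, or with NAME_VERSION by default."""
    prefix = custom_name if custom_name is not None else f"{name}_{version}"
    return {f"{prefix}_{col}": value for col, value in columns.items()}

orders = ("ORDERS_FV", "V1", {"COUNT_7D": 3})
sessions = ("SESSIONS_FV", "V1", {"COUNT_7D": 12})

# Without prefixes, the second COUNT_7D silently overwrites the first.
naive = {}
naive.update(orders[2])
naive.update(sessions[2])
print(naive)  # {'COUNT_7D': 12}

# Auto-prefix style: every column is unique.
auto = {**prefixed_columns(*orders), **prefixed_columns(*sessions)}
print(sorted(auto))  # ['ORDERS_FV_V1_COUNT_7D', 'SESSIONS_FV_V1_COUNT_7D']

# Custom-name style: shorter, readable prefixes chosen by the caller.
custom = {**prefixed_columns(*orders, custom_name="ORDERS"),
          **prefixed_columns(*sessions, custom_name="SESSIONS")}
print(sorted(custom))  # ['ORDERS_COUNT_7D', 'SESSIONS_COUNT_7D']
```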
Stream feature views¶
Public Preview
This feature is in public preview.
Stream feature views provide continuous, near-real-time feature updates from live event
streams. Use this pattern when your model needs features that reflect the very latest
events, with end-to-end freshness of less than 2 seconds, such as live clickstream signals
or real-time transaction features. Stream feature views use a StreamSource and
StreamConfig to define transformation logic and historical backfill data, and can be
combined with time-windowed aggregation to compute rolling metrics that update
continuously as new events arrive.
For more details, including how to register a stream source, create a stream feature view, and combine streaming with time-windowed aggregation, see Online Feature Store (Preview).
Real-time feature views¶
Public Preview
This feature is in public preview.
Real-time feature views evaluate a Python function during each query to produce features that can’t be precomputed. Typical uses include incorporating per-request inputs such as a transaction amount or device fingerprint, deriving new values by combining upstream feature views (for example, computing a z-score from a stored mean and standard deviation), and applying last-mile transformations such as filling nulls or converting units before the data reaches your model.
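The z-score case can be sketched as an ordinary Python function. The function signature, the dictionary inputs, and the feature names below are hypothetical and are not the actual compute_fn contract; they only illustrate combining a per-request value with stored upstream features at read time.

```python
def compute_amount_zscore(request, upstream):
    """Combine a per-request amount with stored mean/std features (hypothetical shape)."""
    mean = upstream.get("AMOUNT_MEAN_30D")
    std = upstream.get("AMOUNT_STD_30D")
    # Last-mile handling: fall back to 0.0 when statistics are missing or degenerate.
    if mean is None or std is None or std == 0:
        return {"AMOUNT_ZSCORE": 0.0}
    return {"AMOUNT_ZSCORE": (request["amount"] - mean) / std}

stored = {"AMOUNT_MEAN_30D": 100.0, "AMOUNT_STD_30D": 50.0}
print(compute_amount_zscore({"amount": 250.0}, stored))  # {'AMOUNT_ZSCORE': 3.0}
print(compute_amount_zscore({"amount": 250.0}, {}))      # {'AMOUNT_ZSCORE': 0.0}
```

The result depends on the request amount, which is only known at query time, so it can't be materialized ahead of the request.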
For more details on how to use real-time feature views, see Online Feature Store (Preview).
Append-only batch feature view¶
Note
Requires snowflake-ml-python version 1.41 or later.
Append-only batch feature views preserve a complete history of feature snapshots for point-in-time correct training. Use this pattern when your model training requires knowing exactly what feature values looked like at a past moment.
Both standard and append-only batch feature views produce point-in-time correct training data. The difference is how much history they retain:
- Standard batch feature views keep only the latest values. Each refresh overwrites the previous snapshot, so training joins are always against the most recent version.
- Append-only batch feature views retain every version by appending the current feature values alongside a timestamp on each refresh, building up a full history of how features changed over time.
This deeper history lets the Feature Store join the feature values that were current as of each row’s timestamp in your training spine, which is important when feature drift matters and you need to reconstruct what the model would have seen at any point in the past.
How it works¶
When you set append_only=True on a FeatureView, each scheduled refresh
appends the current feature values to a persistent snapshot table managed by
the Feature Store. Over time, this table accumulates a time series of feature
snapshots. This parameter requires timestamp_col and a cron expression for
refresh_freq.
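The difference between overwriting and appending on refresh can be modeled in a few lines of plain Python. This is a conceptual sketch, not the Feature Store's storage logic; both refresh helpers and the sample values are assumptions.

```python
from datetime import datetime

def refresh_standard(table, current_values):
    """Standard behavior (modeled): each refresh replaces the stored values."""
    table.clear()
    table.update(current_values)

def refresh_append_only(snapshots, current_values, refreshed_at):
    """Append-only behavior (modeled): each refresh adds timestamped rows."""
    for key, value in current_values.items():
        snapshots.append((key, refreshed_at, value))

standard_table = {}
snapshot_table = []

# Two scheduled refreshes with feature values that changed in between.
refreshes = [
    (datetime(2024, 1, 1), {"u1": 0.20, "u2": 0.80}),
    (datetime(2024, 1, 2), {"u1": 0.35, "u2": 0.80}),
]
for refreshed_at, values in refreshes:
    refresh_standard(standard_table, values)
    refresh_append_only(snapshot_table, values, refreshed_at)

print(standard_table)       # {'u1': 0.35, 'u2': 0.8}
print(len(snapshot_table))  # 4
```

After two refreshes the standard table holds one row per entity, while the snapshot table holds one row per entity per refresh.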
Backfill from existing history¶
If you already have historical feature snapshots, pass backup_source with
the fully qualified table name to seed the snapshot table at registration time.
The Feature Store clones the backup table (a zero-copy operation) and validates
that it contains the required entity join keys and timestamp column.
Schema evolution¶
Append-only feature views support extend-only schema changes: you can add new
columns to the source, but dropping, reordering, or changing the data type of
existing columns isn’t supported. Re-registering with overwrite=True isn’t
allowed for append-only feature views. If you re-register an existing
append-only feature view as a standard (non-append-only) feature view with
overwrite=True, the accumulated snapshot table is dropped.
Generate point-in-time correct training sets¶
As with regular batch feature views, use generate_dataset with spine_timestamp_col
to build training sets from the accumulated snapshots. For each row in the
spine, the Feature Store performs an ASOF join and selects the most recent
snapshot row for each entity key at or before the spine timestamp. This ensures
that the training set reflects the features as they existed at the time of
each training example, preventing future data from leaking into the model.
The spine_timestamp_col column must also exist in the feature view’s output.
When an append-only feature view is used as a feature source,
spine_timestamp_col is required.
Register an append-only feature view:
Build a point-in-time correct training set from the accumulated snapshots. For each row in the spine, the Feature Store performs an as-of join to select the most recent snapshot at or before the spine timestamp:
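The as-of selection itself can be illustrated without a Snowflake session. The sketch below is plain Python over in-memory rows; `asof_lookup`, the row layout, and the sample data are assumptions, and the real join runs inside the Feature Store.

```python
from datetime import datetime

# Accumulated snapshots: (entity_key, snapshot_timestamp, feature_value).
snapshots = [
    ("u1", datetime(2024, 1, 1), 0.20),
    ("u1", datetime(2024, 1, 2), 0.35),
    ("u2", datetime(2024, 1, 1), 0.80),
]

def asof_lookup(rows, key, ts):
    """Return the most recent value for key at or before ts, or None if none exists."""
    candidates = [(snap_ts, value) for k, snap_ts, value in rows
                  if k == key and snap_ts <= ts]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]

# Spine rows: (entity_key, label_timestamp).
spine = [
    ("u1", datetime(2024, 1, 1, 12)),
    ("u1", datetime(2024, 1, 2, 12)),
    ("u1", datetime(2023, 12, 31)),
]
training_rows = [(key, ts, asof_lookup(snapshots, key, ts)) for key, ts in spine]
print([value for _, _, value in training_rows])  # [0.2, 0.35, None]
```

The first spine row gets 0.2 even though 0.35 exists later, which is the leakage protection described above. The third row predates every snapshot, so no value is joined.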
Iceberg-backed feature views¶
Iceberg-backed feature views store features as Dynamic Iceberg Tables for cross-engine interoperability. Use this pattern when downstream consumers need to read feature data using external engines such as Spark, Trino, or Flink through the Iceberg open table format, or when you want to integrate feature pipelines with a broader data lake architecture.
Note
Requires snowflake-ml-python version 1.26.0 or later. An external volume configured for Iceberg storage is also required.
Iceberg-backed feature views don’t support online feature retrieval today. Use them for batch training, offline feature serving, and cross-engine interoperability scenarios.
Configure storage for Iceberg¶
Use StorageConfig to point the feature view at your external volume. The base_location
specifies the subdirectory within the external volume where Iceberg metadata and data files
are written.
Create an Iceberg-backed feature view¶
Pass storage_config when creating the FeatureView. A refresh_freq is required
because the underlying Dynamic Iceberg Table needs a refresh schedule.
Note
Iceberg supports microsecond precision for timestamp types. If your source data uses
nanosecond precision, cast it to microsecond precision (for example, TIMESTAMP(6)) in
your feature DataFrame.