Parallel Hyperparameter Optimization (HPO) on Container Runtime for ML

The Snowflake ML Hyperparameter Optimization (HPO) API is a model-agnostic framework that enables efficient, parallelized hyperparameter tuning of models using popular tuning algorithms.

Today, this API is available for use within a Snowflake Notebook configured to use the Container Runtime on Snowpark Container Services (SPCS). After you create such a notebook, you can:

  • Train a model using any open-source package, and use this API to distribute the hyperparameter tuning process

  • Train a model using Snowflake ML distributed training APIs, and scale HPO while also scaling each of the training runs

The HPO workload, initiated from the Notebook, executes inside Snowpark Container Services on either CPU or GPU instances, and scales out to the cores (CPUs or GPUs) available on a single node in the SPCS compute pool.

The parallelized HPO API provides the following benefits:

  • A single API that automatically handles all the complexities of distributing the training across multiple resources

  • The ability to train with virtually any framework or algorithm using open-source ML frameworks or the Snowflake ML modeling APIs

  • A selection of tuning and sampling options, including Bayesian and random search algorithms along with various continuous and non-continuous sampling functions

  • Tight integration with the rest of Snowflake; for example, efficient data ingestion via Snowflake Datasets or DataFrames, and automatic ML lineage capture

Examples

This example illustrates a typical HPO use case by first ingesting data from a Snowflake table through the Container Runtime DataConnector API, then defining a training function that creates an XGBoost model. The Tuner interface provides the tuning functionality, based on the given training function and search space.

import xgboost as xgb
from sklearn.metrics import accuracy_score

from snowflake.ml.modeling import tune
from snowflake.ml.modeling.tune import get_tuner_context
# NOTE: the import path for search_algorithm may vary across snowflake-ml-python
# versions; adjust it to match your installed version.
from snowflake.ml.modeling.tune import search_algorithm

# Define a training function, with any models you choose within it.
def train_func():
    # A context object provided by the HPO API that exposes the hyperparameters and data for the current trial
    tuner_context = get_tuner_context()
    config = tuner_context.get_hyper_params()
    dm = tuner_context.get_dataset_map()

    model = xgb.XGBClassifier(**config, random_state=42)
    model.fit(dm["x_train"].to_pandas(), dm["y_train"].to_pandas())
    accuracy = accuracy_score(
        dm["y_train"].to_pandas(), model.predict(dm["x_train"].to_pandas())
    )
    tuner_context.report(metrics={"accuracy": accuracy}, model=model)

tuner = tune.Tuner(
    train_func=train_func,
    search_space={
        "n_estimators": tune.uniform(50, 200),
        "max_depth": tune.uniform(3, 10),
        "learning_rate": tune.uniform(0.01, 0.3),
    },
    tuner_config=tune.TunerConfig(
        metric="accuracy",
        mode="max",
        search_alg=search_algorithm.BayesOpt(),
        num_trials=2,
        max_concurrent_trials=1,
    ),
)

# dataset_map maps each dataset name used in train_func (here "x_train" and "y_train")
# to a DataConnector built from the source Snowflake table; see the API overview below.
tuner_results = tuner.run(dataset_map=dataset_map)
# Access the best result info with tuner_results.best_result

The expected output looks similar to this:

accuracy  should_checkpoint  trial_id   time_total_s  config/learning_rate  config/max_depth  config/n_estimators
1.0       True               ec632254   7.161971      0.118617              9.655             159.799091

The tuner_results object contains all results, the best model, and the best model path.

print(tuner_results.results)
print(tuner_results.best_model)
print(tuner_results.best_model_path)

API overview

The HPO API is in the snowflake.ml.modeling.tune namespace. Its main entry point is the tune.Tuner class. When instantiating this class, you specify the following:

  • A training function that fits a model

  • A search space (tune.SearchSpace) that defines the method of sampling hyperparameters

  • A tuner configuration object (tune.TunerConfig) that defines the search algorithm, the metric to optimize, and the number of trials (a random-search variant of the earlier example is sketched after this list)
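The following sketch shows the same Tuner configured with random search and discrete sampling functions instead of Bayesian optimization. The exact signatures of tune.randint, tune.choice, and tune.RandomSearch are assumptions here; use the help() calls shown below to confirm them for your installed version.

from snowflake.ml.modeling import tune

# Sketch only: random search accepts both discrete and continuous sampling functions.
tuner = tune.Tuner(
    train_func=train_func,  # the training function defined in the earlier example
    search_space={
        "n_estimators": tune.randint(50, 200),       # assumed (lower, upper) signature
        "max_depth": tune.randint(3, 10),            # discrete sampling works with random search
        "booster": tune.choice(["gbtree", "dart"]),  # assumed list-of-options signature
        "learning_rate": tune.uniform(0.01, 0.3),
    },
    tuner_config=tune.TunerConfig(
        metric="accuracy",
        mode="max",
        search_alg=tune.RandomSearch(),  # see Limitations: random search handles discrete spaces
        num_trials=8,
        max_concurrent_trials=4,
    ),
)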

After instantiating Tuner, call its run method with a dataset map (which specifies a DataConnector for each input dataset) to start the tuning process.
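For example, a dataset map for the training function above might be built from Snowpark DataFrames roughly as follows. The table names are placeholders, and the DataConnector.from_dataframe constructor is assumed to be available in your snowflake-ml-python version.

from snowflake.ml.data import DataConnector
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Placeholder table names; each DataConnector wraps one input dataset.
x_train_df = session.table("MY_DB.MY_SCHEMA.TRAIN_FEATURES")
y_train_df = session.table("MY_DB.MY_SCHEMA.TRAIN_LABELS")

dataset_map = {
    "x_train": DataConnector.from_dataframe(x_train_df),
    "y_train": DataConnector.from_dataframe(y_train_df),
}

# Start tuning with the Tuner instance defined earlier.
tuner_results = tuner.run(dataset_map=dataset_map)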

For more information, execute the following Python statements to retrieve documentation on each class:

from snowflake.ml.modeling import tune

help(tune.Tuner)
help(tune.TunerConfig)
help(tune.SearchSpace)

Limitations

Bayesian optimization works only with the uniform sampling function. Bayesian optimization relies on Gaussian processes as surrogate models, and therefore requires continuous search spaces. It is incompatible with discrete parameters sampled using the tune.randint or tune.choice methods. To work around this limitation, either use tune.uniform and cast the parameter inside the training function, or switch to a sampling algorithm that handles both discrete and continuous spaces, such as tune.RandomSearch.
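A minimal sketch of the first workaround follows; it keeps the search space continuous for BayesOpt and casts the sampled values inside the training function, mirroring the structure of the earlier example.

import xgboost as xgb
from sklearn.metrics import accuracy_score

from snowflake.ml.modeling.tune import get_tuner_context

def train_func():
    tuner_context = get_tuner_context()
    config = tuner_context.get_hyper_params()
    dm = tuner_context.get_dataset_map()

    # BayesOpt samples floats; cast them to the integers XGBoost expects.
    model = xgb.XGBClassifier(
        n_estimators=int(config["n_estimators"]),
        max_depth=int(config["max_depth"]),
        learning_rate=config["learning_rate"],
        random_state=42,
    )
    model.fit(dm["x_train"].to_pandas(), dm["y_train"].to_pandas())
    accuracy = accuracy_score(
        dm["y_train"].to_pandas(), model.predict(dm["x_train"].to_pandas())
    )
    tuner_context.report(metrics={"accuracy": accuracy}, model=model)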

Troubleshooting

Error message: Invalid search space configuration: BayesOpt requires all sampling functions to be of type 'Uniform'.

Possible cause: Bayesian optimization works only with uniform sampling, not with discrete sampling. (See Limitations above.)

Possible solutions:

  • Use tune.uniform and cast the result in your training function.

  • Switch to the RandomSearch algorithm, which accepts both discrete and continuous samples.

Error message: Insufficient CPU resources. Required: 16, Available: 8. (The message may refer to CPUs or GPUs, and the required and available numbers may differ.)

Possible cause: max_concurrent_trials is set to a value higher than the number of cores available on the node.

Possible solution: The full error message describes several options you can try.
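For the resource error, one of those options may be to lower max_concurrent_trials so that it does not exceed the cores available on the node. For example, on a node with 8 available CPUs (the numbers here are illustrative):

tuner_config = tune.TunerConfig(
    metric="accuracy",
    mode="max",
    search_alg=search_algorithm.BayesOpt(),
    num_trials=16,
    max_concurrent_trials=8,  # keep this at or below the cores available on the node
)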