ML Jobs in Data Clean Rooms¶
Overview¶
ML Jobs enables collaborators to run complex, resource-intensive machine learning workflows within Collaboration Data Clean Rooms. Instead of running Python code as UDFs or stored procedures on a standard warehouse, ML Jobs runs full ML workloads in an isolated container environment. This enables collaborators across advertising, retail media, and measurement to train models on combined data without exposing raw user-level data to any party.
ML Jobs extends the existing collaboration specification and template submission workflow. Template providers stage artifacts and register ML Jobs code specs, then submit templates for approval. Analysis runners review and approve the templates, then run ML Jobs on their own compute pool.
Benefits¶
- Simplified development: Stage Python scripts and specify pip dependencies. No need to build or manage Docker images.
- Distributed training: Run workloads across multiple nodes with GPU support for accelerated training and inference. Snowflake distributed trainers for XGBoost, LightGBM, and PyTorch scale automatically across compute pool nodes.
- Hyperparameter optimization: Use the built-in HPO API (snowflake.ml.modeling.tune) for Bayesian optimization, random search, and grid search. Run parallel trials with no additional dependencies.
- Multi-file projects: Stage entire project directories with multiple modules, libraries, and model artifacts.
- Flexible compute: Select the compute pool size and instance family to match the workload.
- Job monitoring: Track progress through container logs, status checks, and result retrieval.
For more information about ML Jobs capabilities, see Snowflake ML Jobs.
Use cases¶
ML Jobs supports a range of use cases across advertising, retail media, financial services, and other industries:
- Lookalike modeling: Train classifiers on seed audiences and score users across impression and campaign data. Applicable to CTV, retail media networks (RMN), and digital advertising.
- Measurement and incrementality: Run sales lift studies and incrementality models by joining ad impressions with purchase or conversion data across collaborators. Use distributed HPO to optimize uplift model hyperparameters for each campaign automatically.
- Attribution modeling: Build multi-touch attribution models across campaign logs, impressions, and conversion events from multiple data sources without exposing raw user-level data.
- Propensity scoring: Score user populations for purchase propensity, churn risk, or lifetime value using ad exposure, engagement, and transaction features from multiple parties.
- Audience segmentation: Cluster and segment users using ML techniques on combined impression logs, CRM data, and behavioral signals from publishers, advertisers, and data providers.
- Distributed model training: Train large models using Snowflake distributed trainers (XGBoost, LightGBM, PyTorch) that automatically scale across multiple nodes and GPUs in the compute pool.
- Custom ML workflows: Run any containerized Python ML workload inside a secure clean room environment, including proprietary models, pre-trained model inference, and feature engineering on campaign and user data.
Requirements¶
- Both accounts in the collaboration must have the latest version of the Snowflake Data Clean Rooms environment installed.
- The analysis runner must have a compute pool available for running ML Jobs. GPU-accelerated workloads require a GPU compute pool. GPU availability varies by region; see Compute pool errors in the troubleshooting guide.
ML Jobs code spec¶
An ML Jobs code spec defines one or more containerized ML workloads that run on a compute pool.
For the full field reference, including stage_code_dir requirements and image_tag options,
see Code specification — ML Jobs.
User flow: template provider¶
The template provider defines and submits the ML Job spec and the template for execution in the collaboration.
1. Stage artifacts¶
Stage your code, model files, and private libraries in an internal stage. The stage must have a directory table enabled and use Snowflake-managed encryption.
Upload your Python scripts using SnowSQL or the Snowflake CLI.
After uploading, refresh the stage directory so that the collaboration can access the files.
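The three staging steps above can be sketched as a small Python helper that assembles the statements to run. This is a minimal sketch under assumptions: the stage name, local directory, and stage subdirectory are hypothetical placeholders, and the helper only builds strings, so you still execute them with your own client (the PUT command runs from SnowSQL, the Snowflake CLI, or a connector, not from a worksheet).

```python
def staging_statements(stage: str, local_dir: str, stage_subdir: str) -> list[str]:
    """Build the statements that create, load, and refresh a code stage.

    All three arguments are illustrative placeholders, not names required
    by the clean room.
    """
    return [
        # Internal stage with a directory table and Snowflake-managed
        # (server-side) encryption.
        f"CREATE STAGE IF NOT EXISTS {stage} "
        "DIRECTORY = (ENABLE = TRUE) "
        "ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')",
        # Upload the scripts uncompressed so the staged file names match
        # the local file names.
        f"PUT file://{local_dir}/*.py @{stage}/{stage_subdir} "
        "AUTO_COMPRESS = FALSE OVERWRITE = TRUE",
        # Refresh the directory table so the uploaded files are listed.
        f"ALTER STAGE {stage} REFRESH",
    ]


statements = staging_statements(
    "my_db.my_schema.ml_code_stage",  # hypothetical stage name
    "/tmp/ml_project",                # hypothetical local project directory
    "train",                          # hypothetical subdirectory on the stage
)
for statement in statements:
    print(statement + ";")
```

Keeping the three statements together makes it harder to forget the refresh, which is the step that makes newly uploaded files visible through the stage directory.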
2. Register the ML Jobs code spec¶
Register an ML Jobs code spec that references the staged artifacts. The code spec must include an ml_jobs section with at least one ML job definition.
For the full field reference, see Code specification — ML Jobs.
3. Register templates¶
Register a template for each step in your pipeline. Each template calls the ML job procedure generated from the code spec.
4. Submit the template for approval¶
Submit the template to the collaboration for approval using the standard template approval flow. The collaborator validates the code and artifacts using hashes to ensure compliance.
5. Update code as needed¶
As the code evolves, update the ML Jobs code spec and resubmit for approval.
User flow: analysis runner¶
The analysis runner reviews, approves, and executes the ML Job within the collaboration.
1. Review and approve the template¶
Review the template approval request and inspect the artifacts. Validate the code and artifacts using hashes to ensure compliance. Approve the pending template.
Note
If code review is needed, it should be done offline.
2. Set up the compute pool¶
ML Jobs run in containers on a compute pool instead of a warehouse. Create a compute pool for the installed clean room application.
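As a rough illustration of the compute pool step, the helper below assembles a CREATE COMPUTE POOL statement. It is a sketch under assumptions: the pool name is hypothetical, the instance family is only an example value, and any privileges the clean room application needs on the pool are not covered here.

```python
def create_compute_pool_sql(
    name: str,
    instance_family: str,
    min_nodes: int = 1,
    max_nodes: int = 1,
) -> str:
    """Build a CREATE COMPUTE POOL statement for an ML Jobs workload."""
    if not 1 <= min_nodes <= max_nodes:
        raise ValueError("expected 1 <= min_nodes <= max_nodes")
    return (
        f"CREATE COMPUTE POOL IF NOT EXISTS {name} "
        f"MIN_NODES = {min_nodes} "
        f"MAX_NODES = {max_nodes} "
        f"INSTANCE_FAMILY = {instance_family}"
    )


# Hypothetical pool name; CPU_X64_M is used here only as an example family.
pool_sql = create_compute_pool_sql("ml_jobs_pool", "CPU_X64_M", max_nodes=2)
print(pool_sql + ";")
```

For GPU-accelerated jobs, pass a GPU instance family that is available in your region instead of the CPU family shown.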
3. Run the ML Job¶
Call COLLABORATION.RUN with the template and pass the compute pool name as a parameter. The call returns a job ID.
The template value is the template ID, which is the template name and version joined by an underscore. For example,
if you registered a template named my_train_template with version v1, the template ID is my_train_template_v1.
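The naming rule is simple enough to express directly. The helper name template_id below is illustrative and not part of the clean room API.

```python
def template_id(name: str, version: str) -> str:
    """Join a template name and version with an underscore."""
    return f"{name}_{version}"


run_template = template_id("my_train_template", "v1")
print(run_template)
```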
4. Monitor the ML Job¶
Use RUN_ML_JOB_ACTION to check logs, status, and results. Pass a YAML specification with the job ID returned by
COLLABORATION.RUN and the action to perform.
The action field is case-insensitive. Valid actions are:
- get_status: Returns the current execution status of the job. Possible values: PENDING, RUNNING, DONE, FAILED. Poll this action to know when the job completes.
- get_logs: Returns the container’s stdout/stderr output. Use this to monitor progress, debug errors, or view script print statements. Only available if allow_monitoring is true in the code spec and the collaboration owner has not disabled ALLOW_ML_JOBS_MONITORING.
- get_result: Returns the job’s return value after completion. For scripts that write to cleanroom tables rather than returning data directly, this returns NULL. Check get_status first to confirm the job is DONE before calling get_result.
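The polling pattern described for get_status could look like the following. This is a minimal sketch, not the clean room API: get_status is a hypothetical callable standing in for whatever code calls RUN_ML_JOB_ACTION with the get_status action and returns the status string, and wait_for_job, poll_seconds, and timeout_seconds are illustrative names.

```python
import time
from typing import Callable

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {"DONE", "FAILED"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reports DONE or FAILED, then return that status."""
    waited = 0.0
    while True:
        status = get_status().upper()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status} after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for real status calls, used only to exercise the loop.
fake_statuses = iter(["PENDING", "RUNNING", "RUNNING", "DONE"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0.0)
print(final_status)
```

Call get_result only when the returned status is DONE. A FAILED status is a cue to inspect get_logs, if log access is enabled.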
These actions map to the underlying Snowflake ML Jobs management API. For more details on job lifecycle and status values, see Snowflake ML Jobs.
Note
Log access (get_logs) is controlled by the ALLOW_ML_JOBS_MONITORING collaboration configuration, which is
true (enabled) by default. Only the collaboration owner can change it, and only the owner’s setting applies to the
collaboration. To disable or re-enable log access, the owner calls SET_CONFIGURATION.
When disabled, get_logs returns an error, but get_status and get_result still work.
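The interaction between the two monitoring settings and the three actions can be modeled in a few lines. The sketch below is only a local model of the rules stated in this section. The function and parameter names are illustrative, and the real checks are enforced by the clean room itself.

```python
VALID_ACTIONS = ("get_status", "get_logs", "get_result")


def normalize_action(action: str) -> str:
    """Lower-case an action name and reject anything that is not a valid action."""
    normalized = action.lower()
    if normalized not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return normalized


def is_action_available(
    action: str,
    allow_monitoring: bool,
    collaboration_monitoring_enabled: bool = True,
) -> bool:
    """Return True if the action is expected to succeed.

    allow_monitoring mirrors the code spec field of the same name.
    collaboration_monitoring_enabled mirrors ALLOW_ML_JOBS_MONITORING,
    which is enabled by default.
    """
    if normalize_action(action) == "get_logs":
        # Logs need both the code spec and the collaboration to allow them.
        return allow_monitoring and collaboration_monitoring_enabled
    # Status and result checks do not depend on the monitoring settings.
    return True


print(is_action_available("GET_LOGS", allow_monitoring=True,
                          collaboration_monitoring_enabled=False))
```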
5. Activate results¶
Activate the output audiences or measurement metrics back to collaborators or third parties using the standard activation template flow.
Examples¶
- Lookalike audience modeling (ML Jobs option)
- Multi-party incrementality measurement with automated HPO
Tips and patterns¶
For troubleshooting common ML Jobs errors (template call pattern, compute pool issues), see Troubleshooting Collaboration Data Clean Rooms — ML Jobs.
Accessing collaborator data¶
Unlike UDF or procedure templates that execute inline SQL, ML Jobs scripts can’t reference cleanroom.source_table_0 directly because that view is not created for ML Jobs execution.
Instead, pass source table references into the container through the args parameter using OBJECT_CONSTRUCT.
In your Python script, read the source table references from the args JSON.
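One way to handle the script side is a small parser for the args JSON. This is a sketch under assumptions: the key name source_tables, the extra label_column key, and the table names are hypothetical, and how the JSON string reaches your script depends on how your job definition passes args, which is not shown here.

```python
import json


def read_source_tables(raw_args: str, key: str = "source_tables") -> list[str]:
    """Extract the list of source table references from the args JSON.

    The key name is an assumption; use whatever key your template
    passes in its OBJECT_CONSTRUCT call.
    """
    args = json.loads(raw_args)
    tables = args.get(key)
    if not isinstance(tables, list) or not tables:
        raise ValueError(f"args JSON must contain a non-empty list under {key!r}")
    return [str(table) for table in tables]


# Hypothetical payload shaped like the output of OBJECT_CONSTRUCT.
example_args = json.dumps(
    {
        "source_tables": ["PROVIDER_DB.SHARED.IMPRESSIONS", "CONSUMER_DB.SHARED.PURCHASES"],
        "label_column": "CONVERTED",
    }
)
source_tables = read_source_tables(example_args)
print(source_tables)
```

Failing fast on a missing or empty list gives a clearer error in the container logs than a later failure when the script tries to read a table that was never passed in.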
Writing results for activation¶
For activation to work, the scoring script must write results to a cleanroom table that the activation template reads.
Use session.create_dataframe(df).write.save_as_table("cleanroom.<table_name>", mode="overwrite") in the scoring
script, and reference the same table in the activation template.
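Wrapping the write in a helper keeps the table name in one place, so the scoring script and the activation template are less likely to drift apart. The sketch below assumes session is a Snowpark Session already available in the script and scores is a pandas DataFrame or a list of rows. The helper name write_results and the example table name are illustrative.

```python
def write_results(session, scores, table_name: str) -> str:
    """Overwrite cleanroom.<table_name> with the scored rows.

    Returns the fully qualified table name so the caller can log it
    or compare it with the name used by the activation template.
    """
    target = f"cleanroom.{table_name}"
    # Same call chain as described above: build a Snowpark DataFrame,
    # then replace the contents of the target table.
    session.create_dataframe(scores).write.save_as_table(target, mode="overwrite")
    return target


# In the scoring script (session and scores_df come from your own code):
# written_table = write_results(session, scores_df, "lookalike_scores")
```

Because mode="overwrite" replaces the table on every run, each job run leaves only its own results for the activation template to read.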
Scaling with distributed training¶
For large datasets (tens of millions of rows or more), use Snowflake distributed trainers instead of single-node training. The ML Jobs container runtime includes a Ray cluster that distributed trainers use automatically.
When to use distributed training:
Distributed training scales a job out to multiple nodes when the memory or compute needed to train a model exceeds what a single compute pool node can provide, whether that node is a high-memory CPU node or a large GPU node.
Code change: replace raw XGBoost training with the Snowflake distributed trainer.
Template change: increase num_instances to run the job across multiple nodes.
The num_instances parameter controls how many compute pool nodes the job runs across. The distributed trainer
automatically discovers and uses all available nodes. Size the compute pool accordingly.
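A quick pre-flight check can catch a mismatch between num_instances and the pool size before the job is submitted. This sketch assumes the pool must be able to scale to at least num_instances nodes, that is, its MAX_NODES setting is at least num_instances. The function name is illustrative.

```python
def check_pool_size(num_instances: int, max_nodes: int) -> None:
    """Raise ValueError if the pool cannot host the requested node count."""
    if num_instances < 1:
        raise ValueError("num_instances must be at least 1")
    if num_instances > max_nodes:
        raise ValueError(
            f"num_instances={num_instances} exceeds the pool's MAX_NODES={max_nodes}; "
            "increase MAX_NODES or lower num_instances"
        )


# A 4-node job fits a pool that can scale to 4 nodes.
check_pool_size(num_instances=4, max_nodes=4)
```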
For more information, see Distributed training.