Migrating code with Cortex Code and the Snowpark Migration Accelerator
Migrating with Cortex Code¶
Cortex Code is the primary migration tool for any Spark-to-Snowflake migration. Skills bundled with the Cortex Code CLI can work alongside the Snowpark Migration Accelerator (SMA), or entirely on their own, to migrate your code to Snowflake.
- spark-migration: The primary orchestrator for any Spark-to-Snowflake migration. Its user guide is on this page.
- snowpark-connect: Migrates scripts and notebooks with Spark references to Snowpark Connect using the Cortex Code CLI. This skill is bundled as part of the spark-migration skill.
Spark Migration skill user guide¶
What this skill does¶
The spark-migration skill orchestrates an end-to-end migration of PySpark or Spark Scala code to Snowflake. It guides you through code conversion (LLM-powered or with the SMA), issue tracking and resolution, and test scaffolding. It can also convert notebooks that reference the Spark API.
Possible prerequisites¶
Before the skill begins, it checks that the following resources are available locally:
- Git must be installed to run the validation component (auto-installs via Homebrew on macOS if missing)
- The SMA CLI is required if you choose the Snowpark API conversion path (the skill will search for it or ask you to provide the path)
Step-by-step walkthrough¶
Step 1: Load configuration¶
This skill creates a configuration file to save settings and project characteristics that you may want to keep from project to project. The skill will check for saved project configurations from previous runs automatically.
What happens:
- If saved configurations exist, you’ll see a numbered list and be asked whether to reuse one or create a new configuration.
- If no configurations exist, you’ll be asked for basic project information.
The fields saved in a configuration file are listed below. If you do not want to save a specific configuration, you can tell Cortex Code to “use default values for all fields”.
You will be prompted for:
- Project Name (required): used as the config filename
- Source Code Path (your PySpark source)
- Output Folder (where converted code goes)
- Customer Email
- Customer Company
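The saved configuration described above can be pictured as a small record written to a file named after the project. The following is a minimal sketch under stated assumptions: the field names, the JSON format, and the `save_config` and `list_configs` helpers are all hypothetical illustrations, not the skill's actual storage format.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ProjectConfig:
    # Hypothetical field names that mirror the prompts listed above.
    project_name: str
    source_code_path: str = ""
    output_folder: str = ""
    customer_email: str = ""
    customer_company: str = ""


def save_config(config: ProjectConfig, config_dir: Path) -> Path:
    """Write the config to <config_dir>/<project_name>.json and return the path."""
    if not config.project_name.strip():
        raise ValueError("Project Name is required")
    config_dir.mkdir(parents=True, exist_ok=True)
    path = config_dir / f"{config.project_name}.json"
    path.write_text(json.dumps(asdict(config), indent=2), encoding="utf-8")
    return path


def list_configs(config_dir: Path) -> list[str]:
    """Return saved project names, sorted, for a numbered reuse-or-create menu."""
    if not config_dir.is_dir():
        return []
    return sorted(p.stem for p in config_dir.glob("*.json"))
```

Using the project name as the filename is what makes Project Name the one required field in this sketch.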
Step 2: Review configuration¶
A full summary of all the settings will be displayed, including defaults for optional parameters.
What happens:
- You see the complete configuration with current/default values.
- You choose “Use these settings” or “Edit settings”.
You will be prompted if:
- This is a first run (all parameters shown for you to fill in)
- You choose to edit (you only specify the numbers you want to change)
Key settings you can configure here:
| Setting | Options | Default |
|---|---|---|
| Conversion Type | snowpark-connect (Snowpark Connect) / snowpark_api (with the SMA CLI) | snowpark-connect |
| Migration Status | migrate (run conversion) / already_migrated (use existing output) | migrate |
| Run Notebook Migration | yes / no | yes |
| Run EWI Fixer | yes / no | yes |
| Run Stage Conversion | yes / no | yes |
| Run Validation (DVP) Orchestrator | yes / no | yes |
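The "edit by number" interaction can be sketched as follows. The option values and defaults come from the table above; the dictionary keys, the numbering order, and the `apply_edits` helper are assumptions made for illustration only.

```python
# Defaults taken from the settings table; key names are hypothetical.
DEFAULT_SETTINGS = {
    "conversion_type": "snowpark-connect",
    "migration_status": "migrate",
    "run_notebook_migration": "yes",
    "run_ewi_fixer": "yes",
    "run_stage_conversion": "yes",
    "run_validation": "yes",
}

YES_NO = {"yes", "no"}
ALLOWED_VALUES = {
    "conversion_type": {"snowpark-connect", "snowpark_api"},
    "migration_status": {"migrate", "already_migrated"},
    "run_notebook_migration": YES_NO,
    "run_ewi_fixer": YES_NO,
    "run_stage_conversion": YES_NO,
    "run_validation": YES_NO,
}


def apply_edits(settings: dict, edits: dict) -> dict:
    """Return a copy of settings with 1-based numbered edits applied."""
    keys = list(DEFAULT_SETTINGS)
    updated = dict(settings)
    for number, value in edits.items():
        if not 1 <= number <= len(keys):
            raise ValueError(f"No setting numbered {number}")
        key = keys[number - 1]
        if value not in ALLOWED_VALUES[key]:
            raise ValueError(f"{value!r} is not a valid value for {key}")
        updated[key] = value
    return updated
```

For example, `apply_edits(DEFAULT_SETTINGS, {1: "snowpark_api"})` changes only the conversion type and leaves the other defaults in place.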
Step 3: Route based on migration status¶
No user interaction: the skill reads your configuration and routes automatically:
- already_migrated → Step 4 (validate existing output)
- migrate → Step 5 (choose conversion tool)
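The routing rule is small enough to express as a lookup. This is an illustrative sketch; `next_step` is a hypothetical name, and only the two status values and their target steps come from this guide.

```python
def next_step(migration_status: str) -> int:
    """Map the configured migration status to the step that runs next."""
    routes = {"already_migrated": 4, "migrate": 5}
    if migration_status not in routes:
        raise ValueError(f"Unknown migration status: {migration_status!r}")
    return routes[migration_status]
```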
Step 4: Validate existing output (only if already_migrated)¶
If you already have output from a previous conversion, produced by this skill or by the SMA, you can choose already_migrated. After validation, this moves you to the Git setup step (Step 8).
What happens:
- The skill validates that Output/ and Reports/ directories exist at your output path.
- It auto-detects SMA v1 format (Conversion-* timestamped folders).
You will be prompted if:
- The output path is invalid: you’ll be asked to provide the correct one.
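The checks above can be approximated with a few filesystem tests. This sketch assumes that, in the SMA v1 layout, Output/ and Reports/ sit inside the Conversion-* folder and that the folder names sort chronologically; both are assumptions, and `validate_output` is a hypothetical helper rather than the skill's own logic.

```python
from pathlib import Path


def validate_output(output_path: Path) -> dict:
    """Check for Output/ and Reports/ and note any Conversion-* folders."""
    root = Path(output_path)
    conversion_dirs = sorted(p for p in root.glob("Conversion-*") if p.is_dir())
    # Assumption: with a v1 layout, look inside the last Conversion-* folder.
    base = conversion_dirs[-1] if conversion_dirs else root
    missing = [name for name in ("Output", "Reports") if not (base / name).is_dir()]
    return {
        "base": base,
        "sma_v1_layout": bool(conversion_dirs),
        "missing": missing,
    }
```

A non-empty `missing` list corresponds to the case where you would be asked for the correct output path.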
Step 5: Choose conversion tool (only if migrate)¶
If your configuration file does not specify a conversion path, you will be prompted to choose one. This is unusual, because the conversion type is normally set in Step 2.
You will be prompted for:
- Output path: where to save the converted code
- Conversion tool: two options:
- Snowpark API: uses the SMA CLI binary (requires installation)
- Snowpark Connect: uses the bundled snowpark-connect sub-skill (AI-driven conversion)
Step 6: Snowpark API conversion (if you chose that option)¶
What happens:
- The SMA CLI binary is located/validated.
- The conversion runs in the background (can take several minutes for large workloads).
- Progress is monitored and reported to you.
You will be prompted for (if not already configured):
- SMA CLI path (if not found automatically)
- Enable Jupyter Conversion? (Y/N)
- Source SQL “flavor” for any embedded SQL (SparkSql / HiveSql / Databricks)
- Generate Checkpoints? (Y/N)
What to expect: The SMA CLI processes your script and notebook files, and produces converted Snowpark files in the output directory, along with CSV reports documenting all issues found during conversion.
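To get a quick feel for a CSV issue report, you can count rows per issue code. The sketch below is illustrative: the column name `Code` is an assumption about the report layout, so check the header row of your own Issues.csv and pass the matching name.

```python
import csv
from collections import Counter
from pathlib import Path


def count_issues(issues_csv: Path, column: str = "Code") -> Counter:
    """Count report rows per value of the given column (column name is assumed)."""
    with open(issues_csv, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"Column {column!r} not found in {issues_csv}")
        return Counter(row[column] for row in reader)
```

Calling `count_issues(path).most_common(5)` then lists the five most frequent codes in that report.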
Step 7: Snowpark Connect conversion (if you chose that option)¶
What happens:
- The bundled snowpark-connect sub-skill is loaded.
- It detects whether your code is Python or Scala and routes to the appropriate migration workflow.
- An AI-driven analysis and fix pipeline converts your code.
You will NOT typically be prompted: the sub-skill runs autonomously with the project information already collected.
What to expect: The sub-skill analyzes your code for compatibility issues, creates a conversion folder, applies fixes, updates imports/session creation, adds migration headers, and generates reports similar to what would be generated by the SMA (Issues.csv, etc.).
Step 8: Initialize Git and verify output¶
The validation component requires a Git repository, so the skill will create one or ask you to create one. This step may be skipped if you do not plan to use the skill’s validation sub-skill.
What happens (automatic, no prompts):
- A Git repository is initialized at the resolved output directory.
- An initial commit captures the unmodified conversion output on the main branch.
- A sma/migration-process branch is created for all subsequent modifications.
- The output structure (Output/, Reports/Issues.csv) is verified.
You will be prompted if:
- The directory is already a Git repo with uncommitted changes: you’ll choose to stash, commit, or abort.
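The Git setup in this step corresponds roughly to the command sequence below. This is a sketch of equivalent commands, not the skill's implementation: the commit message and helper names are hypothetical, `--initial-branch` requires Git 2.28 or later, and the commit assumes `user.name` and `user.email` are already configured.

```python
import subprocess
from pathlib import Path

BASELINE_BRANCH = "main"
WORK_BRANCH = "sma/migration-process"


def baseline_commands() -> list[list[str]]:
    """Commands that snapshot the conversion output and open a working branch."""
    return [
        ["git", "init", f"--initial-branch={BASELINE_BRANCH}"],
        ["git", "add", "--all"],
        ["git", "commit", "-m", "Baseline: unmodified conversion output"],
        ["git", "checkout", "-b", WORK_BRANCH],
    ]


def has_uncommitted_changes(repo: Path) -> bool:
    """True if `git status --porcelain` reports any modified or untracked files."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())


def initialize(repo: Path) -> None:
    """Run the baseline commands in order, stopping at the first failure."""
    for command in baseline_commands():
        subprocess.run(command, cwd=repo, check=True)
```

A `True` result from `has_uncommitted_changes` on an existing repository corresponds to the stash, commit, or abort prompt described above.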
Step 9: Dashboard generation¶
The dashboard is generated from the reports created in the conversion step(s) above. It is served by a local Python server that the skill starts.
What happens (automatic):
- The sma-dashboard-generator skill parses Reports/Issues.csv.
- An interactive EWI (Errors, Warnings, Issues) tracking dashboard is generated.
- A local web server starts and opens the dashboard in your browser.
No user prompts. The dashboard opens automatically at http://localhost:8080 (or the next available port).
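"The next available port" can be found by probing ports upward from 8080 until one can be bound. The following is a minimal sketch of that idea; `next_available_port` is a hypothetical helper, and the attempt limit of 50 is an arbitrary choice.

```python
import socket


def next_available_port(start: int = 8080, attempts: int = 50,
                        host: str = "127.0.0.1") -> int:
    """Return the first port at or above `start` that can be bound on `host`."""
    for port in range(start, min(start + attempts, 65536)):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # Port is in use or not permitted; try the next one.
            return port
    raise RuntimeError(f"No free port found starting at {start}")
```

The probe socket is closed before the port is returned, so another process could still claim the port before a server binds it. A real server should handle that bind failure rather than rely on the probe alone.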
Step 10: Notebook migration¶
What happens:
- If configured as yes (default): notebooks are scanned and converted automatically using the snowflake-notebooks-migration sub-skill that is bundled with the spark-migration skill.
- If not configured: the skill will still scan for notebook files and prompt you to run the sub-skill.
You will be prompted if:
- The setting was not pre-configured AND notebooks are found: you’ll be asked whether to run notebook migration.
What to expect: Notebook files (.ipynb, .python, .scala, .sql, Databricks .py) are converted to Snowflake Workspace format in place.
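The notebook scan can be approximated by matching file extensions. In this sketch, `.py` files count as notebooks only when they begin with the `# Databricks notebook source` header that Databricks writes into exported source notebooks. Treating every `.scala` and `.sql` file as a notebook is a simplification, and `find_notebooks` is a hypothetical helper.

```python
from pathlib import Path

NOTEBOOK_SUFFIXES = {".ipynb", ".python", ".scala", ".sql"}
DATABRICKS_HEADER = "# Databricks notebook source"


def _is_databricks_export(path: Path) -> bool:
    """True if the first line of the file is the Databricks export header."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.readline().strip() == DATABRICKS_HEADER


def find_notebooks(source_root: Path) -> list[Path]:
    """Return notebook-like files under source_root, sorted by path."""
    found = []
    for path in sorted(Path(source_root).rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in NOTEBOOK_SUFFIXES:
            found.append(path)
        elif suffix == ".py" and _is_databricks_export(path):
            found.append(path)
    return found
```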
Step 11: EWI fixer¶
Automatically resolves conversion issues (EWIs) in the converted code using AI. This step is necessary for Snowpark API runs. It is not required for Snowpark Connect runs, although the snowpark-connect sub-skill can still output EWIs.
EWIs are recorded in a report (Reports/Issues.csv) and written as comments in the output code. In this step, you can choose whether to delete those inline comments.
What happens:
- If configured as yes (default): runs automatically with saved options.
- If not configured: you’re asked whether to run it.
You will be prompted for (if not pre-configured):
- Run EWI Fixer? (Yes / No)
- EWI comment handling: Mark (keep comments with [FIXED]/[NOT-FIXED] prefix) or Remove (delete after fixing).
- Which EWIs to process: Only pending / Retry not_auto_resolved / Specific EWI code / All (reset)
What to expect: The fixer reads EWI comments in your converted files, attempts to resolve each one, and updates the SQLite database with results. The dashboard will reflect the updated status.
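The two comment-handling modes can be illustrated with a small text transform. This sketch assumes EWI comments look like `#EWI: <code> => <message>`, which may not match your output exactly. It also assumes that Remove deletes only the comments for fixed EWIs and leaves the rest untouched. `handle_ewi_comments` is a hypothetical helper, not the fixer itself.

```python
import re

# Assumed comment shape: optional indent, "#EWI:", a code, "=>", a message.
EWI_COMMENT = re.compile(r"^(\s*)#\s*EWI:\s*(\S+)\s*=>\s*(.*)$")


def handle_ewi_comments(source: str, fixed_codes: set, mode: str = "mark") -> str:
    """Mark EWI comments as fixed or not fixed, or remove the fixed ones."""
    if mode not in {"mark", "remove"}:
        raise ValueError("mode must be 'mark' or 'remove'")
    output = []
    for line in source.splitlines():
        match = EWI_COMMENT.match(line)
        if not match:
            output.append(line)
            continue
        indent, code, message = match.groups()
        fixed = code in fixed_codes
        if mode == "remove":
            if not fixed:
                output.append(line)  # Unresolved comments stay as they are.
            continue
        status = "[FIXED]" if fixed else "[NOT-FIXED]"
        output.append(f"{indent}#{status} EWI: {code} => {message}")
    return "\n".join(output)
```

Lines that already carry a status prefix no longer match the pattern, so running the transform twice leaves them unchanged.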
Step 12: Stage conversion¶
Replaces embedded file paths (s3://, hdfs://, etc.) with Snowflake stage references (@stage_name/...).
What happens:
- If configured as yes (default): runs automatically with the configured stage name.
- If not configured: you’re asked whether to run it.
You will be prompted if:
- The setting was not pre-configured: you’ll be asked whether to replace embedded file paths.
What to expect: All cloud storage paths in your converted code are replaced with Snowflake internal stage references using the configured prefix (default: migration_stage).
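The path replacement can be pictured as a regular-expression rewrite. This sketch assumes the scheme and the bucket or host are dropped and the rest of the path is kept under the stage; the skill's actual mapping may differ. The list of URI schemes and the `to_stage_path` helper are illustrative.

```python
import re

# Assumed set of schemes; extend it to match the paths in your own code.
CLOUD_URI_PREFIX = re.compile(r"\b(?:s3a?|hdfs|gs|abfss?|wasbs?)://[^/\s'\"]+/?")


def to_stage_path(text: str, stage_name: str = "migration_stage") -> str:
    """Replace each cloud URI prefix (scheme plus bucket or host) with @stage/."""
    return CLOUD_URI_PREFIX.sub(lambda _match: f"@{stage_name}/", text)
```

For example, `to_stage_path('spark.read.parquet("s3://my-bucket/data/events")')` returns `spark.read.parquet("@migration_stage/data/events")`. The files themselves still need to be uploaded to that stage separately.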
Step 13: DVP orchestrator¶
Sets up a Data Validation Pipeline (DVP) workspace for testing the migrated code.
What happens:
- If configured as yes (default): runs automatically. No prompt.
- If configured as no: skips entirely.
What to expect: The DVP orchestrator creates a dvp/ workspace and runs up to 8 sub-skills:
- Create DVP workspace structure
- Convert notebooks to scripts (if applicable)
- Generate Abstract Syntax Graph (ASG) from source files
- Identify entrypoints in the code
- Adapt code for testing
- Identify I/O schemas (inputs/outputs)
- Generate synthetic test data
- Generate test setup and register test suites
Step 14: Final dashboard and summary¶
What happens (automatic):
- The SMA Dashboard is reopened in your browser showing the final state (all EWI fixes, test registrations, etc.)
- A final summary table is displayed showing the status of every step
Where you’ll usually be prompted: quick reference¶
| Step | Prompt | When |
|---|---|---|
| 1 | Project name, source path, output path, email, company | First run or creating new config |
| 2 | “Use these settings” or “Edit settings” | Every run (with saved config) |
| 4 | Output path (if invalid) | Only if already_migrated and path is wrong |
| 5 | Output path, conversion tool choice | Only if migrate |
| 6 | SMA CLI path, Jupyter/SQL/Checkpoints options | Only if Snowpark API and not pre-configured |
| 8 | How to handle dirty git state | Only if existing repo has uncommitted changes |
| 10 | “Run Notebook Migration?” | Only if not pre-configured AND notebooks found |
| 11 | “Run EWI Fixer?”, comment mode, scope | Only if not pre-configured |
| 12 | “Run Stage Conversion?” | Only if not pre-configured |
On subsequent runs: If you’ve saved a configuration, most prompts are skipped. The skill uses your saved preferences and runs through the pipeline with minimal interaction.
Output structure¶
If you went through all the steps, upon completion your output directory will contain:
If you did not complete all the steps, only the artifacts related to the steps you executed will be present in the output.
Git branches¶
The skill maintains two branches:
- main: the original, unmodified conversion output (your baseline)
- sma/migration-process: all fixes and modifications applied by subsequent steps
Re-running the skill¶
Configurations are saved per-project. On subsequent runs:
- You’ll see your saved config and can reuse it with a single selection
- All defaults from your previous choices are preserved
- You can selectively re-run individual sub-skills (EWI Fixer, Stage Conversion, etc.) independently by invoking them directly
Dashboard access¶
After the workflow completes, you can reopen the dashboard at any time: