Cortex AI Function Studio

Cortex AI Function Studio features a Cortex Code Skill for creating, evaluating, and optimizing production-ready Cortex AI Functions for unstructured data workflows. It provides a structured development lifecycle that automates prompt engineering, model selection, evaluation, and optimization.

Cortex AI Function Studio provides two primary interfaces for authoring, evaluating, and optimizing AI Functions:

  • Cortex Code CLI: A command-line experience built for AI and Data Engineers, supporting scriptable workflows, agentic task definition, and rapid iteration within development environments.
  • Snowsight AI Studio (Guided): A native Snowflake UI built for Analysts and Data Scientists that provides a guided, no-code experience for creating, benchmarking, optimizing, and deploying AI Functions.

Create

Getting started is as simple as prompting Cortex Code in Snowsight or the Cortex Code CLI:

/cortex-ai-function-studio

This command initiates the AI Function Studio workflow and guides you through the full lifecycle of creating, evaluating, and optimizing a custom AI Function.

The Cortex AI Function Studio guides you through the full lifecycle of custom AI functions.
The intended workflow is create → evaluate → optimize.

During creation, you choose how to build: Direct (simple AI_COMPLETE call) or Agent
Research (I research and propose approaches with SQL pre/post-processing; you can also
specify your own strategy). After building, evaluate against labeled data, then optimize
with automated function body optimization and model selection.

If you're new to custom AI functions, you can start with a demo to see a worked example
end-to-end.

Define task: Users specify the AI function’s objective, including the task description, expected inputs, and desired output format (for example: summaries, structured JSON, classifications, or generated answers).

Note

AI Function Studio supports multimodal workflows based on model availability, including text, documents, images, audio, and video inputs.

[Figure: Cortex AI Function Studio Define task Questions form with database and task fields]

The AI Function Studio automatically selects a model for the task, though you can override the selection. In this example, because the staged files are PDFs, the system infers that a multimodal, document-capable model is required.

Now I have all the context I need. Your PDFs are in @my_docs stage; these are document
files, so I need a model that supports PDFs. Per the multimodal reference, the best
document models are: gemini-2.5-flash > gemini-3.1-pro > claude-sonnet-4-5.

The system and user prompts used in your Custom AI Function are fully transparent. At this stage, the prompt has not yet been evaluated or optimized against your test data. That evaluation occurs after the function is created.

As part of the creation workflow, AI Function Studio automatically generates and runs smoke tests to validate the function behavior. For example, smoke tests can automatically validate that the function returns outputs in the expected structure.
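
To make the idea of a structural smoke test concrete, here is a minimal Python sketch. It is not the check that AI Function Studio generates; it assumes the function returns a JSON object, and the key names `themes` and `summary` are hypothetical placeholders chosen for illustration.

```python
import json

# Hypothetical expected shape for a theme-summarization function's output.
# These key names are assumptions for illustration, not a documented schema.
EXPECTED_KEYS = {"themes": list, "summary": str}


def check_output_structure(raw_output: str) -> list[str]:
    """Return a list of problems found; an empty list means the check passed."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc.msg}"]

    if not isinstance(parsed, dict):
        return ["output is valid JSON but not an object"]

    problems = []
    for key, expected_type in EXPECTED_KEYS.items():
        if key not in parsed:
            problems.append(f"missing key: {key}")
        elif not isinstance(parsed[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    return problems


if __name__ == "__main__":
    good = '{"themes": ["energy", "housing"], "summary": "Two main themes."}'
    bad = '{"themes": "energy"}'
    print(check_output_structure(good))  # []
    print(check_output_structure(bad))   # one type problem, one missing key
```

A check like this only confirms the shape of the output. Whether the content is correct is the job of the evaluation step.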

[Figure: Snowsight worksheet showing the CREATE_AI_FUNCTION stored procedure call that registers the SUMMARIZE_PDF_THEMES function]

Once the function is registered, it can be used like any other Cortex AI Function!

[Figure: Snowsight worksheet showing a SELECT query calling the registered SUMMARIZE_PDF_THEMES function and its results table]

Custom AI Functions created using Cortex Code in Snowsight or via the Cortex Code CLI are listed on the Snowsight AI & ML » AI Functions page.

[Figure: Snowsight AI Functions page showing a list of Custom AI Functions]

Evaluate

Evaluation and optimization are optional steps in the AI Function Studio workflow. After the function is created, the Studio guides you through available next steps such as testing, evaluation, and optimization. These workflows can also be revisited later at any time.

[Figure: Cortex AI Function Studio Next step Questions form with Evaluate selected]

AI Function Studio benchmarks candidate function configurations against representative datasets to measure accuracy, consistency, and overall performance. Depending on the data available, there are three evaluation paths:

  • Labeled Dataset (Ground Truth): If you already have a dataset with known expected outputs, AI Function Studio uses it as the evaluation baseline to calculate accuracy and quality metrics.
  • Label Generation: If you have input data but no labeled outputs, AI Function Studio can automatically generate evaluation labels using a state-of-the-art reasoning model. By default, the system selects the most capable available model for label generation and can recommend alternatives when needed.
  • Synthetic Dataset Generation: If no evaluation dataset exists, AI Function Studio can generate synthetic evaluation data based on the task definition. The system creates representative examples and expected outputs to bootstrap benchmarking and optimization workflows.
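
The choice between the three paths above can be read as a simple decision on what data you already have. The following Python sketch encodes that reading; the enum and function names are hypothetical and are not part of any AI Function Studio API.

```python
from enum import Enum


class EvalPath(Enum):
    LABELED_DATASET = "labeled dataset (ground truth)"
    LABEL_GENERATION = "label generation"
    SYNTHETIC_DATASET = "synthetic dataset generation"


def choose_eval_path(has_inputs: bool, has_labels: bool) -> EvalPath:
    """Pick an evaluation path from the data that is already available."""
    if has_inputs and has_labels:
        # Inputs with known expected outputs: use them as the baseline.
        return EvalPath.LABELED_DATASET
    if has_inputs:
        # Inputs only: expected outputs have to be generated first.
        return EvalPath.LABEL_GENERATION
    # No usable inputs: both inputs and expected outputs are generated.
    return EvalPath.SYNTHETIC_DATASET


if __name__ == "__main__":
    print(choose_eval_path(has_inputs=True, has_labels=False).value)
```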

Once the evaluation dataset is prepared, AI Function Studio applies configurable evaluation metrics to compare candidate prompts, models, and function configurations. You can select the evaluation strategy and metrics that best align with your use case.

For example, AI Function Studio recommends using LLM as a judge for a document summarization task:

For context, since your function identifies themes from PDF documents (an open-ended task),
llm_judge is the recommended metric: it uses an LLM to judge whether the predicted themes
are correct/complete compared to the expected output. The other metrics (exact_match,
fuzzy_match, etc.) are better suited for classification or fixed-answer tasks.

[Figure: Cortex AI Function Studio Questions form showing evaluation metric options with llm_judge selected]

After the evaluation completes, AI Function Studio generates a detailed results summary, including insights into low-scoring records to support human-in-the-loop review and analysis.
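
As an illustration of how low-scoring records might be pulled out for human review, the sketch below filters per-record scores against a threshold. The record layout, the 0.0 to 1.0 score range, and the default threshold of 0.5 are all assumptions made for this example; the actual results format may differ.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    record_id: str
    score: float  # assumed to be in the range 0.0 to 1.0


def low_scoring(records: list[EvalRecord], threshold: float = 0.5) -> list[EvalRecord]:
    """Return records scoring below the threshold, worst first."""
    flagged = [r for r in records if r.score < threshold]
    return sorted(flagged, key=lambda r: r.score)


if __name__ == "__main__":
    results = [
        EvalRecord("doc_01", 0.92),
        EvalRecord("doc_02", 0.31),
        EvalRecord("doc_03", 0.48),
    ]
    for record in low_scoring(results):
        print(record.record_id, record.score)
```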

[Figure: Evaluation Results panel showing function name, metric, test size, score, and key findings]

Optimize

AI Function Studio includes a managed AI optimization engine that automatically improves function quality using advanced optimization techniques such as the Genetic-Pareto Algorithm. Rather than relying on manual prompt engineering, the optimizer systematically explores and evaluates alternative prompts, models, and workflow strategies to improve accuracy and overall performance.

Prompt iterations

AI Function Studio supports multiple optimization budgets that control how extensively the system searches for improvements to your AI Function. Higher budgets explore a broader range of prompt, model, and workflow variations to maximize quality:

| Budget | Best for | Optimization behavior | Example use cases |
| --- | --- | --- | --- |
| demo (2 iterations) | Quick validation and workflow previews | Performs a lightweight sanity check with minimal experimentation. Useful for validating the end-to-end workflow or previewing optimization behavior, but unlikely to uncover major improvements. | Demo environments, prototype validation, smoke testing, initial workflow verification |
| light (6 iterations) | Simple, well-defined tasks | Evaluates a focused set of prompt and function-body variations. Best when small refinements are likely to produce meaningful gains. | Sentiment classification, spam detection, language detection, yes/no validation, simple text categorization |
| medium (12 iterations) | Multi-step or nuanced tasks | Explores a broader range of optimization strategies, including alternative prompt structures and pre/post-processing approaches. Provides a balanced tradeoff between runtime, cost, and optimization quality. | Theme extraction from documents, named entity extraction, multi-label classification, structured Q&A, formatted summarization workflows |
| heavy (18 iterations) | Complex, high-value production workloads | Conducts a deeper search across the optimization space, including advanced prompt restructuring and workflow modifications. Best for maximizing quality in production-critical systems. | Legal contract analysis, medical record extraction, policy-based routing, multi-stage reasoning pipelines, context-aware PII redaction |

Model selection

You can choose how many of the models available in your account AI Function Studio should evaluate and optimize against your baseline function.

Selecting more models increases evaluation time and overall compute cost, since each prompt iteration is executed independently for every selected model. In addition, operating on multimodal files (including but not limited to PDF, MP3, or MP4 files) further increases evaluation time.

For example, using the medium optimization budget (~12 iterations) with 6 selected models results in each model processing approximately 6-7 evaluation records across all 12 optimization iterations. This enables the system to benchmark multiple prompt and workflow variations across different model families and cost/performance tiers.
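
To see how the model count multiplies the work, the sketch below counts iteration and model combinations using the iteration counts from the budget table. It only illustrates the scaling and is not a cost calculator: `records_per_pair` is a hypothetical input that the sketch does not derive, and actual record sampling is decided by AI Function Studio.

```python
# Iteration counts per budget, taken from the budget table above.
BUDGET_ITERATIONS = {"demo": 2, "light": 6, "medium": 12, "heavy": 18}


def iteration_model_pairs(budget: str, num_models: int) -> int:
    """Number of (iteration, model) combinations that would be executed."""
    if num_models < 1:
        raise ValueError("select at least one model")
    return BUDGET_ITERATIONS[budget] * num_models


def estimated_model_calls(budget: str, num_models: int, records_per_pair: int) -> int:
    """Rough estimate of model calls; records_per_pair is a hypothetical input."""
    return iteration_model_pairs(budget, num_models) * records_per_pair


if __name__ == "__main__":
    print(iteration_model_pairs("medium", 6))     # 72
    print(estimated_model_calls("medium", 6, 7))  # 504
```

Going from one model to six at the same budget multiplies the number of combinations by six, which is why adding models raises both runtime and cost.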

[Figure: Cortex AI Function Studio Models Questions form with six models selected across cost tiers]

After optimization completes, AI Function Studio generates a comparative analysis showing how different models and prompt optimization strategies performed against the evaluation dataset. This allows teams to quantify quality improvements, compare model tradeoffs, and identify the best-performing configuration for their specific workload.
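
One common way to reason about model tradeoffs is to keep only the configurations that no other configuration beats on both quality and cost. The Python sketch below shows that Pareto filter on made-up candidates. It is a generic illustration of the tradeoff comparison, not the optimizer that AI Function Studio runs, and the model names, scores, and relative costs are invented placeholders.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    model: str            # placeholder name, not a real model
    score: float          # evaluation score, higher is better
    relative_cost: float  # hypothetical cost unit, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both axes and better on one."""
    no_worse = a.score >= b.score and a.relative_cost <= b.relative_cost
    better = a.score > b.score or a.relative_cost < b.relative_cost
    return no_worse and better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


if __name__ == "__main__":
    candidates = [
        Candidate("model-a", 0.80, 1.0),
        Candidate("model-b", 0.70, 0.5),
        Candidate("model-c", 0.65, 0.9),  # worse and pricier than model-b
        Candidate("model-d", 0.90, 3.0),
    ]
    for c in pareto_front(candidates):
        print(c.model, c.score, c.relative_cost)
```

Here `model-c` drops out because `model-b` scores higher at lower cost. The remaining three each represent a different point on the quality and cost curve.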

In the example below, iterative optimization improved overall summarization quality across multiple models, with Gemini 2.5 Flash achieving the strongest performance.

[Figure: Bar chart comparing Seed and Optimized theme_overlap scores across four models]

AI Function Studio can also help explain why certain models outperform others for a specific workload. By analyzing evaluation results, model characteristics, and document patterns, the system provides actionable insights into quality differences across models.

Yes, there are several reasons gemini-2.5-flash outperforms the other models on your
German-language PDF magazines:

  • Massive context window (1M tokens): Your PDFs are 4-13 MB multi-page magazine
    documents. Gemini can process the entire document at once, while Claude models have a
    200K token context that may truncate longer documents, losing themes from later
    pages.

  • Strong multilingual capabilities: Gemini models have excellent German language
    understanding. Your fluter magazines are entirely in German, and Gemini handles
    reading German text in PDF layout (columns, sidebars, captions) very well.

  • Native document processing: Gemini's document handling (up to 1,000 pages for flash,
    3,000 for pro) is more mature for PDFs. It natively parses the visual layout, whereas
    Claude models may struggle with complex magazine layouts with mixed text, images, and
    sidebars.

  • Consistency: For the theme_overlap metric, what matters is producing the same themes
    across runs. Gemini-2.5-flash appears more deterministic in its theme extraction from
    these specific documents.

The reason gemini-3.1-pro scored lower (13.8%) despite also being Gemini is unclear. It
may be that the Pro model's larger capacity leads to more diverse/creative theme
interpretations that diverge more from the expected baseline.

Once an optimized configuration is selected, the AI Function can be deployed with a single click. As new models become available, teams can rerun optimization workflows independently without rebuilding the entire function pipeline end-to-end.

[Figure: Cortex AI Function Studio Next step form after optimization completes with Done selected]

Evaluation metrics guidelines

Choose the evaluation metric that best matches your task type:

| Category | Metric | Use cases |
| --- | --- | --- |
| Rule-based | Exact Match | Uses straightforward, case-insensitive string comparison to check whether an output exactly matches the expected result. This approach works best for strict classification tasks where precision matters. |
| Rule-based | Fuzzy Match | Relies on token-level similarity to compare outputs, making it tolerant of small spelling differences or minor character variations. It’s a good fit when approximate matches are acceptable. |
| Rule-based | Contains Match | Looks for the presence of a specific substring within the output, which makes it especially useful for tasks like information extraction or keyword detection. |
| Semantic | LLM-as-a-Judge | Uses a reference language model to evaluate whether two pieces of text are meaningfully equivalent. This allows for more nuanced scoring in complex tasks such as summarization, translation, or other open-ended generation. |
| Customized | Custom Metrics | Custom metrics automatically generated by the AI Function Studio’s agentic engine to align with the unique objectives and success criteria of your task. Ideal when standard approaches such as Exact Match or LLM-as-a-Judge are insufficient, enabling sophisticated, task-specific evaluation logic. |
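
The three rule-based metrics can be approximated in a few lines of Python, which may help when deciding which one fits a task. This is a sketch of the general techniques, not the Studio's implementation. In particular, the fuzzy match below uses character-level similarity from the standard library's `difflib` as a stand-in for token-level similarity, and the 0.8 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def exact_match(predicted: str, expected: str) -> bool:
    """Case-insensitive comparison after trimming surrounding whitespace."""
    return predicted.strip().lower() == expected.strip().lower()


def fuzzy_match(predicted: str, expected: str, threshold: float = 0.8) -> bool:
    """Approximate match; the threshold is an assumed value, tune per task."""
    ratio = SequenceMatcher(None, predicted.lower(), expected.lower()).ratio()
    return ratio >= threshold


def contains_match(predicted: str, expected: str) -> bool:
    """True if the expected text appears anywhere in the predicted output."""
    return expected.lower() in predicted.lower()


if __name__ == "__main__":
    print(exact_match("Spam ", "spam"))                        # True
    print(fuzzy_match("recieve payment", "receive payment"))   # True
    print(contains_match("Total due: 42 EUR", "42 eur"))       # True
    print(exact_match("recieve payment", "receive payment"))   # False
```

The last line shows why metric choice matters: a single transposed letter fails an exact match but passes a fuzzy one.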

Cost considerations

  • Development phase: Authoring, evaluation, and optimization are billed in two parts:

    • The tokens processed by the models used during the experimentation process.
    • Cortex Code usage.
  • Production phase: Once registered, a Custom AI Function is billed according to the underlying models it uses. There is no additional surcharge for the function abstraction itself.

To monitor and control costs, track the number of tokens consumed by your custom AI function with the following query:

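-- Token usage per custom AI function and metric type over the last 30 days.
SELECT
    m.value:key:CUSTOM_AI_FUNCTION_NAME::STRING AS func_name,
    m.value:key:metric::STRING AS metric_type,
    SUM(m.value:value::NUMBER) AS token_number
FROM SNOWFLAKE.ACCOUNT_USAGE.CORTEX_AI_FUNCTIONS_USAGE_HISTORY c,
    -- Flatten the METRICS column so each entry becomes its own row.
    LATERAL FLATTEN(input => c.METRICS) m
WHERE c.START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    -- Replace <my_ai_function_name> with all or part of your function's name.
    AND func_name ILIKE '%<my_ai_function_name>%'
GROUP BY 1, 2
ORDER BY func_name DESC;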