snowflake.snowpark.DataFrameAIFunctions.embed

DataFrameAIFunctions.embed(input_column: Union[snowflake.snowpark.column.Column, str], model: str, *, output_column: Optional[str] = None) → snowflake.snowpark.DataFrame

Generate embedding vectors from text or images.

This method creates dense vector representations (embeddings) of text or images, which can be used for similarity search, clustering, or as features for machine learning.

Parameters:

input_column – The column (Column object or column name as string) containing the text or images (FILE data type) to embed.
model – The embedding model to use. Supported models:
For text embeddings:
- snowflake-arctic-embed-l-v2.0: Arctic large model
- snowflake-arctic-embed-l-v2.0-8k: Arctic large model with 8K context
- nv-embed-qa-4: NVIDIA embedding model for Q&A
- multilingual-e5-large: Multilingual embedding model
- voyage-multilingual-2: Voyage multilingual model
For image embeddings:
- voyage-multimodal-3: Voyage multimodal model (only for images)
output_column – The name of the output column to be appended. If not provided, a column named AI_EMBED_OUTPUT is appended.
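
Because model is a required argument and the text and image models are listed separately, it can help to check the pairing before submitting a query. The following is a minimal sketch, not part of Snowpark: SUPPORTED_EMBED_MODELS and check_embed_model are hypothetical names, and the model names are copied from the list above, which may change between releases.

```python
# Hypothetical helper, not part of the Snowpark API. The model names are
# copied from the parameter list above; the grouping is illustrative only.
SUPPORTED_EMBED_MODELS = {
    "text": {
        "snowflake-arctic-embed-l-v2.0",
        "snowflake-arctic-embed-l-v2.0-8k",
        "nv-embed-qa-4",
        "multilingual-e5-large",
        "voyage-multilingual-2",
    },
    "image": {"voyage-multimodal-3"},
}


def check_embed_model(model: str, input_kind: str) -> str:
    """Return `model` unchanged if it is listed for `input_kind`, else raise ValueError."""
    if input_kind not in SUPPORTED_EMBED_MODELS:
        raise ValueError(f"input_kind must be 'text' or 'image', got {input_kind!r}")
    if model not in SUPPORTED_EMBED_MODELS[input_kind]:
        allowed = ", ".join(sorted(SUPPORTED_EMBED_MODELS[input_kind]))
        raise ValueError(
            f"{model!r} is not listed for {input_kind} input; listed models: {allowed}"
        )
    return model
```

The returned name could then be passed as the model argument, for example model=check_embed_model("multilingual-e5-large", "text").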

Returns:

A new DataFrame with an appended output column containing VECTOR embeddings.

Examples:

>>> # Text embeddings with the Arctic large model
>>> df = session.create_dataframe([
...     ["Machine learning is fascinating"],
...     ["Snowflake provides a cloud data platform"],
...     ["Python is a versatile programming language"],
... ], schema=["text"])
>>> result_df = df.ai.embed(
...     input_column="text",
...     model="snowflake-arctic-embed-l-v2.0",
...     output_column="text_vector"
... )
>>> results = result_df.collect()
>>> # Verify we got embeddings
>>> all(len(row["TEXT_VECTOR"]) > 0 for row in results)
True

>>> # Multilingual text embeddings
>>> from snowflake.snowpark.functions import col
>>> df = session.create_dataframe([
...     ["Hello world"],
...     ["Bonjour le monde"],
...     ["Hola mundo"],
...     ["你好世界"],
... ], schema=["greeting"])
>>> result_df = df.ai.embed(
...     input_column=col("greeting"),
...     model="multilingual-e5-large",
...     output_column="multilingual_vector"
... )
>>> results = result_df.collect()
>>> # All greetings should have embeddings
>>> all(len(row["MULTILINGUAL_VECTOR"]) > 0 for row in results)
True

>>> # Image embeddings
>>> # Upload images to a stage first
>>> _ = session.sql("CREATE OR REPLACE TEMP STAGE mystage ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')").collect()
>>> _ = session.file.put("tests/resources/dog.jpg", "@mystage", auto_compress=False)
>>> _ = session.file.put("tests/resources/cat.jpeg", "@mystage", auto_compress=False)
>>> df = session.read.file("@mystage")
>>> result_df = df.ai.embed(
...     input_column="file",
...     model="voyage-multimodal-3",
...     output_column="image_vector"
... )
>>> results = result_df.collect()
>>> # Both images should have embeddings
>>> all(len(row["IMAGE_VECTOR"]) > 0 for row in results)
True

Note

- Embeddings can be used with vector similarity functions to find similar items.
- Different models produce embeddings of different dimensions.
- For best results, use the same model for all items you want to compare.
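
To illustrate the notes above, here is a minimal client-side sketch of comparing embeddings after they have been collected. It is plain Python rather than a Snowflake vector function, and cosine_similarity, most_similar, and the toy three-dimensional vectors are illustrative assumptions, not real model output. The length check reflects the point that vectors from different models can have different dimensions and should not be compared.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        # Different lengths usually mean the vectors came from different models.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(query: Sequence[float], candidates: Sequence[Sequence[float]]) -> int:
    """Index of the candidate with the highest cosine similarity to `query`."""
    return max(
        range(len(candidates)),
        key=lambda i: cosine_similarity(query, candidates[i]),
    )


# Toy vectors standing in for values such as row["TEXT_VECTOR"] after collect().
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
closest = most_similar(toy_vectors[0], toy_vectors[1:])
```

For large tables, computing similarity inside Snowflake avoids pulling every vector to the client; this sketch only shows the comparison itself.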

This method has been experimental since 1.39.0.