snowflake.snowpark.DataFrameAIFunctions.similarity
- DataFrameAIFunctions.similarity(input1: Union[snowflake.snowpark.column.Column, str], input2: Union[snowflake.snowpark.column.Column, str], *, output_column: Optional[str] = None, **kwargs) → snowflake.snowpark.DataFrame
Compute similarity scores between two columns using AI-powered embeddings.
This method computes a similarity score based on the vector cosine similarity of the inputs’ embedding vectors. Supports both text and image similarity.
- Parameters:
input1 – The first column (Column object or column name as string) for comparison. Can contain text strings or images (FILE data type).
input2 – The second column (Column object or column name as string) for comparison. Must be the same type as input1 (both text or both images).
output_column – The name of the output column to be appended. If not provided, a column named AI_SIMILARITY_OUTPUT is appended.
**kwargs – Configuration settings specified as key/value pairs. Supported keys:
model: The model used to generate the embeddings. For text input, defaults to ‘snowflake-arctic-embed-l-v2’. For image input, defaults to ‘voyage-multimodal-3’. Supported models include:
Text: ‘snowflake-arctic-embed-l-v2’, ‘nv-embed-qa-4’, ‘multilingual-e5-large’, ‘voyage-multilingual-2’, ‘snowflake-arctic-embed-m-v1.5’, ‘snowflake-arctic-embed-m’, ‘e5-base-v2’
Images: ‘voyage-multimodal-3’
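The defaults and supported model names listed above can be captured in a small client-side check. The following is only a sketch: `resolve_model`, `TEXT_MODELS`, `IMAGE_MODELS`, and `DEFAULT_MODEL` are hypothetical names that are not part of Snowpark, and the lists are copied from this page, so the service itself may accept a different set.

```python
from typing import Optional

# Model names copied from the parameter description above (assumption:
# this list may lag behind what the service actually supports).
TEXT_MODELS = (
    "snowflake-arctic-embed-l-v2",
    "nv-embed-qa-4",
    "multilingual-e5-large",
    "voyage-multilingual-2",
    "snowflake-arctic-embed-m-v1.5",
    "snowflake-arctic-embed-m",
    "e5-base-v2",
)
IMAGE_MODELS = ("voyage-multimodal-3",)

# Documented defaults per input type.
DEFAULT_MODEL = {
    "text": "snowflake-arctic-embed-l-v2",
    "image": "voyage-multimodal-3",
}


def resolve_model(input_kind: str, model: Optional[str] = None) -> str:
    """Hypothetical helper: return the model name to pass as `model=`.

    `input_kind` is "text" or "image". When `model` is None, the documented
    default for that input type is returned.
    """
    supported = {"text": TEXT_MODELS, "image": IMAGE_MODELS}
    if input_kind not in supported:
        raise ValueError(f"unknown input kind: {input_kind!r}")
    if model is None:
        return DEFAULT_MODEL[input_kind]
    if model not in supported[input_kind]:
        raise ValueError(f"{model!r} is not listed for {input_kind} input")
    return model
```

The result could then be passed through the keyword argument, for example `df.ai.similarity(input1="text1", input2="text2", model=resolve_model("text", "multilingual-e5-large"))`.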
- Returns:
A new DataFrame with an appended output column containing similarity scores. The scores range from -1 to 1, where higher values indicate greater similarity.
Examples:
>>> # Text similarity between two columns
>>> from snowflake.snowpark.functions import col
>>> df = session.create_dataframe(
...     [
...         ["I love programming", "I enjoy coding"],
...         ["The weather is nice", "It's raining heavily"],
...         ["Python is great", "Python is awesome"],
...     ],
...     schema=["text1", "text2"]
... )
>>> result_df = df.ai.similarity(
...     input1="text1",
...     input2="text2",
...     output_column="similarity_score"
... )
>>> result_df.columns
['TEXT1', 'TEXT2', 'SIMILARITY_SCORE']
>>> results = result_df.collect()
>>> results[0]["SIMILARITY_SCORE"] > 0.5  # Similar texts
True

>>> # Multilingual text similarity with custom model
>>> df = session.create_dataframe(
...     [
...         ["I love programming", "我喜欢编程"],  # Same meaning in English and Chinese
...         ["Good morning", "Buenas noches"],  # Different meanings
...     ],
...     schema=["english", "other_language"]
... )
>>> result_df = df.ai.similarity(
...     input1=col("english"),
...     input2=col("other_language"),
...     output_column="cross_lingual_similarity",
...     model="multilingual-e5-large"
... )
>>> result_df.columns
['ENGLISH', 'OTHER_LANGUAGE', 'CROSS_LINGUAL_SIMILARITY']

>>> # Image similarity
>>> from snowflake.snowpark.functions import to_file
>>> # Upload images to a stage first
>>> _ = session.sql("CREATE OR REPLACE TEMP STAGE mystage ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')").collect()
>>> _ = session.file.put("tests/resources/dog.jpg", "@mystage", auto_compress=False)
>>> _ = session.file.put("tests/resources/cat.jpeg", "@mystage", auto_compress=False)
>>> _ = session.file.put("tests/resources/kitchen.png", "@mystage", auto_compress=False)
>>> # Create DataFrame with image pairs
>>> df = session.create_dataframe(
...     [
...         ["@mystage/dog.jpg", "@mystage/cat.jpeg"],  # Animal comparison
...         ["@mystage/dog.jpg", "@mystage/kitchen.png"],  # Animal vs non-animal
...     ],
...     schema=["image1", "image2"]
... )
>>> result_df = df.ai.similarity(
...     input1=to_file(col("image1")),
...     input2=to_file(col("image2")),
...     output_column="visual_similarity"
... )
>>> result_df.columns
['IMAGE1', 'IMAGE2', 'VISUAL_SIMILARITY']
>>> results = result_df.collect()
>>> # Dog and cat (both animals) should be more similar than dog and kitchen
>>> results[0]["VISUAL_SIMILARITY"] > results[1]["VISUAL_SIMILARITY"]
True
Note
- Both inputs must be of the same type (both text or both images).
- AI_SIMILARITY does not support computing similarity between text and image inputs.
- Similarity scores range from -1 to 1, where:
  - 1 indicates identical or very similar content
  - 0 indicates no similarity
  - -1 indicates opposite or very dissimilar content
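To make the score range above concrete, here is a minimal sketch of cosine similarity on plain Python lists. It is not how Snowflake computes the score internally. The function name `cosine_similarity` is hypothetical, and the three-dimensional vectors are invented stand-ins for real embeddings, which have many more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Invented vectors standing in for embeddings (assumption, for illustration only).
same = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])          # close to 1
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])    # exactly 0
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])   # close to -1
```

Because of floating-point rounding, the first and last results may differ from 1 and -1 in the final digits, so comparisons against these values are safer with a tolerance such as `math.isclose`.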
This function or method is experimental since 1.39.0.