Categories: String & binary functions (AI Functions)
AI_MULTI_EMBED¶
Note
Video semantic search with AI_MULTI_EMBED is generally available to a limited number of customers.
Creates multimodal embeddings from text, an image, an audio file, or a video file. Embeddings are abstract numerical representations of the features of an input that can be used to determine the degree of similarity between inputs across modalities. Use AI_MULTI_EMBED for semantic search, scene retrieval, content indexing, similarity search, and other tasks that combine visual, audio, and text understanding.
Unlike AI_EMBED, which returns a single embedding vector, AI_MULTI_EMBED returns one or more embedding vectors together with metadata that describes each segment. For video and audio inputs, the function generates separate embeddings for different modalities (visual, audio, transcription) and for different segments of the asset.
The following table shows the regions where you can use the AI_MULTI_EMBED function:
| Data type | AWS US West 2 (Oregon) | AWS US East 1 (N. Virginia) | AWS Europe Central 1 (Frankfurt) | AWS Europe West 1 (Ireland) | AWS AP Southeast 2 (Sydney) | AWS AP Northeast 1 (Tokyo) | Azure East US 2 (Virginia) | Azure West Europe (Netherlands) | AWS (Cross-Region) |
|---|---|---|---|---|---|---|---|---|---|
| Text | | ✔ | | | | | | | ✔ |
| Image | | ✔ | | | | | | | ✔ |
| Audio | | ✔ | | | | | | | ✔ |
| Video | | ✔ | | | | | | | ✔ |
Syntax¶
AI_MULTI_EMBED( <model> , <input> [ , <options> ] )
Arguments¶
Required:
model
A string specifying the multimodal embedding model to use. You can provide the following value:
twelvelabs-marengo-embed-3-0
Supported models might have different costs.
input
The content to generate embeddings from. Can be one of the following:
- A string containing the text to embed.
- A FILE object (created with TO_FILE) referencing an image, audio, or video file in a Snowflake stage.
The twelvelabs-marengo-embed-3-0 model supports embedding generation for transcribed speech and text queries in 36 languages. For the full list, see the TwelveLabs Marengo documentation.
Supported file formats:
- Video: .mp4, .mov, .avi, .mkv, .webm, .flv
- Audio: .mp3, .wav, .flac, .ogg
- Image: .jpg, .jpeg, .png
For the maximum supported duration and file size, see Limitations.
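As a rough illustration of the supported-format list above, the following Python sketch maps a staged file name to the kind of input it would be treated as. The helper name `modality_for` is hypothetical and not part of Snowflake, and matching extensions case-insensitively is an assumption; the source does not say how extension case is handled.

```python
from pathlib import PurePosixPath
from typing import Optional

# Extensions copied from the supported file formats listed above.
SUPPORTED_EXTENSIONS = {
    "video": {".mp4", ".mov", ".avi", ".mkv", ".webm", ".flv"},
    "audio": {".mp3", ".wav", ".flac", ".ogg"},
    "image": {".jpg", ".jpeg", ".png"},
}


def modality_for(path: str) -> Optional[str]:
    """Return 'video', 'audio', or 'image' for a supported file name, else None.

    Hypothetical helper. Lower-casing the extension is an assumption.
    """
    suffix = PurePosixPath(path).suffix.lower()
    for modality, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return modality
    return None
```

A check like this could run before staging files, so that unsupported formats are filtered out before the function is called.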
Optional:
options
An object containing zero or more of the following keys. The options apply only to video and audio inputs; they’re ignored for text and image inputs.
- start_sec: A number that specifies the time point in seconds where embedding generation should begin. Minimum value: 0. Default: 0.
- end_sec: A number that specifies the time point in seconds where embedding generation should end. Must be greater than start_sec plus the segment length, and no greater than the duration of the media. Default: Duration of the media.
- embedding_options: An array specifying which modalities to embed. Valid array members:
  - visual: Visual embeddings from the video.
  - audio: Embeddings of the audio track.
  - transcription: Embeddings of the transcribed speech in the audio track.
  Defaults:
  - For video: ['visual', 'audio', 'transcription']
  - For audio: ['audio', 'transcription']
- embedding_scope: An array specifying the scope of the returned embeddings. Provide one or both of the following values:
  - clip: Returns one embedding per segment (clip) of the asset.
  - asset: Returns a single embedding that represents the entire asset.
  Default: ['clip']
- embedding_type: An array specifying how embeddings are aggregated across modalities. Valid array members:
  - separate_embedding: Returns a separate embedding for each modality in embedding_options.
  - fused_embedding: Returns a weighted fusion of the modalities in embedding_options. For video, fused_embedding requires at least two modalities in embedding_options. For audio, fused_embedding requires both audio and transcription in embedding_options.
  Default: ['separate_embedding']
- use_fixed_length_sec: An integer that specifies the duration in seconds of each segment when using fixed-length segmentation. Range: 1 to 10.
  This option is mutually exclusive with min_clip_sec. If neither is provided, video uses dynamic (shot-boundary) segmentation and audio uses fixed-length segmentation of approximately 10 seconds.
- min_clip_sec: An integer that specifies the minimum duration in seconds of each segment when using dynamic (shot-boundary) segmentation. Range: 1 to 5. Default: 4.
  This option is mutually exclusive with use_fixed_length_sec. Dynamic segmentation applies only to video.
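The constraints on the options object can be summarized as a small client-side check. The following Python sketch is illustrative only: `check_options` is a hypothetical helper, not a Snowflake API, and it covers just the rules stated above. Treating the supplied use_fixed_length_sec or min_clip_sec value as "the segment length" in the end_sec rule is an assumption, and the upper bound on end_sec (the media duration) isn't checked because the duration isn't known client-side.

```python
VALID_MODALITIES = {"visual", "audio", "transcription"}
VALID_SCOPES = {"clip", "asset"}
VALID_TYPES = {"separate_embedding", "fused_embedding"}


def check_options(options, media):
    """Return a list of rule violations (empty when none are found).

    Hypothetical client-side helper; media is 'video' or 'audio'.
    """
    problems = []

    # Fall back to the documented defaults when a key is absent.
    default_modalities = (
        ["visual", "audio", "transcription"] if media == "video"
        else ["audio", "transcription"]
    )
    modalities = set(options.get("embedding_options", default_modalities))
    scopes = set(options.get("embedding_scope", ["clip"]))
    types = set(options.get("embedding_type", ["separate_embedding"]))

    if not modalities <= VALID_MODALITIES:
        problems.append("embedding_options contains an unknown modality")
    if not scopes <= VALID_SCOPES:
        problems.append("embedding_scope contains an unknown scope")
    if not types <= VALID_TYPES:
        problems.append("embedding_type contains an unknown type")

    if "fused_embedding" in types:
        if media == "video" and len(modalities) < 2:
            problems.append("fused_embedding needs at least two modalities for video")
        if media == "audio" and not {"audio", "transcription"} <= modalities:
            problems.append("fused_embedding needs audio and transcription for audio")

    fixed = options.get("use_fixed_length_sec")
    min_clip = options.get("min_clip_sec")
    if fixed is not None and min_clip is not None:
        problems.append("use_fixed_length_sec and min_clip_sec are mutually exclusive")
    if fixed is not None and not 1 <= fixed <= 10:
        problems.append("use_fixed_length_sec must be between 1 and 10")
    if min_clip is not None and not 1 <= min_clip <= 5:
        problems.append("min_clip_sec must be between 1 and 5")
    if min_clip is not None and media != "video":
        problems.append("min_clip_sec (dynamic segmentation) applies only to video")

    start = options.get("start_sec", 0)
    end = options.get("end_sec")
    if start < 0:
        problems.append("start_sec must be at least 0")
    if end is not None:
        # Assumption: whichever segment setting was supplied is the segment length.
        segment = fixed or min_clip or 0
        if end <= start + segment:
            problems.append("end_sec must exceed start_sec plus the segment length")

    return problems
```

For example, `check_options({"use_fixed_length_sec": 5, "min_clip_sec": 4}, "video")` reports the mutual-exclusivity violation, while an empty options object passes because every key falls back to its default.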
Returns¶
Returns an OBJECT with the following fields:
- error: A string describing any error that occurred, or NULL if the call succeeded.
- value: An array of objects, one for each embedding. Each object contains the following fields:
  - embedding: An array of floats representing the embedding vector. The Marengo Embed 3.0 model returns 512-dimensional vectors. Cast this value to VECTOR(FLOAT, 512) before using it with the vector similarity functions.
  - embedding_option: The modality of the embedding. One of visual, audio, transcription, or fused. Not present for text or image inputs.
  - embedding_scope: The scope of the embedding. One of clip or asset. Not present for text or image inputs.
  - start_sec: The starting timestamp of the segment in seconds. Not present for text or image inputs.
  - end_sec: The ending timestamp of the segment in seconds. Not present for text or image inputs.
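To make the shape of the returned object concrete, the following Python sketch walks a result that has already been fetched into a client as a dictionary. The sample data is invented for illustration: real vectors have 512 floats, not 3, and the values are not actual model output. The helper `group_by_modality` and the fallback label `unsegmented` are hypothetical names.

```python
from collections import defaultdict

# Invented sample shaped like the fields described above.
sample_result = {
    "error": None,
    "value": [
        {"embedding": [0.1, 0.2, 0.3], "embedding_option": "visual",
         "embedding_scope": "clip", "start_sec": 0.0, "end_sec": 4.0},
        {"embedding": [0.3, 0.1, 0.2], "embedding_option": "audio",
         "embedding_scope": "clip", "start_sec": 0.0, "end_sec": 4.0},
        {"embedding": [0.2, 0.2, 0.1], "embedding_option": "visual",
         "embedding_scope": "clip", "start_sec": 4.0, "end_sec": 9.5},
    ],
}


def group_by_modality(result):
    """Group returned embeddings by modality; raise if the call reported an error."""
    if result.get("error") is not None:
        raise RuntimeError(result["error"])
    grouped = defaultdict(list)
    for item in result["value"]:
        # Text and image inputs omit embedding_option, so use a placeholder label.
        modality = item.get("embedding_option", "unsegmented")
        grouped[modality].append(
            (item.get("start_sec"), item.get("end_sec"), item["embedding"])
        )
    return dict(grouped)


segments = group_by_modality(sample_result)
```

Checking `error` before reading `value` mirrors the contract above: a non-NULL error means the call did not succeed.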
Access control requirements¶
You must use a role that has been granted the SNOWFLAKE.CORTEX_USER database role or the SNOWFLAKE.CORTEX_EMBED_USER database role to call this function. See Cortex LLM privileges for more information on granting one of these privileges.
Examples¶
Embed a text query¶
Generate an embedding for a text query. Use this pattern to embed a search phrase so you can compare it against embeddings stored for video or audio assets.
Embed an image¶
Generate an embedding for a staged image:
Embed a video¶
Generate multimodal embeddings for a staged video. With the default options, the model returns visual, audio, and transcription embeddings for each detected segment.
Example response:
Embed an audio file¶
Generate embeddings for a staged audio file:
Embed a video segment with custom options¶
Restrict embedding to the first 30 seconds of a video, request both clip-level and asset-level embeddings, and use fixed 5-second segments:
Embed a video with dynamic segmentation¶
Use dynamic (shot-boundary) segmentation with a 4-second minimum segment length:
Search video segments with a text query¶
Use AI_MULTI_EMBED to build a video search workflow. The following query flattens the embeddings returned for each video, stores one row per clip and modality, and then ranks clips by cosine similarity to a text query. For the full walkthrough, including how to load the source videos, see Cortex AI Functions: Multimodal.
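The ranking step of this workflow can be illustrated outside Snowflake with a minimal Python sketch. It assumes the clip embeddings and the query embedding have already been retrieved into the client. The three-dimensional vectors are invented stand-ins for the 512-dimensional vectors the model returns, and `rank_clips` is a hypothetical helper. Inside Snowflake, the same comparison would typically use a vector similarity function such as VECTOR_COSINE_SIMILARITY rather than client-side code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_clips(query_embedding, clips, top_k=3):
    """Return the top_k (score, clip) pairs, most similar first."""
    scored = [(cosine_similarity(query_embedding, c["embedding"]), c) for c in clips]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Invented clip rows: one per segment and modality, as in the flattened table.
clips = [
    {"start_sec": 0, "end_sec": 4, "embedding_option": "visual", "embedding": [1.0, 0.0, 0.0]},
    {"start_sec": 4, "end_sec": 9, "embedding_option": "visual", "embedding": [0.0, 1.0, 0.0]},
    {"start_sec": 9, "end_sec": 14, "embedding_option": "visual", "embedding": [1.0, 1.0, 0.0]},
]
query = [1.0, 0.0, 0.0]
top = rank_clips(query, clips, top_k=2)
```

Because text and media embeddings come from the same model, a text query vector can be compared directly against clip vectors of any modality.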
Limitations¶
- The twelvelabs-marengo-embed-3-0 model has the following input limits:
  - Maximum video or audio duration: 4 hours
  - Maximum file size (default): 19 MB
  - Maximum file size (allow-listed accounts using S3-staged input): 400 MB
  - Maximum text length: 500 tokens
  - Maximum image size: 3.75 MB
- The model is available in AWS US East 1 (N. Virginia) only. Cross-region inference is supported.
- The stage that contains the input files must use server-side encryption. Snowflake AI functions don’t work on FILE objects created from files in the following kinds of stages:
  - Internal stages with encryption mode TYPE = 'SNOWFLAKE_FULL'
  - External stages with any customer-side encrypted mode, such as AWS_CSE or AZURE_CSE
  - User stage
  - Table stage
  - Stage with double-quoted names
- Processing files from stages is currently incompatible with custom network policies.
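The input limits listed above lend themselves to a pre-flight check before files are staged. The following Python sketch is a hypothetical helper, not a Snowflake API. It assumes decimal megabytes (1 MB = 1,000,000 bytes), which the source does not specify, and it assumes images are subject only to the image size limit. The 500-token text limit is omitted because the tokenization isn't described here.

```python
MB = 1_000_000  # Assumption: decimal megabytes; the source doesn't define the unit.

MAX_DURATION_SEC = 4 * 60 * 60          # 4 hours
MAX_FILE_BYTES_DEFAULT = 19 * MB        # default file size limit
MAX_FILE_BYTES_ALLOW_LISTED = 400 * MB  # allow-listed accounts, S3-staged input
MAX_IMAGE_BYTES = 3.75 * MB


def limit_violations(modality, size_bytes, duration_sec=None, allow_listed=False):
    """Return a list of exceeded limits for a 'video', 'audio', or 'image' file."""
    problems = []
    if modality == "image":
        if size_bytes > MAX_IMAGE_BYTES:
            problems.append("image exceeds 3.75 MB")
    else:
        cap = MAX_FILE_BYTES_ALLOW_LISTED if allow_listed else MAX_FILE_BYTES_DEFAULT
        if size_bytes > cap:
            problems.append("file exceeds the size limit")
        # Duration is optional because it may not be known before processing.
        if duration_sec is not None and duration_sec > MAX_DURATION_SEC:
            problems.append("media exceeds 4 hours")
    return problems
```

A 25 MB video, for instance, fails the default size check but passes when `allow_listed=True`.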
Legal notices¶
Refer to Snowflake AI and ML.