Cortex AI Functions: Audio¶
Cortex AI Audio provides advanced LLM-powered audio processing capabilities, including:
Transcription: Convert spoken language to text.
Speaker identification: Determine who is speaking in each part of a multi-speaker audio file.
Timestamp extraction: Identify the timestamp of each spoken word.
These capabilities are available through the AI_TRANSCRIBE function. Because AI_TRANSCRIBE is managed and hosted inside Snowflake, you can easily integrate audio processing into your data workflows without onerous setup or infrastructure management.
Note
The AI_TRANSCRIBE function also processes audio tracks in video files.
AI_TRANSCRIBE¶
AI_TRANSCRIBE is a fully managed SQL function that transcribes audio and video files stored in a stage, extracting text, timestamps, and speaker information. See Create stage for media files for information on creating a stage suitable for storing files for processing by AI_TRANSCRIBE.
Under the hood, AI_TRANSCRIBE orchestrates optimized AI models for transcription and speaker diarization, handling audio files of up to two hours in length. AI_TRANSCRIBE is horizontally scalable, processing multiple files concurrently for efficient batch workloads. Audio can be processed directly from object storage to avoid unnecessary data movement.
By default, AI_TRANSCRIBE converts audio files to clean, readable text. You can also specify a timestamp granularity to extract timestamps for each word or each change of speaker. Word-level timestamps are useful for applications such as subtitles, or for letting the user jump to specific parts of the audio by clicking words in the transcript. Speaker-level timestamps are useful for understanding who said what in meetings, interviews, or phone calls.
| Timestamp granularity mode | Result |
|---|---|
| Default | Transcription of entire audio file in one piece |
| Word | Transcription with timestamps for each word |
| Speaker | Indicates who is speaking, and a timestamp, at each change of speaker |
Supported languages¶
AI_TRANSCRIBE supports the following languages, which are automatically detected. Files can contain multiple supported languages.
Note
Language detection requires audio to begin within the first five seconds of the file. For best results, trim excess silence before uploading.
Arabic
Bulgarian
Cantonese
Catalan
Chinese
Czech
Dutch
English
French
German
Greek
Hebrew
Hungarian
Indonesian
Italian
Japanese
Korean
Latvian
Norwegian
Polish
Portuguese
Romanian
Russian
Serbian
Slovenian
Spanish
Swedish
Thai
Turkish
Ukrainian
Supported media formats¶
AI_TRANSCRIBE supports the following audio and video file formats:
| Media type | Supported formats |
|---|---|
| Audio | FLAC, MP3, MP4, OGG, WAV, WEBM |
| Video | MKV, MP4, OGV, WEBM |
Video files must contain at least one audio track in FLAC, MP3, OPUS, VORBIS, or WAV format.
Examples¶
Text transcription¶
The following example transcribes an audio file stored in the
financial_consultation stage, returning a text transcript of the entire file. The
TO_FILE function converts the staged file to a file reference.
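A query for this example might look like the following sketch. The `financial_consultation` stage name comes from the example; the file name `consultation_call.mp3` is a placeholder.

```sql
SELECT AI_TRANSCRIBE(
    TO_FILE('@financial_consultation', 'consultation_call.mp3'));
```

With no options object, the function returns the transcript of the entire file as a single block of text.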
Response:
Word-level segmentation with timestamps¶
Set the timestamp granularity to “word” to extract precise timestamps for every word spoken, enabling searchable, navigable transcripts.
Note that this audio file is in Spanish.
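A sketch of such a query follows; the stage and file names are hypothetical, and the options object requests word-level granularity.

```sql
SELECT AI_TRANSCRIBE(
    TO_FILE('@llamadas_stage', 'consulta_cliente.mp3'),
    {'timestamp_granularity': 'word'});
```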
Response:
Note
The output is truncated for brevity. The full output contains a segment for each word spoken in the audio file.
Speaker recognition¶
Set timestamp granularity to “speaker” to detect, separate, and identify unique speakers in conversations or meetings.
This example uses an audio file with two speakers,
one speaking English and the other Spanish.
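A sketch of the query, with hypothetical stage and file names:

```sql
SELECT AI_TRANSCRIBE(
    TO_FILE('@bilingual_calls', 'two_speaker_call.mp3'),
    {'timestamp_granularity': 'speaker'});
```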
Response:
Note
The output is truncated for brevity. The full output contains a segment for each conversational “turn” in the audio file.
Use with other AI Functions¶
Call transcript analysis¶
You can pass the output of AI_TRANSCRIBE to other AI Functions for further processing. For example, you can use
AI_SUMMARIZE to summarize the transcription, or AI_CLASSIFY to classify the content of the transcription. This example
uses AI_SENTIMENT and AI_COMPLETE to analyze the text transcribed from
customer call audio and provide sentiment on four dimensions
and an assessment of the agent.
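The pipeline described above could be sketched as follows. The stage, file name, sentiment categories, model name, and prompt are illustrative, not part of the original example.

```sql
-- Transcribe once, then feed the text to two downstream AI functions
WITH transcript AS (
    SELECT AI_TRANSCRIBE(
        TO_FILE('@customer_calls', 'support_call.mp3')) AS result
)
SELECT
    -- Sentiment on four illustrative dimensions
    AI_SENTIMENT(result:text::TEXT,
        ['product quality', 'wait time', 'agent helpfulness', 'issue resolution'])
        AS sentiment,
    -- Free-form assessment of the agent
    AI_COMPLETE('claude-3-5-sonnet',
        'Assess how well the agent handled this customer call: '
        || result:text::TEXT) AS agent_assessment
FROM transcript;
```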
Note
AI_SENTIMENT analyzes only text and does not consider speech characteristics like tone of voice.
AI_SENTIMENT response:
AI_COMPLETE response:
Video transcript analysis¶
The following example transcribes a video file stored in the podcast_videos_S3 stage.
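A sketch of the query; the file name is a placeholder. AI_TRANSCRIBE automatically uses the video's audio track.

```sql
SELECT AI_TRANSCRIBE(
    TO_FILE('@podcast_videos_S3', 'episode.mp4'));
```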
Response:
Once you have the transcript, you can use AI_COMPLETE to perform additional analysis. This example identifies retail brands mentioned in the conversation for use in advertising or sponsorship analytics.
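One way to sketch this brand-identification step, with a hypothetical file name, model choice, and prompt:

```sql
WITH transcript AS (
    SELECT AI_TRANSCRIBE(
        TO_FILE('@podcast_videos_S3', 'episode.mp4')) AS result
)
SELECT AI_COMPLETE('claude-3-5-sonnet',
    'List the retail brands mentioned in this transcript, if any: '
    || result:text::TEXT) AS brands_mentioned
FROM transcript;
```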
Response:
Cost considerations¶
Billing for all AI Functions is based on token consumption. For transcription, each second of audio processed is 50 tokens, regardless of language or segmentation method. A full hour of audio is therefore 180,000 tokens. Assuming that processing a million tokens costs 1.3 credits, and that Snowflake credits cost US $3 each, each hour of audio processed costs about US $0.702. This estimate is subject to change. For current pricing information, see the Snowflake Service Consumption Table.
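The arithmetic above can be checked directly in SQL; the rate constants are the assumed values from the text, not guaranteed current pricing.

```sql
SELECT
    3600 * 50                 AS tokens_per_hour,   -- 180,000 tokens
    3600 * 50 / 1e6 * 1.3     AS credits_per_hour,  -- 0.234 credits
    3600 * 50 / 1e6 * 1.3 * 3 AS usd_per_hour;      -- about US $0.702
```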
Note
AI_TRANSCRIBE has a minimum billing duration of 1 minute. Files shorter than 1 minute are still processed, but are billed at 1 minute. To efficiently process large numbers of short audio files, consider batching them into a single file and using timestamps to identify the start and end of each original file in the resulting transcription.
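As a sketch of why batching helps, compare 100 ten-second clips billed individually (each rounded up to the one-minute minimum) with the same audio concatenated into one file, assuming the 50 tokens-per-second rate stated above:

```sql
SELECT
    100 * GREATEST(60, 10) * 50 AS tokens_individual,  -- 100 x 60 s minimum = 300,000 tokens
    GREATEST(60, 100 * 10) * 50 AS tokens_batched;     -- one 1,000 s file   =  50,000 tokens
```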