- Categories:
File functions (AI Functions)
AI_TRANSCRIBE¶
AI_TRANSCRIBE is a fully managed SQL function that transcribes audio and video files stored in a stage, extracting text, timestamps, and speaker information. See Create stage for media files for information on creating a stage suitable for storing files for processing by AI_TRANSCRIBE.
Under the hood, AI_TRANSCRIBE orchestrates optimized AI models for transcription and speaker diarization, processing audio files of up to two hours in length. AI_TRANSCRIBE scales horizontally, processing multiple files concurrently for efficient batch workloads. Audio can be processed directly from object storage to avoid unnecessary data movement.
By default, AI_TRANSCRIBE converts audio files to clean, readable text. You can also specify a timestamp granularity to extract timestamps for each word or change of speaker. Word-level timestamps are useful for applications such as subtitles or for letting the user jump to specific parts of the audio by clicking words in the transcript. Speaker-level timestamps are useful for understanding who said what in meetings, interviews, or phone calls.
| Timestamp granularity mode | Result |
|---|---|
| Default | Transcription of entire audio file in one piece |
| Word | Transcription with timestamps for each word |
| Speaker | Indicates who is speaking, and a timestamp, at each change of speaker |
Syntax¶
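A sketch of the call form, reconstructed from the arguments listed below:

```
AI_TRANSCRIBE( <audio_file> [ , <options> ] )
```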
Arguments¶
Required:
`audio_file`
A FILE type object representing an audio file. Use the TO_FILE function to create a reference to your staged file.
Optional:
`options`
An OBJECT value containing zero or more of the following fields:
- `timestamp_granularity`: A string specifying the desired timestamp granularity. Possible values are:
  - `"word"`: The file is transcribed as a series of words, each with its own timestamp.
  - `"speaker"`: The file is transcribed as a series of conversational “turns,” each with its own timestamp and speaker label.

  If this field is not specified, the entire file is transcribed as a single segment without timestamps.
`return_error_details`
A BOOLEAN flag that indicates whether to return error details when an error occurs. When set to TRUE, the function returns an OBJECT containing the value and the error message, one of which is NULL depending on whether the function succeeded or failed. See Error behavior for details.
Returns¶
A string containing a JSON representation of the transcription result. The JSON object contains the following fields:
"audio_duration": The total duration of the audio file in seconds."text": The transcription of the complete audio file, provided when thetimestamp_granularityfield is not specified."segments": An array of segments, provided when thetimestamp_granularityfield is set to"word"or"speaker". Each segment is a JSON object containing the following fields:"start": The start time of the segment in seconds."end": The end time of the segment in seconds."text": The transcription text for the segment."speaker_label": The label of the speaker for the segment, provided when thetimestamp_granularityfield is set tospeaker. Labels are of the form “SPEAKER_00”, “SPEAKER_01”, etc. and are assigned in the order speakers are detected in the audio file.
Error behavior¶
By default, if AI_TRANSCRIBE can’t process the input, the function returns NULL. If the query processes multiple rows, rows with errors return NULL and don’t prevent the query from completing.
The return value on error depends on the `return_error_details` argument:

| `return_error_details` | Return value |
|---|---|
| FALSE or not passed | NULL |
| TRUE | OBJECT with `value` and `error` fields. `value` is a VARCHAR containing the transcription result, or NULL if an error occurred; `error` is a VARCHAR containing the error message if an error occurred, or NULL if the function succeeded. |
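The two fields of the returned OBJECT can be pulled apart with Snowflake's standard semi-structured path syntax. A minimal sketch, assuming results have been stored in a hypothetical `transcriptions` table with a `result` column:

```sql
SELECT
    result:value::VARCHAR AS transcript,      -- NULL if the call failed
    result:error::VARCHAR AS error_message    -- NULL if the call succeeded
FROM transcriptions;  -- hypothetical table holding AI_TRANSCRIBE output
```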
For more information about error handling for AI functions, see Snowflake Cortex AI Function: Multirow error handling improvements.
Access control requirements¶
Users must use a role that has been granted the SNOWFLAKE.CORTEX_USER database role. See Cortex LLM privileges for more information on this role.
Usage notes¶
AI_TRANSCRIBE supports the following audio and video file formats:
- Audio: FLAC, MP3, MP4, OGG, WAV, WEBM
- Video: MKV, MP4, OGV, WEBM
Video files must contain at least one audio track in FLAC, MP3, OPUS, VORBIS, or WAV format.
Factors such as sample rate, bit depth, and number of channels do not affect transcription quality, but very high values can make the file too large to process. Internally, AI_TRANSCRIBE uses monophonic audio at 16 kHz and resamples input files that are not already in this format.
The maximum audio file size is 700 MB.
The maximum audio file duration is 60 minutes when timestamp granularity is set to “word” or “speaker”. If timestamp granularity is not used, the maximum duration is 120 minutes.
Supported languages¶
AI_TRANSCRIBE supports the following languages, which are automatically detected. Files can contain multiple supported languages.
Note
Language detection requires audio to begin within the first five seconds of the file. For best results, trim excess silence before uploading.
Arabic
Bulgarian
Cantonese
Catalan
Chinese
Czech
Dutch
English
French
German
Greek
Hebrew
Hindi
Hungarian
Indonesian
Italian
Japanese
Korean
Latvian
Malay
Norwegian
Polish
Portuguese
Romanian
Russian
Serbian
Slovenian
Spanish
Swedish
Thai
Turkish
Ukrainian
Cost considerations¶
Billing for all AI Functions is based on token consumption. For transcription, each second of audio processed is 50 tokens, regardless of language or segmentation method. A full hour of audio is therefore 180,000 tokens. Assuming that processing a million tokens costs 1.3 credits, and that Snowflake credits cost US $3 each, each hour of audio processed costs about US $0.702. This estimate is subject to change. For current pricing information, see the Snowflake Service Consumption Table.
Note
AI_TRANSCRIBE has a minimum billing duration of 1 minute. Files shorter than 1 minute are still processed, but are billed at 1 minute. To efficiently process large numbers of short audio files, consider batching them into a single file and using timestamps to identify the start and end of each original file in the resulting transcription.
Examples¶
Text transcription¶
The following example transcribes an audio file stored in the
financial_consultation stage, returning a text transcript of the entire file. The
TO_FILE function converts the staged file to a file reference.
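A minimal invocation might look like the following (the file name is hypothetical):

```sql
SELECT AI_TRANSCRIBE(
    TO_FILE('@financial_consultation', 'consultation.mp3')  -- hypothetical file name
);
```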
Response:
Word-level segmentation with timestamps¶
Set the timestamp granularity to “word” to extract precise timestamps for every word spoken, enabling searchable,
navigable transcripts. Note that this audio file is in
Spanish.
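A sketch of such a call, with hypothetical stage and file names:

```sql
SELECT AI_TRANSCRIBE(
    TO_FILE('@my_stage', 'entrevista.mp3'),   -- hypothetical Spanish-language recording
    {'timestamp_granularity': 'word'}
);
```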
Response:
Note
The output is truncated for brevity. The full output contains a segment for each word spoken in the audio file.
Speaker recognition¶
Set timestamp granularity to “speaker” to detect, separate, and identify unique speakers in conversations or
meetings. This example uses an audio file with two
speakers, one speaking English and the other Spanish.
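A sketch of such a call, with hypothetical stage and file names:

```sql
SELECT AI_TRANSCRIBE(
    TO_FILE('@my_stage', 'bilingual_call.mp3'),  -- hypothetical two-speaker recording
    {'timestamp_granularity': 'speaker'}
);
```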
Response:
Note
The output is truncated for brevity. The full output contains a segment for each conversational “turn” in the audio file.
Call transcript analysis¶
You can pass the output of AI_TRANSCRIBE to other AI Functions for further processing. For example, you can use
AI_SUMMARIZE to summarize the transcription, or AI_CLASSIFY to classify the content of the transcription. This
example uses AI_SENTIMENT and AI_COMPLETE to analyze the text transcribed from
customer call audio and provide sentiment on four dimensions
and an assessment of the agent.
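A sketch of this pipeline, assuming hypothetical stage, file, and category names; per the Returns section, the result string is parsed with PARSE_JSON before extracting the text, and the model name is only an example:

```sql
WITH call AS (
    SELECT PARSE_JSON(
        AI_TRANSCRIBE(TO_FILE('@call_audio', 'support_call.mp3'))  -- hypothetical stage/file
    ):text::VARCHAR AS transcript
)
SELECT
    -- Sentiment on four hypothetical dimensions
    AI_SENTIMENT(transcript, ['Brand', 'Cost', 'Product quality', 'Wait time']) AS sentiment,
    -- Free-form assessment of the agent (model name is an example)
    AI_COMPLETE('claude-3-5-sonnet',
        'Assess the support agent''s performance in this transcript: ' || transcript) AS agent_assessment
FROM call;
```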
Note
AI_SENTIMENT analyzes only text and does not consider speech characteristics like tone of voice.
AI_SENTIMENT response:
AI_COMPLETE response:
Troubleshooting¶
If the function fails, it raises an error. Common error messages include:
| Error Message | Situation and Solution |
|---|---|
| Invalid options object | The `options` object contains an unsupported field or value. |
| No response from server | The audio file cannot be retrieved, perhaps because of an expired scoped URL. |
| File too large. Maximum size is 734,003,200 Bytes, file exceeds this limit. | The provided audio file exceeds the maximum file size. |
| Invalid file format. Only ['flac', 'mp3', 'ogg', 'wav', 'webm'] files are supported, or WebM file does not contain an audio stream. | The audio file is not one of the supported formats, which are listed in the error message. WebM files support multiple media types, so make sure the file contains an audio stream. If the file is in a supported format, check that it is not corrupted. |
| File will be too large after resampling it to 16000 Hertz. Expected size is 3,355,444,448,000.0 Bytes. | The provided audio file is too large after resampling to 16 kHz. If the provided audio has a lower sample rate, its resampled size is larger than the original and can exceed the maximum allowed file size. |
| Audio duration too long: 6052.10 seconds. Maximum allowed: 3600 seconds. or Audio duration too long: 7335.28 seconds. Maximum allowed: 7200 seconds. | The provided audio file is too long. With timestamp granularity, the maximum duration is 60 minutes (3600 seconds); without, it is 120 minutes (7200 seconds). |
| Unsupported detected language | The audio file contains a language that is not supported by AI_TRANSCRIBE. |
Regional availability¶
AI_TRANSCRIBE is available in the following regions:
AWS US West 2 (Oregon)
AWS US East 1 (N. Virginia)
AWS EU Central 1 (Frankfurt)
Azure East US 2 (Virginia)
Legal notices¶
Refer to Snowflake AI and ML.