snowflake.snowpark.functions.ai_transcribe

snowflake.snowpark.functions.ai_transcribe(audio_file: Column, **kwargs) → Column

Transcribes text from an audio file with optional timestamps and speaker labels.

AI_TRANSCRIBE supports numerous languages (detected automatically), and a single audio file can contain more than one language. Timestamps and speaker labels are returned according to the timestamp_granularity setting described below.

Parameters:
  • audio_file – A FILE type column representing an audio file. The audio file must be on a Snowflake stage that uses server-side encryption and is accessible to the user. Use the to_file() function to create a reference to your staged file.

  • **kwargs

    Configuration settings specified as key/value pairs. Supported keys:

    • timestamp_granularity: A string specifying the desired timestamp granularity. Possible values are:

      • 'word': The file is transcribed as a series of words, each with its own timestamp.

      • 'speaker': The file is transcribed as a series of conversational “turns”, each with its own timestamp and speaker label.

      If this key is not specified, the entire file is transcribed as a single block of text without timestamps (returned in the text field).

Returns:

A string containing a JSON representation of the transcription result. The JSON object contains the following fields (a short parsing sketch follows this list):

  • audio_duration: The total duration of the audio file in seconds.

  • text: The transcription of the complete audio file (when timestamp_granularity is not specified).

  • segments: An array of segments (when timestamp_granularity is set to 'word' or 'speaker'). Each segment contains:

    • start: The start time of the segment in seconds.

    • end: The end time of the segment in seconds.

    • text: The transcription text for the segment.

    • speaker_label: The label of the speaker for the segment (only when timestamp_granularity is 'speaker'). Labels are of the form 'SPEAKER_00', 'SPEAKER_01', and so on.
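
For reference, one way to consume this JSON on the client side is sketched below. This is not part of the API: it assumes a DataFrame df that selects ai_transcribe(..., timestamp_granularity='speaker') into a column named "transcript", as in the speaker-diarization example at the end of this page.

    import json

    raw = df.collect()[0][0]              # the JSON string returned by ai_transcribe
    result = json.loads(raw)

    print(f"duration: {result['audio_duration']:.1f}s")
    for seg in result.get("segments", []):
        # speaker_label is present only when timestamp_granularity is 'speaker'
        label = seg.get("speaker_label", "")
        print(f"[{seg['start']:.2f}-{seg['end']:.2f}] {label}: {seg['text']}")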

Note

  • Supported languages: Arabic, Bulgarian, Cantonese, Catalan, Chinese, Czech, Dutch, English, French, German, Greek, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Polish, Portuguese, Romanian, Russian, Serbian, Slovenian, Spanish, Swedish, Thai, Turkish, Ukrainian.

  • Supported audio formats: FLAC, MP3, Ogg, WAV, WebM

  • Maximum file size: 700 MB

  • Maximum duration: 60 minutes with timestamps, 120 minutes without (see the sketch after this list for a rough local pre-check)
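
The size and duration limits apply to the staged audio. If you want a rough local pre-check before uploading, a minimal sketch is shown below; it assumes the audio exists as a local file, and the commented duration check relies on the third-party mutagen package, which is an assumption rather than a Snowpark dependency.

    import os

    MAX_BYTES = 700 * 1024 * 1024              # 700 MB file-size limit
    path = "tests/resources/conversation.ogg"  # hypothetical local path

    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError(f"{path} exceeds the 700 MB limit for ai_transcribe")

    # Duration limit: 60 minutes with timestamps, 120 minutes without.
    # One way to check it (assumption: the mutagen package is installed):
    # from mutagen import File
    # if File(path).info.length > 60 * 60:
    #     raise ValueError("audio too long to transcribe with timestamps")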

Examples:

>>> import json
>>> # Basic transcription without timestamps
>>> _ = session.sql("CREATE OR REPLACE TEMP STAGE mystage ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')").collect()
>>> _ = session.file.put("tests/resources/audio.ogg", "@mystage", auto_compress=False)
>>> df = session.range(1).select(
...     ai_transcribe(to_file("@mystage/audio.ogg")).alias("transcript")
... )
>>> result = json.loads(df.collect()[0][0])
>>> result['audio_duration'] > 120  # more than 2 minutes
True
>>> "glad to see things are going well" in result['text'].lower()
True

>>> # Transcription with word-level timestamps
>>> df = session.range(1).select(
...     ai_transcribe(
...         to_file("@mystage/audio.ogg"),
...         timestamp_granularity='word'
...     ).alias("transcript")
... )
>>> result = json.loads(df.collect()[0][0])
>>> len(result["segments"]) > 0
True
>>> result["segments"][0]["text"].lower()
'glad'
>>> 'start' in result["segments"][0] and 'end' in result["segments"][0]
True

>>> # Transcription with speaker diarization
>>> _ = session.file.put("tests/resources/conversation.ogg", "@mystage", auto_compress=False)
>>> df = session.range(1).select(
...     ai_transcribe(
...         to_file("@mystage/conversation.ogg"),
...         timestamp_granularity='speaker'
...     ).alias("transcript")
... )
>>> result = json.loads(df.collect()[0][0])
>>> result["audio_duration"] > 100  # more than 100 seconds
True
>>> len(result["segments"]) > 0
True
>>> result["segments"][0]["speaker_label"]
'SPEAKER_00'
>>> 'jenny' in result["segments"][0]["text"].lower()
True
>>> 'start' in result["segments"][0] and 'end' in result["segments"][0]
True