
Audio To Text Node - Make Voice Transcriptions Effortless

The Audio to Text node is a powerful AI component that enables you to convert speech into written text within your automation flows. By leveraging Automatic Speech Recognition (ASR), this node processes audio inputs and generates accurate text outputs across multiple supported languages. Configuration options let you define input sources, specify output formats, and integrate transcription into larger, multi-step processes such as call analysis, feedback collection, or content indexing.

Key Capabilities

  • Automatic Speech Recognition (ASR): Converts spoken audio into accurate written text using OpenAI Whisper-1 for transcription.

  • Multilingual Transcription and Translation Support: Supports transcription in multiple languages and translation into English.

  • Audio Input Handling: Accepts audio files or audio URLs as input, making it suitable for both real-time and recorded speech processing.

  • Large File Handling with Size Limits: Supports audio files up to 25 MB. Larger files can be split at logical points to avoid mid-sentence breaks and ensure smooth, timely, and accurate transcriptions.

  • Text Output Generation: Generates clean, structured text in multiple formats based on prompt instructions, ready for downstream use such as summarization, translation, or storage.

Common Use Cases

  • Meeting and Lecture Transcription: Automatically convert meetings, interviews, or classroom sessions into searchable text.

  • Customer Support Automation: Transcribe voice interactions to feed into chatbots or help desk workflows.

  • Subtitle and Caption Generation: Generate accurate subtitles for video content across platforms.

  • Voice Command Processing: Convert spoken commands into text for use in voice-enabled applications.

  • Audio-Based Translation: Transcribe audio for translation into other languages as part of a multilingual workflow.

Example Use Case

The Audio to Text node processes uploaded customer service call recordings and generates transcribed (and optionally translated) text based on configured parameters and output instructions. These transcriptions can be used to assess conversation quality, evaluate agent performance, and support audits or training efforts. By removing the need for manual transcription or external APIs, the node offers a faster, more efficient, and fully integrated solution for audio processing within your workflow.

How It Works

The Audio to Text node integrates seamlessly into your tool flows, accepting audio inputs (as files or URLs) from previous nodes and passing the transcribed text to subsequent nodes. You can configure parameters such as the processing model, translation preferences, timestamp inclusion, and prompt instructions to tailor the transcription process to your specific needs. The node supports both static and dynamic inputs via context variables, making it highly adaptable for a wide range of voice-driven automation scenarios.


In this document, you will learn how to add the node to your flow, configure it with audio inputs and transcription settings, manage outputs such as text or translated content, and test the results within your workflow.

Selection of the Audio File

You can add audio input in one of the following ways:

  1. Manually select and upload an audio file in the allowed format.
  2. Configure an input variable by selecting Text for Type when adding input variables for the node. Learn more.

You must provide the audio file URL when running the flow, as mentioned here.


Note

Uploading audio files as input variables is not supported; only URLs are supported.

Supported Audio Formats

The following audio file formats are supported by the node:

  • m4a
  • mp3
  • webm
  • mp4
  • mpga
  • wav
  • mpeg

Note

Using other formats will result in a system error.
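If you validate files in your own scripts before they reach the node, the format restriction above can be checked by extension. This helper is illustrative and not part of the platform:

```python
from pathlib import Path

# Extensions accepted by the Audio to Text node (see the list above).
SUPPORTED_FORMATS = {".m4a", ".mp3", ".webm", ".mp4", ".mpga", ".wav", ".mpeg"}

def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension is one the node accepts."""
    return Path(filename).suffix.lower() in SUPPORTED_FORMATS
```

For example, `is_supported_audio("call.mp3")` returns True, while an unsupported format such as `.flac` returns False and would trigger the system error described above.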

Audio File Size Limits

  • Maximum supported file size: 25 MB.
  • For larger files, split them into segments of 25 MB or less before uploading to avoid delays in transcription and output generation.
  • Maintain context and avoid mid-sentence breaks when splitting files.
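The size rule can be checked up front in a custom script. Note that a raw byte-level split would break container formats and cut speech mid-sentence, so the sketch below only reports whether a file fits and how many segments a split would need; do the actual splitting at natural pauses with an audio tool. The helper names are illustrative:

```python
MAX_BYTES = 25 * 1024 * 1024  # 25 MB node limit

def within_limit(size_bytes: int) -> bool:
    """True if the file can be uploaded as-is."""
    return size_bytes <= MAX_BYTES

def segments_needed(size_bytes: int) -> int:
    """Minimum number of <= 25 MB segments for a file of the given size."""
    return max(1, -(-size_bytes // MAX_BYTES))  # ceiling division
```

A 60 MB recording, for instance, needs at least three segments.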

Processing Model

The Agent Platform uses OpenAI Whisper-1 for transcription.
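The platform invokes Whisper-1 for you, but for reference, the options the node exposes map roughly onto OpenAI's audio API as follows. This sketch only assembles request parameters (no network call), and the helper itself is an assumption, not platform code:

```python
def build_whisper_request(audio_path: str, translate: bool = False,
                          prompt: str = "", timestamps: bool = False) -> dict:
    """Assemble parameters for a Whisper-1 call.

    translate=True maps to the translations endpoint (supported language
    -> English); otherwise the transcriptions endpoint is used. Parameter
    names follow the OpenAI audio API; the helper is illustrative only.
    """
    params = {
        "model": "whisper-1",
        "file": audio_path,
        "endpoint": "translations" if translate else "transcriptions",
    }
    if prompt:
        params["prompt"] = prompt  # guides spelling, punctuation, proper nouns
    if timestamps:
        # verbose_json responses include segment-level timestamps
        params["response_format"] = "verbose_json"
    return params
```

Enabling the node's Translation toggle corresponds to the translations path; enabling Timestamps corresponds to requesting a timestamped response format.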

Use Cases

This node is commonly used for:

  • Transcribing meetings, interviews, or lectures.
  • Automating customer service chatbots.
  • Generating subtitles for videos.
  • Voice command processing for applications.
  • Audio translation.

Translation

  • Transcribes and translates speech in non-English languages (see OpenAI Whisper-supported languages) into English when enabled.
  • Inverse translation (English to other languages) is not currently supported.

Important Considerations

Note

Be mindful of the environment where you upload the files: host URLs that work in the Agent Platform may not work in GALE.

  • OpenAI Whisper automatically removes offensive and banned words during transcription.
  • Performance tracking is available under Settings > Model Analytics Dashboard > External Models tab. Learn more.

Metrics include:

  • Minutes transcribed/Minutes of Audio (total audio processed by the node), since Whisper models are billed based on the minutes of audio consumed.
  • Input and output tokens, since Whisper models support only a small number of tokens, making it necessary to track the counts. Learn more.
  • Each model execution is logged on the Model Traces page, displaying summarized data for:
    • Input, Output, and Response Time
    • Translation and Timestamp. Learn more.
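Because billing is per minute of audio, the minutes metric translates directly into cost. The rate below is a commonly published Whisper per-minute price used purely as a placeholder default; check current OpenAI pricing before relying on it:

```python
def estimated_cost(audio_seconds: float, rate_per_minute: float = 0.006) -> float:
    """Estimate Whisper transcription cost in USD.

    0.006 USD/min is used only as a placeholder rate; actual billing
    granularity and pricing may differ.
    """
    minutes = audio_seconds / 60.0
    return round(minutes * rate_per_minute, 6)
```

For example, a ten-minute call (600 seconds) at the placeholder rate costs roughly 0.06 USD.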

Steps to Add and Configure the Node

To add and configure the node, follow the steps below:

Note

Before proceeding, you must add an external LLM to your account using either Easy Integration or Custom API integration.

  1. Log in to your account and click Tools under Agent Platform Modules.

  2. Click the Tools tab on the top navigation bar, and select the tool to which you want to add the node. The Tool flow page is displayed.

  3. Click Go to flow to edit the in-development version of the flow.

  4. In the flow builder, click the + icon for Audio to Text under AI in the Assets panel. Alternatively, drag the node from the panel onto the canvas. You can also click AI in the pop-up menu and click Audio to text.

  5. Click the added node to open its properties dialog box. The General Settings for the node are displayed.

  6. Enter or select the following General Settings:

    • Node Name: Enter an appropriate name for the node. For example, “CustomerSupportConversation.”
    • Audio File: Provide the input variable that is set for the node. Learn more.
    • Select a model from the list of configured models.
    • (Optional) Turn on the toggle for the following to enable the respective feature:
      • Translation: Translate other languages supported by the model to English.
      • Timestamps: Include the time at which each piece of dialog was spoken.
    • Prompt: Provide the instructions that you want the model to follow. User prompts define specific questions or requests for the model. Provide clear instructions, using context variables for dynamic inputs in the recommended syntax: {{context.variable_name}}. For example, you can store the conversation transcript in a variable named “conversation” and pass it in the prompt using {{context.conversation}}. You may also include simple instructions about transcription style, corrections for words or proper nouns the model may have misheard, punctuation fixes, added context, and more.

    Note

    Whisper models process up to 224 tokens in the input prompt and ignore any input exceeding this limit.
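The prompt handling described above can be sketched as two steps: substitute {{context.*}} variables, then guard the 224-token limit. The whitespace-based count below is only a rough proxy for Whisper's actual tokenizer, and both helpers are illustrative, not platform code:

```python
import re

MAX_PROMPT_TOKENS = 224  # Whisper ignores prompt input beyond this limit

def render_prompt(template: str, context: dict) -> str:
    """Replace {{context.name}} placeholders with values from `context`.

    Unknown placeholders are left untouched.
    """
    return re.sub(
        r"\{\{context\.(\w+)\}\}",
        lambda m: str(context.get(m.group(1), m.group(0))),
        template,
    )

def truncate_prompt(prompt: str, limit: int = MAX_PROMPT_TOKENS) -> str:
    """Crude word-based truncation; Whisper's tokenizer counts differently."""
    return " ".join(prompt.split()[:limit])
```

For example, rendering "Transcript: {{context.conversation}}" against a context holding a "conversation" value substitutes that value into the prompt before it is sent to the model.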

    Standard Error

    If a model is not selected, prompt details are not provided, or both, the error message “Proper data needs to be provided in the LLM node” is displayed.

    • Response JSON schema: Define a JSON schema for structured responses. This step is optional and depends on the selected model.
      You can define a JSON schema to structure the model's response if the chosen model supports the response format. By default, if no schema is provided, the model will respond with plain text. Supported JSON schema types include: String, Boolean, Number, Integer, Object, Array, Enum, and anyOf. Ensure the schema follows the standard outlined here: Defining JSON schema. If the schema is invalid or mismatched, errors will be logged, and you must resolve them before proceeding.
      For more information about how the model parses the response and separates keys from the content body, see: Structured Response Parsing and Context Sharing in Workflows.
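As an illustration, a response schema for a call-transcription flow might look like the following. The field names are examples, not platform requirements; the schema must parse as valid JSON when supplied to the node:

```python
import json

# Illustrative response schema for a transcription flow; the property
# names (transcript, language, segments) are examples only.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "transcript": {"type": "string"},
        "language": {"type": "string"},
        "segments": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "start": {"type": "number"},
                    "end": {"type": "number"},
                    "text": {"type": "string"},
                },
            },
        },
    },
    "required": ["transcript"],
}

# Serialize to the JSON text you would paste into the node's settings.
schema_json = json.dumps(RESPONSE_SCHEMA, indent=2)
```

If the supplied schema is invalid or does not match the model's capabilities, the node logs errors that must be resolved before proceeding, as noted above.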
  7. Click the Connections icon and select the Go to Node for success and failure conditions.

  • On Success > Go to Node: After the current node is successfully executed, go to a selected node in the flow to execute next, such as an AI node, Function node, Condition node, API node, or End node.
  • On Failure > Go to Node: If the execution of the current node fails, go to the End node to display any custom error message from the Audio to Text node.
  8. Finally, test the flow and fix any issues found.

Configure and Test the Flow for the Node

Step 1: (Optional) Add Input Variable(s)

  1. Click the Input tab of the Start node, and click Add Input Variable to configure the input for the flow’s test run. Learn more.


  2. Select Text for the Type field in the Enter input variable window to define a text input variable.
  3. Click Save.

In the Input section of the Start node, add all the input variables required to run the flow.

Step 2: Add Output Variable(s)

  1. Click the Output tab for the Start node.
  2. Click Add Output Variable.


  3. Enter the value for Name (key) and select String for Type to generate the transcribed text output.
  4. Click Save. Learn more about accessing the node’s output.

Step 3: Run the Flow

To run and test the flow, follow the steps below:

  1. Click the Run Flow button at the top-right corner of the flow builder.

  2. (Optional) Add the value for Input Variable if you have configured it to test the flow. Otherwise, go directly to the next step.


  3. Click Generate Output.

The Debug window generates the flow log and results, as shown below. Learn more about running the tool flow.
