
Audio To Text Node - Make Voice Transcriptions Effortless

The Audio to Text node is a powerful AI component that enables you to convert speech into written text within your automation flows. By leveraging Automatic Speech Recognition (ASR), this node processes audio inputs and generates accurate text outputs across multiple supported languages. Configuration options let you define input sources, specify output formats, and integrate transcription into larger, multi-step processes such as call analysis, feedback collection, or content indexing.

Key Capabilities

  • Automatic Speech Recognition (ASR): Converts spoken audio into accurate written text using OpenAI Whisper-1 for transcription.

  • Multilingual Transcription and Translation Support: Supports transcription in multiple languages and translation into English.

  • Audio Input Handling: Accepts audio files or audio URLs as input, making it suitable for both real-time and recorded speech processing.

  • Large File Handling with Size Limits: Supports audio files up to 25 MB. Larger files can be split at logical points to avoid mid-sentence breaks and ensure smooth, timely, and accurate transcriptions.

  • Text Output Generation: Generates clean, structured text in multiple formats based on prompt instructions, ready for downstream use such as summarization, translation, or storage.

Common Use Cases

  • Meeting and Lecture Transcription: Automatically convert meetings, interviews, or classroom sessions into searchable text.

  • Customer Support Automation: Transcribe voice interactions to feed into chatbots or help desk workflows.

  • Subtitle and Caption Generation: Generate accurate subtitles for video content across platforms.

  • Voice Command Processing: Convert spoken commands into text for use in voice-enabled applications.

  • Audio-Based Translation: Transcribe audio for translation into other languages as part of a multilingual workflow.

Example Use Case

The Audio to Text node processes uploaded customer service call recordings and generates transcribed (and optionally translated) text based on configured parameters and output instructions. These transcriptions can be used to assess conversation quality, evaluate agent performance, and support audits or training efforts. By removing the need for manual transcription or external APIs, the node offers a faster, more efficient, and fully integrated solution for audio processing within your workflow.

How It Works

The Audio to Text node integrates seamlessly into your tool flows, accepting audio inputs (as files or URLs) from previous nodes and passing the transcribed text to subsequent nodes. You can configure parameters such as the processing model, translation preferences, timestamp inclusion, and prompt instructions to tailor the transcription process to your specific needs. The node supports both static and dynamic inputs via context variables, making it highly adaptable for a wide range of voice-driven automation scenarios.


In this document, you will learn how to add the node to your flow, configure it with audio inputs and transcription settings, manage outputs such as text or translated content, and test the results within your workflow.

Selection of the Audio File

You can add audio input in one of the following ways:

  1. Manually select and upload an audio file in the allowed format.
  2. Configure an input variable by selecting Text for Type when adding input variables for the node. Learn more.

You must provide the audio file URL when running the flow, as mentioned here.


Note

Uploading audio files as input variables is not supported; only URLs are supported.

Supported Audio Formats

The following audio file formats are supported by the node:

  • m4a
  • mp3
  • webm
  • mp4
  • mpga
  • wav
  • mpeg

Note

Using other formats will result in a system error.
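If you validate files in your own scripts before they reach the node, the format restriction above can be checked by extension. This helper is illustrative and not part of the platform:

```python
from pathlib import Path

# Extensions accepted by the Audio to Text node (see the list above).
SUPPORTED_FORMATS = {".m4a", ".mp3", ".webm", ".mp4", ".mpga", ".wav", ".mpeg"}

def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension is one the node accepts."""
    return Path(filename).suffix.lower() in SUPPORTED_FORMATS
```

For example, `is_supported_audio("call.mp3")` returns True, while an unsupported format such as `.flac` returns False and would trigger the system error described above.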

Audio File Size Limits

  • Maximum supported file size: 25 MB.
  • For larger files, split them into segments of 25 MB or less before uploading to avoid delays in transcription and output generation.
  • Maintain context and avoid mid-sentence breaks when splitting files.
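The size rule can be checked up front in a custom script. Note that a raw byte-level split would break container formats and cut speech mid-sentence, so the sketch below only reports whether a file fits and how many segments a split would need; do the actual splitting at natural pauses with an audio tool. The helper names are illustrative:

```python
MAX_BYTES = 25 * 1024 * 1024  # 25 MB node limit

def within_limit(size_bytes: int) -> bool:
    """True if the file can be uploaded as-is."""
    return size_bytes <= MAX_BYTES

def segments_needed(size_bytes: int) -> int:
    """Minimum number of <= 25 MB segments for a file of the given size."""
    return max(1, -(-size_bytes // MAX_BYTES))  # ceiling division
```

A 60 MB recording, for instance, needs at least three segments.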

Processing Model

The Agent Platform uses OpenAI Whisper-1 for transcription.
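The platform invokes Whisper-1 for you, but for reference, the options the node exposes map roughly onto OpenAI's audio API as follows. This sketch only assembles request parameters (no network call), and the helper itself is an assumption, not platform code:

```python
def build_whisper_request(audio_path: str, translate: bool = False,
                          prompt: str = "", timestamps: bool = False) -> dict:
    """Assemble parameters for a Whisper-1 call.

    translate=True maps to the translations endpoint (supported language
    -> English); otherwise the transcriptions endpoint is used. Parameter
    names follow the OpenAI audio API; the helper is illustrative only.
    """
    params = {
        "model": "whisper-1",
        "file": audio_path,
        "endpoint": "translations" if translate else "transcriptions",
    }
    if prompt:
        params["prompt"] = prompt  # guides spelling, punctuation, proper nouns
    if timestamps:
        # verbose_json responses include segment-level timestamps
        params["response_format"] = "verbose_json"
    return params
```

Enabling the node's Translation toggle corresponds to the translations path; enabling Timestamps corresponds to requesting a timestamped response format.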

Use Cases

This node is commonly used for:

  • Transcribing meetings, interviews, or lectures.
  • Automating customer service chatbots.
  • Generating subtitles for videos.
  • Voice command processing for applications.
  • Audio translation.

Translation

  • Transcribes and translates speech in non-English languages (see OpenAI Whisper-supported languages) into English when enabled.
  • Inverse translation (English to other languages) is not currently supported.

Important Considerations

Note

Be mindful of the environment where you upload the files: host URLs that work in the Agent Platform may not work in GALE.

  • OpenAI Whisper automatically removes offensive and banned words during transcription.
  • Performance tracking is available under Settings > Model Analytics Dashboard > External Models tab. Learn more.

Metrics include:

  • Minutes transcribed/Minutes of Audio (total audio processed by the node), since Whisper models are billed based on the minutes of audio consumed.
  • Input and output tokens, since Whisper models support only a small number of tokens, making it necessary to track the counts. Learn more.
  • Each model execution is logged on the Model Traces page, displaying summarized data for:
    • Input, Output, and Response Time
    • Translation and Timestamp. Learn more.
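Because billing is per minute of audio, the minutes metric translates directly into cost. The rate below is a commonly published Whisper per-minute price used purely as a placeholder default; check current OpenAI pricing before relying on it:

```python
def estimated_cost(audio_seconds: float, rate_per_minute: float = 0.006) -> float:
    """Estimate Whisper transcription cost in USD.

    0.006 USD/min is used only as a placeholder rate; actual billing
    granularity and pricing may differ.
    """
    minutes = audio_seconds / 60.0
    return round(minutes * rate_per_minute, 6)
```

For example, a ten-minute call (600 seconds) at the placeholder rate costs roughly 0.06 USD.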

Steps to Add and Configure the Node

To add and configure the node, follow the steps below:

Note

Before proceeding, you must add an external LLM to your account using either Easy Integration or Custom API integration.

  1. Log in to your account and click Tools under Agent Platform Modules.

  2. Click the Tools tab on the top navigation bar, and select the tool to which you want to add the node. The Tool flow page is displayed.

  3. Click Go to flow to edit the in-development version of the flow.

  4. In the flow builder, click the + icon for Audio to Text under AI in the Assets panel. Alternatively, drag the node from the panel onto the canvas. You can also click AI in the pop-up menu and click Audio to text.

  5. Click the added node to open its properties dialog box. The General Settings for the node are displayed.

  6. Enter or select the following General Settings:

    • Node Name: Enter an appropriate name for the node. For example, “CustomerSupportConversation.”
    • Audio File: Provide the input variable that is set for the node. Learn more.
    • Select a model from the list of configured models.
    • (Optional) Turn on the toggle for the following to enable the respective feature:
      • Translation: Translate other languages supported by the model to English.
      • Timestamps: Include the time at which each piece of dialog was spoken.
    • Prompt: Provide the instructions that you want the model to follow. User prompts define specific questions or requests for the model. Provide clear instructions, using context variables for dynamic inputs in the recommended syntax: {{context.variable_name}}. For example, you can store the conversation transcript in a variable named “conversation” and pass it in the prompt using {{context.conversation}}. You may also include simple instructions about transcription style, corrections for words or proper nouns the model may have misheard, punctuation fixes, added context, and more.

    Note

    Whisper models process up to 224 tokens in the input prompt and ignore any input exceeding this limit.
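The prompt handling described above can be sketched as two steps: substitute {{context.*}} variables, then guard the 224-token limit. The whitespace-based count below is only a rough proxy for Whisper's actual tokenizer, and both helpers are illustrative, not platform code:

```python
import re

MAX_PROMPT_TOKENS = 224  # Whisper ignores prompt input beyond this limit

def render_prompt(template: str, context: dict) -> str:
    """Replace {{context.name}} placeholders with values from `context`.

    Unknown placeholders are left untouched.
    """
    return re.sub(
        r"\{\{context\.(\w+)\}\}",
        lambda m: str(context.get(m.group(1), m.group(0))),
        template,
    )

def truncate_prompt(prompt: str, limit: int = MAX_PROMPT_TOKENS) -> str:
    """Crude word-based truncation; Whisper's tokenizer counts differently."""
    return " ".join(prompt.split()[:limit])
```

For example, rendering "Transcript: {{context.conversation}}" against a context holding a "conversation" value substitutes that value into the prompt before it is sent to the model.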

    Standard Error

    If a model is not selected, prompt details are not provided, or both, the error message “Proper data needs to be provided in the LLM node” is displayed.

    • Response JSON schema: Define a JSON schema for structured responses. This step is optional and depends on the selected model.
      You can define a JSON schema to structure the model's response if the chosen model supports the response format. By default, if no schema is provided, the model will respond with plain text. Supported JSON schema types include: String, Boolean, Number, Integer, Object, Array, Enum, and anyOf. Ensure the schema follows the standard outlined here: Defining JSON schema. If the schema is invalid or mismatched, errors will be logged, and you must resolve them before proceeding.
      For more information about how the model parses the response and separates keys from the content body, see: Structured Response Parsing and Context Sharing in Workflows.
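As an illustration, a response schema for a call-transcription flow might look like the following. The field names are examples, not platform requirements; the schema must parse as valid JSON when supplied to the node:

```python
import json

# Illustrative response schema for a transcription flow; the property
# names (transcript, language, segments) are examples only.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "transcript": {"type": "string"},
        "language": {"type": "string"},
        "segments": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "start": {"type": "number"},
                    "end": {"type": "number"},
                    "text": {"type": "string"},
                },
            },
        },
    },
    "required": ["transcript"],
}

# Serialize to the JSON text you would paste into the node's settings.
schema_json = json.dumps(RESPONSE_SCHEMA, indent=2)
```

If the supplied schema is invalid or does not match the model's capabilities, the node logs errors that must be resolved before proceeding, as noted above.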
  7. Click the Connections icon and select the Go to Node for success and failure conditions.

  • On Success > Go to Node: After the current node is successfully executed, go to a selected node in the flow to execute next, such as an AI node, Function node, Condition node, API node, or End node.
  • On Failure > Go to Node: If the execution of the current node fails, go to the End node to display any custom error message from the Audio to Text node.
  8. Finally, test the flow and fix any issues found.

Configure and Test the Flow for the Node

Step 1: (Optional) Add Input Variable(s)

  1. Click the Input tab of the Start node, and click Add Input Variable to configure the input for the flow’s test run. Learn more.


  2. Select Text for the Type field in the Enter input variable window to define a text input variable.
  3. Click Save.

In the Input section of the Start node, add all the input variables required to run the flow.

Step 2: Add Output Variable(s)

  1. Click the Output tab for the Start node.
  2. Click Add Output Variable.


  3. Enter the value for Name (key) and select String for Type to generate the transcribed text output.
  4. Click Save. Learn more about accessing the node’s output.

Step 3: Run the Flow

To run and test the flow, follow the steps below:

  1. Click the Run Flow button at the top-right corner of the flow builder.

  2. (Optional) Add the value for Input Variable if you have configured it to test the flow. Otherwise, go directly to the next step.


  3. Click Generate Output.

The Debug window generates the flow log and results, as shown below. Learn more about running the tool flow.
