Introduction to live streams with AI closed captions and audio translations

Clevercast supports multilingual AI closed captions (also known as subtitles) and AI audio translations (also known as AI dubbing or synthetic voices) for live streams, if this is included in your plan.

You can use Clevercast to add (multilingual) closed captions to your live stream in several ways:

  • Automatic captions through speech-to-text conversion (AI), with optional real-time correction: the audio in the live stream is automatically converted into closed captions through Artificial Intelligence (AI). Optionally, the captions can be corrected by a human editor before they are shown in the video player (and auto-translated into other languages).
  • Automatic multilingual captions through machine translation (AI): closed captions resulting from speech-to-text conversion or manual transcription can be automatically translated into closed captions in multiple languages.
  • Transcription in real time: humans can use Clevercast to type the transcription and/or translation in real time using a browser, resulting in closed captions in the video player. The human transcription can also be used as the source for AI translation into closed captions in different languages.

For each closed caption language, Clevercast lets you add a simultaneous AI audio translation to your live stream. Clevercast uses the text of the closed captions as the basis for generating the synthetic voices in real time and adding them to the live stream.
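The features above can be pictured as a single processing chain: speech-to-text, optional human correction, machine translation per caption language, and text-to-speech per audio language. The following minimal sketch illustrates that flow; all function names are placeholders for illustration, not Clevercast APIs:

    # Hypothetical sketch of the captioning and audio translation chain.
    # All functions are stand-ins, not Clevercast APIs.

    def speech_to_text(audio: bytes, lang: str) -> str:
        return "recognized text"              # stand-in for the ASR engine

    def human_correction(text: str) -> str:
        return text                           # optional step; may leave text unchanged

    def translate(text: str, src: str, dst: str) -> str:
        return f"[{dst}] {text}"              # stand-in for machine translation

    def text_to_speech(text: str, lang: str) -> bytes:
        return text.encode()                  # stand-in for synthetic voice generation

    def process_segment(audio, source_lang, caption_langs, audio_langs):
        text = human_correction(speech_to_text(audio, source_lang))
        captions = {l: translate(text, source_lang, l) for l in caption_langs}
        voices = {l: text_to_speech(captions[l], l) for l in audio_langs}
        return captions, voices

    captions, voices = process_segment(b"...", "en", ["nl", "fr"], ["fr"])

Note how the synthetic voices are generated from the caption text, so a correction made to the captions automatically carries over to the audio translations.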

See our FAQ for the list of currently supported AI closed caption languages and AI audio languages.

Note: Clevercast also supports Remote Simultaneous Interpretation by human interpreters, which can be combined with AI interpreters in the same live stream. For example, you can use a human interpreter for one language and AI interpreters for the others.

Accuracy and latency

Accuracy

Clevercast has a unique live AI technology that captures the audio of a live stream, converts it to text (for closed captions), translates it, and converts the translation back to speech (for audio translations). This results in much higher accuracy than comparable solutions.

One-to-many live streams (not to be confused with meeting technology) always have a certain latency. To substantially improve the accuracy of AI-generated captions and audio translations, Clevercast slightly increases this latency. This achieves several goals:

  • Longer audio fragments can be sent to the Automatic Speech Recognition (ASR) engine. This allows the ASR engine to better interpret the words and construct correct phrases and sentences, resulting in a much more accurate conversion to text (see the sketch after this list).
  • Clevercast has slightly more time to process the ASR input and output. This allows us to improve the readability of the captions and make them easier to understand.
  • Optionally, a human editor can make real-time corrections to the AI-generated captions before they are translated, used as the source for audio translations, and added to the live stream.
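As a rough illustration of the first point, accumulating short audio frames into longer segments before submitting them to ASR gives the engine enough context to form complete phrases. The segment length below is an assumed value for illustration, not Clevercast's actual setting:

    from typing import Iterable, Iterator

    def buffer_audio(frames: Iterable[bytes], frames_per_segment: int = 50) -> Iterator[bytes]:
        # Group short audio frames into longer segments before ASR submission.
        # Longer segments give the ASR engine more context, so it can build
        # correct phrases instead of guessing mid-sentence.
        buf = []
        for frame in frames:
            buf.append(frame)
            if len(buf) == frames_per_segment:
                yield b"".join(buf)
                buf = []
        if buf:
            yield b"".join(buf)        # flush the remainder at end of stream

    # Example: 150 frames of 20 ms each become segments of about 1 second.
    segments = list(buffer_audio([b"\x00" * 320] * 150))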

Latency

The flip side is that live streams with AI-generated languages have a higher delay than regular live streams. Without this delay, the AI conversion would take place while sentences are still being spoken, leading to incorrect word and sentence choices.

When using AI-generated languages, you can expect the live stream to have a delay of approximately 120 seconds. This is the sum of the time required for a number of processes, which varies depending on the choices you make for your live event (a worked example follows this list):

  • The latency inherent in the HTTP Live Streaming (HLS) protocol, currently the de facto standard for live streaming to desktop and mobile: approximately 16 to 30 seconds (normal latency) or 8 to 12 seconds (low latency). You can select low latency for your live stream, but this also gives a real-time corrector less time to edit the AI-generated text.
  • The time needed for an optimal speech-to-text conversion.
  • The time needed for real-time correction (optional). Clevercast allows a real-time corrector to edit the result of the AI speech-to-text conversion, which is also used as the source for the audio translations.
  • The time needed by the AI to generate audio translations. If you don't use AI audio translations, this time is not part of the delay.
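To make the 120-second figure concrete, here is a back-of-the-envelope sum. The HLS range comes from the list above; the other stage durations are assumptions for illustration and vary per event:

    # Illustrative latency budget; only the HLS range comes from this page,
    # the other durations are assumed values.
    hls_latency = 25        # seconds; roughly 16-30 s normal, 8-12 s low latency
    asr_time = 30           # assumed time for an accurate speech-to-text pass
    correction_time = 45    # assumed window for optional real-time correction
    tts_time = 20           # assumed time for AI audio translations (0 if unused)

    total = hls_latency + asr_time + correction_time + tts_time
    print(f"Expected end-to-end delay: ~{total} s")   # ~120 s in this example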

Note: iOS devices (iPhone, iPad) allow the latency to grow to a maximum of 2 minutes. Any live stream (including live streams from YouTube, Facebook, etc.) can therefore have a delay of up to 2 minutes on these devices.

Your viewers will not notice any of this. But as an administrator in Clevercast, you should start the live stream early (at least 2 minutes before the action starts) and stop it late (at least 2 minutes after the action ends; because of iOS devices, it is better to wait four minutes). For more info, see the steps to manage an AI event.

Note: if you use human transcription to create the initial captions (instead of speech-to-text), combined with AI translation, only the regular HLS latency applies.

Simplicity

This way of simultaneously interpreting and generating closed captions greatly simplifies your workflow. You don't need human interpreters, special encoders or extra services. All you have to do is this:

  • Create an event in Clevercast: choose your audio and/or closed caption languages. Embed our player in your website or third-party platform (or use our webinar solution).
  • Broadcast to Clevercast via RTMP or SRT (a fully redundant setup is possible; see the sketch after this list).
  • When the action is about to begin, press the Start button on your Clevercast event page.
  • After the action has ended, press the Stop button on your Clevercast event page.
  • Afterwards, you can download the cloud recording, which includes the audio translations and closed captions. Or you can publish it as a multilingual Video on Demand.
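For the broadcast step, any RTMP- or SRT-capable encoder will do. As a minimal sketch, the snippet below drives the ffmpeg command-line tool from Python to push a local file via RTMP; the ingest URL and stream key are placeholders for the values shown on your Clevercast event page:

    import subprocess

    # Placeholder ingest URL and stream key; use the values from your event page.
    RTMP_URL = "rtmp://ingest.example.com/live"
    STREAM_KEY = "your-stream-key"

    subprocess.run([
        "ffmpeg",
        "-re",                 # pace the input at its native frame rate
        "-i", "input.mp4",     # source; replace with your capture input
        "-c:v", "libx264",     # H.264 video, widely accepted for RTMP ingest
        "-preset", "veryfast",
        "-c:a", "aac",         # AAC audio
        "-f", "flv",           # RTMP uses the FLV container
        f"{RTMP_URL}/{STREAM_KEY}",
    ], check=True)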

Tools to perfect AI captions and audio languages

Clevercast makes it possible to further improve the quality of AI captions and audio translations. These tools are optional.

AI vocabularies

You can set up an AI vocabulary. In this vocabulary, you add terms (names, acronyms, industry jargon, technical phrases ...) that may appear in the live stream, so they are displayed correctly in the captions and pronounced correctly in the audio translations. You can also add custom translation terms to override the AI translations of these terms.
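Conceptually, a vocabulary pairs source-language terms with optional per-language translation overrides. The structure below is purely illustrative and is not Clevercast's actual data format:

    # Purely illustrative representation of an AI vocabulary;
    # not Clevercast's actual format.
    vocabulary = {
        # Terms the ASR should recognize and spell exactly as given.
        "terms": ["Clevercast", "RTMP", "SRT", "HLS"],
        # Custom translation overrides per target language.
        "translations": {
            "Clevercast": {"fr": "Clevercast", "nl": "Clevercast"},  # keep the brand name untranslated
            "closed captions": {"nl": "ondertitels"},
        },
    }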

Real-time correction

The Real-time Correction Room lets you edit the closed captions resulting from speech-to-text conversion in real time, just before they are added to the live stream. This way, you can still make changes to the text of the closed captions and synthetic voices. The correction room is easy to use for first-time users, but it also includes features that allow advanced users to add new terms to the AI vocabulary during the live stream (to avoid repetitive corrections). Watch our video tutorial to see how it works.

This does require human intervention, which adds extra cost and complexity to your project. The advantage, however, is that you ensure high accuracy for all closed captions and audio languages, without separate interpreters and/or transcribers for each language. You can do this yourself or, if you wish, contact us for professional correctors.

Real-time management page

The Real-time Management page allows you to:

  • Watch the incoming stream without delay. It is not yet possible to see closed captions in this real-time player; to see your captions, use the Preview player on the event page.
  • Change the selected speech-to-text language during the live stream, if multiple languages are spoken (currently not supported if your event includes AI audio translations).
  • Temporarily pause AI captioning while you are broadcasting, using the 'Pause AI Captioning' button.
  • See whether your correctors, transcribers or interpreters are online and communicate with them through text chat.

Duration

A live stream with AI captions or audio languages currently has a maximum duration of 24 consecutive hours: the captions and audio languages keep appearing for up to 24 hours after you've set the event to Preview or Started. If your event spans multiple days, set the event to Inactive or Ended during breaks.

The maximum duration of a cloud recording is also 24 hours. Again, you can simply reset the event status to start a new recording.