Introduction to live streams with AI closed captions and speech translations

This section is part of the AI Live Streaming Manual. It explains:

  • How Clevercast delivers live streams with 99+% accurate AI-generated closed captions and speech translations.
  • How Clevercast allows you to increase accuracy to 100%.
  • Why Clevercast delivers live streams with extra latency, thus making it a perfect solution for (most) live streams, but not for online meetings.
  • How on-site event participants can watch the transcription in real-time, without any latency.

Terminology

Closed captions (or captions, for brevity) are a textual representation of the audio content, kept in sync with a live or VoD stream, allowing viewers to read along as they watch the video. They are typically displayed at the bottom of the video player window (or of the screen, in fullscreen mode) and can be turned on or off by the viewer. If closed captions are available in multiple languages, viewers can select their preferred language.

AI speech translations (or AI speech, for brevity) denote the use of human-like, synthetic voices generated by Clevercast to provide audio translations, in sync with a live or VoD stream. At any time, viewers can listen to any of the translations in the video player or switch back to the original audio of the live stream. There are many terms for this, each emphasizing slightly different aspects, such as voice dubbing, voice-over, live dubbing, language revoicing, synthetic voices, real-time voice replacement, AI audio translation and AI simultaneous interpretation.

Unique AI technology for high accuracy

AI captions and speech with 99+% accuracy

Clevercast's technology for AI-generated closed captions and speech translations leverages the latency that comes with the HTTP Live Streaming protocol. By slightly increasing it, we achieve a triple goal:

  • Longer audio streams can be sent to the Automatic Speech Recognition (ASR) engine. This allows the ASR engine to better interpret the words and construct correct phrases and sentences. This results in much higher accuracy than other AI solutions.
  • Clevercast has slightly more time to process the ASR input and output. This allows us to make the captions and/or speech translations more understandable, and to improve the formatting of closed captions.
  • Optionally, a human editor can make real-time corrections to the automatically generated captions (= result of the speech-to-text conversion), before they are translated and added to the live stream as captions and/or speech translations.

In addition, Clevercast uses the best language models and applies pre- and post-processing to improve the quality of closed captions and speech translations. All closed captions and speech translations are kept in sync with the live stream.

Closed captions are rendered with intelligent formatting (making sure that captions and lines are broken at logical points, have the right length ...) and in a readable manner (as correct sentences, adding capitalization and punctuation ...), in accordance with key standards and guidelines.

Clevercast uses the text of the closed captions as the basis for AI speech translations. For each language, the text undergoes additional tuning to obtain the ideal source for text-to-speech conversion (e.g. longer text fragments result in more intelligible speech).

Getting to 100% accuracy

Clevercast's 99+% accuracy for commonly spoken languages is vastly superior to what other AI-powered caption solutions offer. However, we take it one step further by providing the tools and/or services to reach 100% accuracy:

  • Vocabularies: Clevercast lets you create a vocabulary for each event. A vocabulary lets you add terms (names, acronyms, industry jargon, technical phrases ...) that may appear in the live stream (or VoD), so that they appear correctly in the AI captions and speech translations.
  • Real-time correction: human editors can correct speech-to-text conversion errors in real time, making sure that the source for captions and speech translations is flawless. Clevercast provides an interface where you can do this yourself as a customer, or we can find professional correctors to do this for your event.
  • Auto-detection & manual selection of the spoken language: if multiple languages are spoken in the live stream, Clevercast can automatically detect which language is being spoken and switch to the appropriate language model. Alternatively, you can make this adjustment yourself.

Latency

One-to-many live streams (not to be confused with meeting technology) always have a certain latency (see our general event management guide for more info). Since Clevercast increases this latency for live streams with AI-generated captions and/or speech translations, they are delivered with more delay than others. Without this delay, the AI conversion would take place while the sentences are still being spoken, leading to less accuracy.

When using AI-generated captions and/or speech translations, you can expect the live stream to have a delay of 60 to 120 seconds. This is the sum of the time required for a number of processes, which varies depending on the choices you make for your live event:

  • The latency inherent in the HTTP Live Streaming (HLS) protocol. You can select low latency if you only need closed captions, but we don't recommend it: accuracy will be lower, and an (optional) real-time corrector will have less time to edit the AI-generated text.
  • The time needed for an optimal speech-to-text conversion.
  • The time needed for real-time correction (optional). Clevercast allows a real-time corrector to edit the result of the AI speech-to-text conversion, which is also used as the source for audio languages.
  • The time needed by AI to generate speech translations. If you don't use AI speech translations, this time won't be part of the delay.
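The way these components add up can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the class name, the field names and all durations below are hypothetical placeholders, not values measured or published by Clevercast. The optional components default to zero, so they only contribute to the delay when they are used.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    """Components of the total stream delay, in seconds (all values illustrative)."""
    hls_s: int                     # latency inherent in the HLS protocol
    speech_to_text_s: int          # time for speech-to-text conversion
    correction_s: int = 0          # optional: real-time correction
    speech_translation_s: int = 0  # optional: AI speech translations

    def total(self) -> int:
        # The overall delay is simply the sum of the active components.
        return (self.hls_s + self.speech_to_text_s
                + self.correction_s + self.speech_translation_s)


# Hypothetical example values, chosen only to demonstrate the sum.
captions_only = LatencyBudget(hls_s=30, speech_to_text_s=30)
full_pipeline = LatencyBudget(hls_s=30, speech_to_text_s=30,
                              correction_s=20, speech_translation_s=20)

print(captions_only.total())  # 60
print(full_pipeline.total())  # 100
```

The actual delay of an event depends on the choices made in Clevercast, so the numbers above should not be used for planning.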

Note: iOS devices (iPhone, iPad) allow latency to grow to a maximum of 2 minutes. Any live stream (also live streams from YouTube, Facebook ...) can therefore have a delay of 2 minutes on these devices.

Your viewers will not notice any of this. But as an administrator in Clevercast, you should start the live stream earlier (at least 2 minutes before the action starts) and stop it later (at least 2 minutes after the action ends, but because of iOS devices it is better to wait 4 minutes). For more info, see the steps to manage an AI event.

Note: you can also use human subtitling to create the initial closed captions, instead of AI speech-to-text. In that case, only the regular HLS latency applies, even when combined with AI translation.
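As a planning aid, the start and stop margins described above can be turned into concrete clock times. The following is a minimal sketch, assuming the margins of 2 minutes before the action and 2 minutes (or 4 minutes, to account for iOS devices) after it. The function name, its parameters and the example date are hypothetical and not part of Clevercast.

```python
from datetime import datetime, timedelta

# Margins taken from the guidance above.
START_MARGIN = timedelta(minutes=2)     # start at least 2 minutes early
STOP_MARGIN = timedelta(minutes=2)      # stop at least 2 minutes late
STOP_MARGIN_IOS = timedelta(minutes=4)  # safer margin because of iOS devices


def stream_window(action_start: datetime, action_end: datetime,
                  account_for_ios: bool = True) -> tuple[datetime, datetime]:
    """Return the latest time to press Start and the earliest time to press Stop."""
    stop_margin = STOP_MARGIN_IOS if account_for_ios else STOP_MARGIN
    return action_start - START_MARGIN, action_end + stop_margin


# Hypothetical event: the action runs from 14:00 to 15:30.
press_start, press_stop = stream_window(
    datetime(2025, 1, 15, 14, 0),
    datetime(2025, 1, 15, 15, 30),
)
print(press_start.strftime("%H:%M"))  # 13:58
print(press_stop.strftime("%H:%M"))   # 15:34
```

Both margins are minimums, so starting earlier or stopping later than the computed times is also fine.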

Transcription without latency

Clevercast is a live streaming solution, not a meeting solution. It is not suited to providing back-and-forth interpretation for participants in a meeting.

However, we do have a solution without latency for event participants (on-site and hybrid). Clevercast can provide them with transcripts of what is being said, without any delay. This way, participants who speak another language or are hard of hearing can follow along in real time with what is being said (e.g. by reading the transcript on a large screen or smartphone).

Simplicity

Using AI closed captions and speech translations greatly simplifies your workflow. You don't need human interpreters, special encoders or extra services. All you have to do is this:

  • Create an event in Clevercast: choose your audio and/or closed caption languages. Embed our player in your website or third-party platform (or use our webinar solution).
  • Broadcast to Clevercast via RTMP or SRT (a fully redundant setup is possible).
  • When the action is about to begin, press the Start button on your Clevercast event page.
  • After the action has ended, press the Stop button on your Clevercast event page.
  • Afterwards, you can download the cloud recording, which includes the speech translations and closed captions, or publish it as a multilingual Video on Demand.