Introduction to live streams with closed captions


Clevercast supports closed captions for live streams, if this is included in your plan.

You can add (multilingual) closed captions to your live stream in several ways:

  • Transcription in real time: humans can use Clevercast to type the transcription and/or translation in real time using a browser, resulting in closed captions in the video player.
  • Speech-to-text with manual correction: the floor audio, assuming it consists of a single language, can be automatically converted into closed captions through Artificial Intelligence (AI). Before they are shown in the video player (or auto-translated into other languages), the captions can be corrected by a human editor using a browser. Such a live stream has a total delay of about 120 seconds.
  • Automatic translation: closed captions resulting from speech-to-text conversion or manual transcription can be automatically translated into closed captions in multiple languages.

When using speech-to-text, viewers will see the live stream with a two-minute delay. This is necessary to render the closed captions as accurately as possible. During the delay, the AI engine can take in the full context of what is being said, which helps it interpret the words correctly and place them in a sentence. Without this delay, conversion would often have to occur before the speaker has finished a sentence, leading to incorrect choices of words and phrases. The delay also makes it possible to manually correct the captions before they are auto-translated or shown in the stream.

When using transcription, with or without auto-translation, there is only the normal HTTP Live Streaming delay of about 18-30 seconds.

All types of closed captions can be combined with real-time audio translations through Translate@Home. Any number of languages is possible. However, combining speech-to-text and manual transcription in a single event is currently not possible.

Note that an event with closed captions currently has a maximum duration of 24 consecutive hours. The closed captions will keep appearing for 24 hours after you've set the event to Preview or Started. If your event spans multiple days, you should set the event to Inactive or Ended during breaks.
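
When planning a multi-day event, it can help to work out in advance when the 24-hour caption window runs out. The following is a minimal sketch, not part of Clevercast: the function names `captions_end_time` and `captions_available` are hypothetical, and it assumes the window starts at the moment the event is set to Preview or Started.

```python
from datetime import datetime, timedelta, timezone

# Maximum caption duration stated in this manual.
CAPTION_WINDOW = timedelta(hours=24)


def captions_end_time(activated_at: datetime) -> datetime:
    """Return the moment captions stop appearing (hypothetical helper).

    `activated_at` is when the event was set to Preview or Started.
    """
    return activated_at + CAPTION_WINDOW


def captions_available(activated_at: datetime, moment: datetime) -> bool:
    """Check whether `moment` falls inside the assumed 24-hour window."""
    return activated_at <= moment < captions_end_time(activated_at)


if __name__ == "__main__":
    # Example date chosen for illustration only.
    start = datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)
    print("Captions stop at:", captions_end_time(start).isoformat())
```

If a session on the second day would fall outside this window, that is a sign the event should be set to Inactive or Ended during the break, as described above.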

Accuracy of closed captions

The source from which closed captions are generated strongly determines their accuracy. That’s why a distinction must be made between closed captions based on speech-to-text and those based on manual transcription.

If closed captions are generated through manual transcription, the accuracy depends on the human transcriber. If they do a good job, the source of the automatic translations will also be good, which has a favorable impact on the accuracy of the auto-translated captions.

For closed captions resulting from speech-to-text conversion (directly or translated), the following elements determine the accuracy:

  • The clarity of the audio and of the speaker (e.g. articulation, speed, accent, dialect).
  • The speaker's language: speech-to-text conversion usually works better for common languages (e.g. English, Spanish).
  • Word usage: if many technical or infrequent words are used, this often has a negative effect. Names and abbreviations are also often not recognized. This can be improved by defining a speech context, which allows Clevercast to pass these words and phrases on to the speech-to-text engine.

Speech-to-text: correction interface and stream delay

Since speech-to-text conversion is never 100% accurate, we strongly recommend using the interface for real-time correction of the generated closed captions. This allows you to edit the closed captions just before they are shown in the video player, so errors can still be corrected. See the Speech-To-Text Correction manual for how to use this interface.

If you use speech-to-text, your live stream will have a total delay of between 120 and 140 seconds (this includes the HLS latency). This is necessary to allow for the 30 seconds of correction time, but also because speech-to-text services (currently) still need a relatively long time to obtain an optimal result. Because of this, the accuracy of closed captioning in Clevercast is much higher than in any other speech-to-text solution on the market, making life much easier for correctors.
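
To get a feel for what these delays mean in practice, the sketch below estimates when viewers will see a moment that was spoken at a given time. It is an illustration only: the name `display_window` and the mode keys are hypothetical, and the ranges are simply the figures quoted in this manual (about 18-30 seconds for transcription, 120-140 seconds for speech-to-text).

```python
from datetime import datetime, timedelta

# Total viewer delay in seconds, as quoted in this manual.
DELAY_RANGES_SECONDS = {
    "transcription": (18, 30),     # normal HLS delay only
    "speech_to_text": (120, 140),  # includes HLS latency and correction time
}


def display_window(spoken_at: datetime, mode: str) -> tuple[datetime, datetime]:
    """Return the earliest and latest time viewers see what was said at `spoken_at`."""
    low, high = DELAY_RANGES_SECONDS[mode]
    return (
        spoken_at + timedelta(seconds=low),
        spoken_at + timedelta(seconds=high),
    )


if __name__ == "__main__":
    # Example time chosen for illustration only.
    spoken = datetime(2024, 5, 1, 10, 0, 0)
    for mode in DELAY_RANGES_SECONDS:
        earliest, latest = display_window(spoken, mode)
        print(f"{mode}: {earliest:%H:%M:%S} to {latest:%H:%M:%S}")
```

An estimate like this is useful when timing interactive elements such as a Q&A or poll, since viewers of a speech-to-text stream react roughly two minutes after the speaker.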

Current limitations

Speech-to-text is currently not ideal for a live stream with speakers in different languages, since the speech-to-text service expects the source language (and dialect) to be set in advance. In a future version, it may become possible to adjust this during a live stream.

We expect the accuracy of speech-to-text and translation to improve considerably in the future. The speech-to-text services used by Clevercast are constantly evolving. We may also add new modules for specific languages or features.

Chat with transcription and correction rooms

Clevercast allows event managers to chat with transcribers or correctors while the event status is Preview, Started or Paused. On the event page, click the ‘Manage Language Rooms’ button. When you get to the page, first press the ‘Connect’ button to connect to the rooms and enter your name. You can then chat with people in separate rooms or send messages to all rooms at once.

The player on this page has the same delay as the video player in the transcription room (= no delay) or in the correction room (= delay after speech-to-text conversion). It currently doesn’t allow you to see closed captions, though. To check the closed captions, use the Preview player on the event page, where the live stream with closed captions has a two-minute delay.