How to Create Captions and Make Your Web Content Accessible

What Are Captions and Why Do They Matter?


What are Captions?  


Captions are text versions of the spoken word presented within multimedia to provide web accessibility.

Captions make the content of web audio and video accessible to those who do not have access to audio.  Captioning is primarily intended for those who cannot hear the audio.  However, it also helps people who can hear it, such as those who are not fluent in the language spoken or for whom it is not their primary language.

Common web accessibility guidelines require that captions be:

  • Synchronized – the text content should appear at approximately the same time that audio would be available
  • Equivalent – content provided in captions should be equivalent to that of the spoken word
  • Accessible – caption content should be readily accessible and available to those who need it


Captions unlock your web content, ensuring it is accessible to anyone who requires captioning on video content.

Why Are Captions Crucial to Web Accessibility?


Whenever multimedia content is present on the web, synchronized and equivalent captions are required to meet accessibility guidelines.  Both audio-only content and video content that includes audio must be captioned.  This applies to audio and video played through multimedia players and through HTML5 video alike.  It also applies to Flash or Java technologies when audio content is part of a multimedia presentation.


The Real-time Captioning Dilemma


Web multimedia is increasingly used to deliver real-time, live content over the Internet.  This ranges from video conferencing, to VoIP (Voice-over-Internet Protocol), to live video streaming.  Accessibility standards require that equivalent alternatives be provided for all audio and visual content to ensure the content is accessible to everyone.


Captions on a web media player are provided for accessibility as required by law

For real-time web multimedia, this means that visual content must be provided in an audible form and that audible content must be provided in a visual form.  The equivalents must also be synchronized with the presentation, meaning that they must be delivered to the end user at the same time as the main content.  This means that captions for audio must display at the same time that the audio would be heard.


Accessibility Requires an Audio Description for Visual Content


The alternative to visual content in standard web media often takes the form of audio descriptions, in which visual content that is not also conveyed in the audio stream is described by a narrator or other person.  Audio descriptions are very difficult to incorporate into real-time web broadcasts.  As an alternative, you can simply ensure that any visual content is natively described in the audio.

For instance, if there is a person speaking on the video, they could audibly describe any additional visual content that is displayed in the video.  This removes the need for a separate form of audio description.

This is the only feasible way to make live web broadcasts that include visual information accessible to individuals who are blind or have low vision.  If produced with this in mind, and if those involved in the video broadcast are aware of and provide these descriptions, then the multimedia will be accessible to these audiences.



Synchronized Web Text Captions


The alternative to audible content in standard web media is usually synchronized captions.  Captions provide a textual equivalent of all audible information.  The difficulties in generating real-time captions are:

  1. Audio information must be converted into text in real time.
  2. The text captions must be delivered to the end user so they are synchronized with the audio.


Generating Real-Time Text Captions


Converting audio information into text in real time is difficult.  There are few typists capable of typing fast enough to transcribe the spoken word.  Thus, there are two primary technologies used to do this.


Stenography/Real-time transcription


Stenography involves a trained transcriptionist who uses a special steno machine to transcribe the spoken word to a text format in real time.  The steno machine has fewer keys than a typical keyboard.  Rather than typing each letter, a stenographer hits key sequences on the steno machine to represent phonetic parts of words or phrases, or special codes representing words.  Software then analyzes the phonetic information and forms words.  Such technology allows a trained transcriptionist to generate text versions of audible conversation in real time.
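
The translation step can be illustrated with a toy sketch in Python.  The stroke spellings and the dictionary below are made up for illustration and are not taken from any real steno theory or steno software.  The point is only that each stroke (a group of keys pressed together) is looked up as a unit, and that a stroke the software does not recognize has to be flagged rather than silently dropped.

```python
# Toy stroke-to-text translation. The dictionary entries are hypothetical
# and are NOT taken from a real steno theory or product.
TOY_DICTIONARY = {
    ("KAT",): "cat",
    ("TKOG",): "dog",
    ("-T",): "the",
    ("KAPGS",): "caption",
    ("KAPGS", "-G"): "captioning",  # an entry made of two strokes
}

# Longest entry, in strokes, so the lookup knows how far ahead to try.
MAX_STROKES = max(len(key) for key in TOY_DICTIONARY)


def translate(strokes):
    """Translate a list of strokes to text using greedy longest-match lookup."""
    words = []
    i = 0
    while i < len(strokes):
        # Try the longest run of strokes first, then shorter runs.
        for size in range(min(MAX_STROKES, len(strokes) - i), 0, -1):
            key = tuple(strokes[i:i + size])
            if key in TOY_DICTIONARY:
                words.append(TOY_DICTIONARY[key])
                i += size
                break
        else:
            # Unknown stroke: keep it visible so a person can correct it.
            words.append(f"[{strokes[i]}]")
            i += 1
    return " ".join(words)
```

With this toy dictionary, `translate(["-T", "KAT"])` yields "the cat", while an unrecognized stroke is left in brackets, which is one way mistranslated strokes could be surfaced for later correction.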

While accuracy levels are high, it is common for words to be incorrectly typed or misinterpreted by the steno software.  Real-time transcription can also be expensive, usually costing around $70-$120 USD per hour.


Voice Recognition Generated Captions


While voice recognition offers great possibilities for real-time generation of captions, the technology is not yet at a level where it can be relied on to do so in general.  In certain settings, such as when a single speaker uses well-trained voice recognition software, it may be a viable option.  Even in such settings, however, there are weaknesses, such as a lack of punctuation, poor accuracy, and the inability to caption other speakers.

While voice recognition technology is improving and promises future multi-user, highly accurate, speaker-independent voice recognition, at this time its feasibility in generating text for use in captions is limited to a few situations.


Real-time Caption Delivery


As soon as the text equivalent of the audio has been generated, that text must be delivered to the end user so it is synchronized with the audio stream.  Unfortunately, few real-time multimedia technologies have native support for captioning.  Thus, the real-time captions must usually be delivered through a different technology running parallel to the multimedia software or hardware.  This is often done through dedicated applications or through clients that are built into a web page and run in a web browser.

For video conferencing and voice chats, where the audio is delivered in real time, the captions must be generated, converted into a format for broadcast across the Internet, and then delivered to the end user – all in real time.  For streaming video, there is often a delay between when the media is captured and when it displays to the end user, often due to encoding and buffering.  In these cases, the delivery mechanism for the real-time captions must ensure that the captions display at roughly the same time that the audio would be heard, even if the delay between caption generation and delivery is long.
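
One way to picture the streaming case is a small holding queue, sketched below in Python.  This is only an illustration under stated assumptions, not a description of any particular captioning product: the class name, the method names, and the idea of a single fixed `stream_delay_seconds` value are all hypothetical, and captions are assumed to arrive in the order they were captured.

```python
from collections import deque


class DelayedCaptionQueue:
    """Hold captions until the delayed video stream catches up to them."""

    def __init__(self, stream_delay_seconds):
        # Assumed to be measured or configured elsewhere; it stands for the
        # encoding and buffering delay between capture and playback.
        self.stream_delay = stream_delay_seconds
        self._pending = deque()  # (display_time, text), in capture order

    def push(self, capture_time, text):
        """Register a caption for audio captured at capture_time (seconds)."""
        self._pending.append((capture_time + self.stream_delay, text))

    def due(self, now):
        """Return, in order, every caption whose display time has arrived."""
        ready = []
        while self._pending and self._pending[0][0] <= now:
            ready.append(self._pending.popleft()[1])
        return ready
```

For example, with a delay of 8 seconds, a caption pushed for audio captured at second 0 is not returned by `due(5.0)` but is returned by `due(8.5)`, which is roughly when a viewer of the delayed stream would hear that audio.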


Web Accessibility Requires Captioned Multimedia


Open captions are similar to closed captions and include the same text.  However, open captions are a permanent part of the video picture and cannot typically be turned off.  They are not decoded by the television set, but are part of the video information itself.  Producing them typically requires a video editing or encoding program that allows you to overlay titles onto the video.

Because open captions are visible to anybody viewing the video clip, this technique gives you full control over caption location, size, color, font, and timing.  It can, however, be very time consuming and expensive to produce.


For web video, captions can be open, closed, or both.  Closed captions are most common, utilizing functionality within video players and browsers to display closed captions on top of or immediately below the video area.

The most common forms of web multimedia – Flash and HTML5 Video – both support captioning.  Older technologies, such as Windows Media Player, QuickTime, and RealPlayer, also support captioning.  The formats and techniques for authoring and implementing captions may vary based on the technology used.
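
As one concrete example of a caption format, HTML5 video can load closed captions from a WebVTT text file.  This article does not prescribe a format, so the choice of WebVTT here is an assumption, and the function names below are hypothetical.  The Python sketch shows how a list of timed captions might be written out as WebVTT; it does not validate or escape the caption text.

```python
def format_timestamp(seconds):
    """Format a number of seconds as a WebVTT timestamp, hh:mm:ss.ttt."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_webvtt(cues):
    """Build WebVTT text from (start_seconds, end_seconds, text) tuples."""
    blocks = ["WEBVTT"]  # required first line of a WebVTT file
    for start, end, text in cues:
        if end <= start:
            raise ValueError("a cue must end after it starts")
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{timing}\n{text}")
    # Blocks are separated by blank lines.
    return "\n\n".join(blocks) + "\n"
```

The resulting text could be saved to a file and referenced from a `<track kind="captions">` element inside the page's `<video>` element, so that the player can show or hide the captions.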


Wrapping Up Web Captioning


While captioning real-time web multimedia is not always easy, it is possible and should always be done when real-time multimedia is being delivered.  Fortunately, the technologies are improving to a level that allows real-time captioning to be both easy and financially viable in most situations.

The technologies used to provide real-time captions over the web are not limited to providing those captions as an alternative to web-based multimedia only.  Such caption delivery systems can also be used to provide captions for non-web-based technologies such as radio, television, video conferencing, etc.  This will ensure accessibility to all forms of live, real-time multimedia.


For Web Accessibility Captions Matter


Captioning real-time web multimedia is not always easy, but it should always be done when real-time multimedia is being delivered.  Technologies are improving and should soon make real-time captioning both easier and more financially viable in most situations.

Today’s technologies are not limited to providing captions as an alternative for web-based multimedia only.  These caption delivery systems can also provide captions for non-web-based technologies.  This will ensure accessibility to all forms of live, real-time multimedia.

