Welcome to the jungle: caption and subtitle formats in video streaming

As the latest release of the Unified Streaming Platform adds full WebVTT support for offline and dynamic packaging, all major delivery formats for text tracks in HTTP video streaming are now supported. To help you choose the right format, this blog post provides a guide through the maze of their profiles and specifications. From WebVTT to TTML and from embedded to burned-in subtitles.

While sound and moving images are the two kinds of media content that most people will associate with a successful video streaming setup, the importance of text should not be overlooked. Serving captions and/or subtitles along with your video content might be required by law, or you might want to make your content more accessible in order to broaden your audience.

However, adding captions and subtitles to streaming video is anything but a straightforward affair. It starts with the distinction between captions and subtitles, because: what is the difference anyway?

Whereas the term 'captions' generally communicates a text track for the hard of hearing, containing descriptions of both dialog and audio, the term 'subtitles' usually signifies a text track that contains a translation of the dialog. [1]

Having said that, the distinction is culturally dependent and less clear in practice, with the term 'captions' being 'American' and 'subtitles' being more 'European'. For the purpose of readability, this text will stick with 'subtitles' and only use 'captions' when the context calls for it.


From 'Monty Python and the Holy Grail'

Unfortunately, confusion does not end here. The variety of formats and profiles for text tracks is dizzying. Even worse: their specifications are defined in such broad ways that anything seems possible, whereas in the field, support for many of the specified features is limited.

Providing some form of guidance through the jungle of text track formats and profiles, this blog post will shine a light on the most prominent ones. To start off easy: 'only' four methods to carry text tracks for HTTP video streams are truly distinct. Below, the two listed first are native to the web, while the latter are legacy approaches:

  • TTML (XML-based)
  • WebVTT (plain text)
  • Embedded (e.g., CEA-608 and CEA-708)
  • Burned-in

To end this introduction on a high note before diving into the specifics of each of the formats above: Unified Streaming Platform (USP) supports all of the text track formats listed, with broad support for WebVTT added in the latest release (as detailed in the release notes of version 1.7.31). Specific information about USP's compatibility with each format is presented after the in-depth description of the format, under 'Support by USP'.

So now, the real stuff. Take one last deep breath and prepare yourself to be submerged in a wondrous world of specifications, constraints, extensions, and, above all: text.

TTML

XML-based subtitles were developed more than a decade ago by the W3C Timed Text working group. Their specification has become a W3C recommendation and is most often referred to as TTML (Timed Text Markup Language). [2]

Inside a TTML XML-file that does not make use of any advanced features, the part that contains the actual subtitles will look something like this:

<body>
  <div>
    <p begin="00:00:00.000" end="00:00:02.000">
      This is a subtitle<br/>
      on two lines
    </p>
  </div>
</body>
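
As a rough illustration of how such a document can be read programmatically, the following Python sketch extracts the timing and text of each paragraph using only the standard library. It assumes a complete document whose root element declares the TTML namespace http://www.w3.org/ns/ttml (the fragment above omits the root), and it only understands timestamps in the hh:mm:ss.fff form used here. The function names are made up for this example and are not part of any tool mentioned in this post.

```python
import xml.etree.ElementTree as ET

TTML_NS = "http://www.w3.org/ns/ttml"  # namespace of TTML content elements

# The fragment from above, wrapped in an assumed minimal <tt> root element.
SAMPLE = """<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <body>
    <div>
      <p begin="00:00:00.000" end="00:00:02.000">
        This is a subtitle<br/>
        on two lines
      </p>
    </div>
  </body>
</tt>"""


def clock_to_seconds(value):
    """Convert an hh:mm:ss.fff clock time to seconds (other TTML time forms are not handled)."""
    hours, minutes, seconds = value.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def paragraph_lines(p):
    """Return the text lines of a <p>, starting a new line at each direct <br/> child."""
    lines, current = [], [p.text or ""]
    for child in p:
        if child.tag == f"{{{TTML_NS}}}br":
            lines.append(current)
            current = []
        else:
            # Text of other inline elements (e.g. <span>) is kept, styling is ignored.
            current.append("".join(child.itertext()))
        current.append(child.tail or "")
    lines.append(current)
    # Collapse runs of whitespace so that XML indentation does not leak into the text.
    return [" ".join("".join(parts).split()) for parts in lines]


def read_cues(document):
    """Return a list of (begin_seconds, end_seconds, text_lines) tuples."""
    root = ET.fromstring(document)
    return [
        (clock_to_seconds(p.get("begin")), clock_to_seconds(p.get("end")), paragraph_lines(p))
        for p in root.iter(f"{{{TTML_NS}}}p")
    ]


cues = read_cues(SAMPLE)
print(cues)
```

Real-world TTML tends to be less tidy: timing may be inherited from parent elements or expressed in frames or ticks, which this sketch deliberately ignores.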

What is currently known as the TTML specification started out from the idea that a sort of super-format would need to provide a solution for the wide variety of text track formats in use at the time. This was back in the day when technologies like Flash, RealPlayer and QuickTime defined the online video landscape.

The idea of the new super-format came down to this, more or less: it should encompass all functionalities from existing timed text formats so that it would provide the perfect interchange format between applications.

Of course, with such an aim, the new format's specification became rather broad. Not only should it work for the web, but for television as well. Plus, it needed to be well-suited for authoring subtitles. In short: it needed to do everything.

Fortunately, support for all of TTML's features isn't necessary to create a working implementation. This is where so-called profiles come in. Based on the main specification, they are developed to provide guidance and to ensure interoperability by specifying constraints, extensions, or both.

Sometimes such profiles are part of a specification itself and sometimes they are developed by third parties. The most important TTML profiles are: DFXP, SMPTE-TT, EBU-TT, SDP-US, CFF-TT, and, last but not least, IMSC1, the new kid on the block that has been developed to ensure interoperability between the other profiles.


A Venn diagram of the different TTML profiles

DFXP

Compared to the other profiles, DFXP is kind of a strange beast. Instead of being defined by an external party, it is part of the TTML specification itself and consists of three varieties. Also, it predates the use of the 'TTML' name, as the specification was often referred to as 'DFXP' before it became a W3C recommendation. This used to be reflected in the file extension, which was .dfxp before .ttml was settled on – although .xml is sometimes used as well. The net result of this: confusion.

That aside, the TTML specification defines a Transformation, Presentation and Full DFXP profile. The latter encompasses the full TTML specification, while the first two are subsets of the Full DFXP profile and each define a minimum list of TTML features that need to be supported in order to serve their respective purposes (i.e., transformation into other formats and presentation of subtitles for playout).

  • Specification of DFXP (part of TTML)

SMPTE-TT

SMPTE-TT was developed by the Society of Motion Picture and Television Engineers [https://www.smpte.org]. It differs from most other profiles by not narrowing down the main specification, but adopting its complete set of features, and extending it even further. SMPTE-TT adds three features, which are mostly concerned with providing backwards compatibility with legacy subtitling technology:

  • Usage of bitmap images that contain subtitles
  • The possibility to carry binary data (containing the bitstream of another format) [3]
  • A way to declare whether a legacy or enhanced mode is used

In 2012, SMPTE-TT was designated a 'safe harbor' by the US Federal Communications Commission (FCC). This means that online video content using captions that adhere to SMPTE-TT will comply with the US's 21st Century Communications and Video Accessibility Act.

EBU-TT(-D)

EBU, the European Broadcasting Union, defines two TTML profiles. First there is EBU-TT, meant for archiving and as an interchange format. This profile also informs the second profile, EBU-TT-D, which is specifically meant for distribution over IP-based networks. Thus, only the latter is relevant within the context of this article.

Differing from SMPTE-TT, EBU-TT-D defines constraints on the TTML specification, making it slightly easier to implement support for it, as fewer features have to be taken into account. However, EBU-TT-D isn't a pure subset of the TTML specification as it defines some stylistic extensions to its parent (i.e., line padding and a modifier for text alignment).

Thus, it is possible that a player that fully supports the TTML specification does not include support for all features of EBU-TT-D. Yet, as opposed to SMPTE-TT, the extensions defined by EBU-TT-D are such that when support for these features is absent, playout behavior will be inferior but still acceptable.

The EBU-TT-D profile is the required implementation for online subtitles for HbbTV and the preferred implementation for DVB-DASH. Also, if you are working with EBU-TT and EBU-TT-D, the BBC Subtitle Guidelines document is a highly recommended read.

SDP-US

This is the most constrained of all TTML profiles, at least with regard to the TTML features that it allows for. It was developed with the specific aim of providing a minimum level of interoperability between TTML and the legacy caption formats used in the US market (CEA-608 and CEA-708).

CFF-TT

This specification has been developed by the Digital Entertainment Content Ecosystem (DECE), an open industry member organization that operates UltraViolet. UltraViolet is an online library where consumers can store their rights to media content, after they have acquired the media content through a retailer. Among DECE's members are the major Hollywood studios.

CFF-TT is based on SMPTE-TT and is part of DECE's Common File Format (CFF) specification. It defines one profile for text and one for the SMPTE-TT extension that allows for the carriage of images. Compared to SMPTE-TT, the CFF-TT profile for text defines some constraints and adds two options.

The first option is the ability to signal that a document can be decoded progressively and the second one is the possibility to force certain subtitles to be shown (e.g., to communicate the text on a sign or to translate occasional dialog in foreign languages).

  • Specification of CFF

IMSC1

Then finally, to bring some sense to this circus of profiles, there is the relatively recent specification of IMSC1 [4], which is a W3C recommendation and the only TTML profile that is part of the CMAF specification, a hot topic right now. The goal of IMSC1 is to increase the interoperability of TTML subtitles by incorporating support for all of the profiles discussed above (more or less).

To achieve this, IMSC1 defines some practical constraints on SMPTE-TT, the broadest TTML profile, and supplements it with a few extensions from EBU-TT-D, CFF-TT as well as some new ones. These constraints and extensions are intended to reflect industry practice (such as UTF-8 character encoding).

Like CFF-TT, IMSC1 actually defines two profiles: one for text and one for images. The one for images has the specific purpose of supporting the SMPTE-TT feature to carry captions as an image, while the IMSC Text profile is a superset of SDP-US, EBU-TT-D and (almost) the text profile of CFF-TT and the text part of SMPTE-TT.

In other words: a player that supports IMSC1 should, in almost all cases, be able to play out subtitles regardless of whether they follow the SMPTE-TT, EBU-TT-D, SDP-US or a CFF-TT profile. This is true the other way around as well: it's possible to follow the IMSC Text profile while also following one or more of the aforementioned profiles.

More specific information about interoperability between IMSC1 and the other TTML profiles is part of the IMSC1 specification.

Support by USP: Unified Packager can package TTML subtitles that follow any of the profiles discussed above into (fragmented) MP4 [5] and Unified Origin supports all of those profiles as well, for playout in Microsoft Smooth Streaming, Apple HLS and MPEG-DASH, for Live and On Demand.


From 'Monty Python and the Holy Grail'

WebVTT

Fortunately, WebVTT is a less complicated format; not only because it uses a straightforward syntax, but also because there is no wide variety of standardized implementations that need to be supported. The clear purpose of WebVTT is a major reason for this: it is not meant to be transformed into all kinds of other formats, but designed with the sole intention of delivering video text tracks through the web.

WebVTT stands for Web Video Text Tracks. It doesn't use a markup language like XML, but consists of plain text (although it can reference CSS for styling). The origin of plain-text subtitles is very different from that of TTML: instead of starting out as a specification from an official body like the W3C, WebVTT has its roots in the online subtitling community's format of choice: SubRip Text (SRT).

The basic syntax of WebVTT is almost identical to SRT, save for some details (such as being constrained to UTF-8 character encoding, like IMSC1). Compared to SRT, WebVTT adds more styling options and use cases that go beyond subtitles.
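
To give an impression of how small the syntactic differences are, the Python sketch below turns a simple SRT document into WebVTT. It relies on two differences only: a WebVTT file starts with a WEBVTT header line, and its timestamps separate seconds from milliseconds with a full stop where SRT uses a comma. The helper name srt_to_webvtt is hypothetical, and the sketch assumes well-formed SRT input without styling tags; it is not how Unified Packager performs its conversions.

```python
import re

# SRT timing lines look like "00:00:00,000 --> 00:00:02,000".
TIMING = re.compile(r"^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text):
    """Convert plain SRT text to WebVTT by adding the header and rewriting timestamps."""
    # Drop a byte order mark if present and normalize line endings.
    lines = srt_text.lstrip("\ufeff").replace("\r\n", "\n").split("\n")
    # Only timing lines match the pattern; all other lines pass through unchanged.
    converted = [TIMING.sub(r"\1.\2 --> \3.\4", line) for line in lines]
    # The numeric SRT counters are kept, as WebVTT accepts them as cue identifiers.
    return "WEBVTT\n\n" + "\n".join(converted).strip("\n") + "\n"


SRT_SAMPLE = """1
00:00:00,000 --> 00:00:02,000
This is a subtitle
on two lines
"""

print(srt_to_webvtt(SRT_SAMPLE))
```

The output must be saved as UTF-8, which means that SRT files in a legacy encoding need to be transcoded first.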

The structure of a WebVTT file is as follows:

  • Header: A WebVTT file begins with a header that includes the metadata that applies to the document as a whole.
  • Cues: The main content of a WebVTT file is a sequence of cues. A cue is one or more lines of text with an associated time interval. Styles may be applied to the cue text using a simple markup syntax. Portions of the cue text may also be designated to appear at particular times in order to achieve paint-on style animations (e.g., for karaoke).
  • Regions: A cue may be associated with a region, which specifies the boundaries of the box within which the text is rendered on screen.

In WebVTT, the simple example with TTML subtitles from earlier in this blog post looks as follows (this example contains a cue only and doesn't include a header or a reference to a region):

00:00:00.000 --> 00:00:02.000
This is a subtitle
on two lines
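
For a player to accept the cue above, it needs to be part of a file that starts with the WEBVTT header line. The Python sketch below wraps the cue in such a minimal file and reads it back into start time, end time and text lines. It is a simplified reader written for this post, not a conforming WebVTT parser: the names are invented, cue settings and inline markup are ignored, and blocks without a timing line (such as notes, styles and regions) are skipped.

```python
import re

VTT_SAMPLE = """WEBVTT

00:00:00.000 --> 00:00:02.000
This is a subtitle
on two lines
"""


def timestamp_to_seconds(value):
    """Convert hh:mm:ss.ttt or mm:ss.ttt to seconds."""
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_cues(vtt_text):
    """Return (start, end, text_lines) for each cue in a WebVTT document."""
    # Blocks are separated by one or more blank lines.
    blocks = re.split(r"\n{2,}", vtt_text.replace("\r\n", "\n").strip())
    if not blocks[0].startswith("WEBVTT"):
        raise ValueError("not a WebVTT file: missing WEBVTT header")
    cues = []
    for block in blocks[1:]:
        lines = block.split("\n")
        # The timing line is the first line or, when a cue identifier is present, the second.
        timing_index = next((i for i, line in enumerate(lines[:2]) if "-->" in line), None)
        if timing_index is None:
            continue  # no timing line: a NOTE, STYLE or REGION block
        start, _, rest = lines[timing_index].partition("-->")
        end = rest.split()[0]  # anything after the end time is a cue setting
        cues.append((timestamp_to_seconds(start.strip()),
                     timestamp_to_seconds(end),
                     lines[timing_index + 1:]))
    return cues


parsed = parse_cues(VTT_SAMPLE)
print(parsed)
```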

Of course, the world of WebVTT is far from perfect. Like TTML, the specification isn't very strict, which makes it challenging to implement in such a way that all documents following the specification will be supported.

A last point of differentiation between TTML and WebVTT is HTML5 integration, which the latter does much better. WebVTT enjoys native support in all major browsers, if not for all of its features. A basic overview showing which WebVTT features are supported in Chrome, Opera, Firefox, Internet Explorer and Safari can be found here (although it is unclear how up-to-date it is).

Like IMSC1, WebVTT is part of the CMAF specification.

Support by USP: Unified Packager can package WebVTT subtitles into (fragmented) MP4 [6] and Unified Origin supports playout of WebVTT in Apple HLS and MPEG-DASH, for Live and On Demand. In addition to serving it as part of the stream, Origin will make the WebVTT file available as a sidecar file. Also, Unified Packager can convert SRT and WebVTT to TTML.

Embedded

The embedded formats are inherited from broadcast TV and the examples referred to in the list at the start of this overview have their origin in the North American markets. CEA-608 was developed for analog TV, while CEA-708 is a more feature-rich equivalent for digital broadcast TV.

Although it may seem counter-intuitive, it's the more recent variety, CEA-708, that's hardly encountered within the HTTP video streaming landscape. Also, use of embedded subtitles is supported in the CMAF specification.

Support by USP: When present, Unified Origin passes on and can properly signal CEA-608 as well as CEA-708 for On Demand and Live, although this functionality is only available for Apple HLS. Unified Packager does not support embedded subtitles (i.e., it cannot convert the embedded CEA-608 and CEA-708 text tracks into another format).

Burned-in

Finally, for the sake of completeness, there are burned-in subtitles, or, as they are often referred to: 'open captions'. As opposed to the 'closed' variety, 'open captions' can't be turned off. With burned-in subtitles, the reason for this is straightforward: they are part of – or 'burned into' – the image data of the video itself.

Do note that if you want to include 'open captions' in a stream, 'burning' them into the video isn't necessary. Using the 'forced display' feature of IMSC1, it's possible to achieve the same behavior if the player supports it. [7]
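
To make the 'forced display' feature a bit more tangible: in IMSC1 it is signaled with the itts:forcedDisplay attribute from the IMSC1 styling namespace. The Python sketch below lists the paragraphs of a hand-written sample document that carry this attribute. The sample document and the function name are assumptions made for this example, and the sketch only looks at attributes set directly on a paragraph; values inherited from parent elements or set through referenced styles are not resolved.

```python
import xml.etree.ElementTree as ET

TTML_NS = "http://www.w3.org/ns/ttml"
ITTS_NS = "http://www.w3.org/ns/ttml/profile/imsc1#styling"  # IMSC1 styling extensions

# Hand-written sample: only the second paragraph is marked as forced.
SAMPLE = """<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:itts="http://www.w3.org/ns/ttml/profile/imsc1#styling" xml:lang="en">
  <body>
    <div>
      <p begin="00:00:00.000" end="00:00:02.000">Regular dialog</p>
      <p begin="00:00:03.000" end="00:00:05.000" itts:forcedDisplay="true">Text on a sign</p>
    </div>
  </body>
</tt>"""


def forced_paragraphs(document):
    """Return the text of every <p> that sets itts:forcedDisplay="true" on itself."""
    root = ET.fromstring(document)
    forced = []
    for p in root.iter(f"{{{TTML_NS}}}p"):
        # Simplification: inheritance and style references are not taken into account.
        if p.get(f"{{{ITTS_NS}}}forcedDisplay") == "true":
            forced.append("".join(p.itertext()).strip())
    return forced


forced = forced_paragraphs(SAMPLE)
print(forced)
```

Whether forced content is actually shown while subtitles are switched off still depends on the player, so this kind of check is mostly useful for inspecting your source files.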

Support by USP: Burned-in subtitles are not represented in a separate 'text track' and therefore don't require specific support for playout. If the video track can be played, the subtitles will be played as well.

Conclusion

Supposing you read through all of the above, you may have more questions now than you started out with. If this is so, remember: knowledge often translates into knowing what you don't know ;-). Nevertheless, there are some recommendations that you can take into account when you want to add subtitles to your video streaming setup:

  • In a live broadcast environment TTML usually is the subtitling format of choice, as it is supported as an output format by most encoders.
  • When preparing content that's intended for use on the web only, WebVTT's deeper integration with HTML5 probably makes it better suited than TTML.
  • When using TTML, doing so according to the IMSC1 specification will ensure a higher degree of future compatibility as IMSC1 is part of the CMAF specification while other implementations of TTML are not.
  • As TTML and WebVTT both allow for quite elaborate styling and peculiar syntaxes, please keep in mind that plain and simple tends to equal broad support.
  • Above all: check and test what your players are able to play out. If a format or a certain profile is supported, that doesn't guarantee that all of the features of that format or profile will work.

Footnotes

[1] Text tracks can have a purpose beyond captions and subtitles. They can carry metadata (e.g., for search engine optimization), provide chapter indications and other kinds of navigational features, as well as subtitling for karaoke. However, as text track formats are the focus of this text, it will not cover any of these additional use cases.

[2] The first version's second edition currently is the most recent TTML specification that is recommended by W3C. However, a second version of TTML is being worked on. This new version will be a superset of the first version and thus allow for backwards compatibility.

[3] As of yet, USP's support for SMPTE-TT's feature to carry binary data is untested.

[4] When TTML2 is finalized, a second version of IMSC will be presented as well. Like TTML2, IMSC2 will be a superset of its first version.

[5] In general, Unified Packager packages TTML into (fragmented) MP4 according to the ISO 14496-30 specification. This does not apply when packaging SMPTE-TT with bitmaps, nor when packaging TTML specifically for HTTP Smooth Streaming. In both cases a similar but incompatible way is used. Unified Origin supports all of these approaches.

[6] As with TTML, Unified Packager packages WebVTT into (fragmented) MP4 according to the ISO 14496-30 specification.

[7] Apple's HTTP Live Streaming (HLS) supports the use of a 'FORCED' option on a subtitles track, but setting this option is not supported by USP.