Karaoke Feature Explainer

TTML Karaoke Extension Explained

Last update: [2019-05-01]

What’s all this then?

A new feature is being developed to enable representing Karaoke or sing-along text content in TTML.

Karaoke and sing-along are well-known applications: song lyrics are displayed on top of a corresponding video clip, with timed emphasis on words, syllables or characters to show the viewer which words, syllables or characters have been sung, are being sung or will be sung. Today, on the web, karaoke text is typically burned into the video, which makes the text inaccessible and not controllable by users or applications. Examples of karaoke can be found on YouTube, where the text is burned into the video: Moana, or Frozen.

This proposal focuses on describing how to associate timing with text content in order to represent karaoke content. The exact presentation of the text is left implementation-specific. Style properties may be defined in the future.

Goals

Enable karaoke authors to represent and exchange karaoke content (i.e. text and timed changes) in text files
Enable the delivery of karaoke content as separate video and text files
Enable web applications to control the display of karaoke content
Enable end users to adjust the presentation of karaoke content based on their preferences

Key scenarios

Text Highlighting

Color Highlighting

In one scenario, it should be possible to highlight a word (or part of a word, or several words) with a fixed color (possibly with transparency) for a period of time, and to move that highlight to consecutive words at a pace decided by the author, to show the progression of the karaoke.

Image Highlighting

In another scenario, it should be possible to position a mark above or below the word (or perhaps to the left or right in vertical writing mode) to indicate the current word (or words, or part of a word) and to move this mark to other words at author-controlled times to show the progression. The mark could be a dot, an image or anything else. This should be done without requiring the author to compute the exact position. The mark could also complement the color highlighting. Multiple types of highlights can be applied.

Continuous Highlight Transitions

In additional scenarios, it should be possible to move the highlight (whether a color, a mark or something else) continuously within or between words (or parts of words). In particular, when a mark is used, it should be possible to transition it from offscreen to the first word, from the last word to offscreen, and between words, including from the last word of one event to the first word of the next event. These continuous changes should be possible without the author having to compute the positions of words. Transitions should be able to span text event boundaries.

Implementation-specific Highlighting

In all scenarios, it should be possible for a user to set their preferred highlight (text color, background color, text emphasis, image mark, text effects) or for an implementation to apply default styling.

Detailed design discussion

In TTML, timing information can be specified on different content elements (e.g. p, span). In this case, the timing attributes control whether or not the element and its content participate in the layout for the given time interval. Animation elements can be used to describe changes (e.g. stylistic changes) while the element is presented.

Approach 1: duplicate p elements

In the example below, the text content is presented only during a time interval.

<p begin="10s" end="20s">This is one line of a song lyrics</p>

One way to solve the color highlight scenario above is the following:

<p begin="10s" end="12s">
<span style="style1">This is</span> one line of a song lyrics
</p>
<p begin="12s" end="15s">
<span style="style1">This is one line of</span> a song lyrics
</p>
<p begin="15s" end="20s">
<span style="style1">This is one line of a song lyrics</span>
</p>

This approach has the following advantages:

the styles can be shared between all events: for example, style1 could be tts:color="yellow".

but has the following drawbacks:

the text content is duplicated
if some intermediate times are contiguous, they are repeated (e.g. time 12s is used twice)
the highlight changes can only be discrete
the exact styles for highlights have to be explicit and cannot be left implementation-specific or user-preferred.
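The duplication drawback can be made concrete with a minimal Python sketch that generates the Approach 1 markup from a list of timed segments. The function name `duplicated_paragraphs` and its parameters are hypothetical and only serve this illustration; times are assumed to be plain seconds.

```python
from xml.sax.saxutils import escape


def duplicated_paragraphs(segments, end, style="style1"):
    """Build one <p> per highlight step (Approach 1).

    segments: ordered list of (begin_seconds, text) pairs (hypothetical input shape).
    end: end time in seconds of the last step.
    """
    texts = [text for _, text in segments]
    begins = [begin for begin, _ in segments]
    # Each step ends where the next one begins; the last step ends at `end`.
    ends = begins[1:] + [end]
    paragraphs = []
    for i, (b, e) in enumerate(zip(begins, ends)):
        sung = escape(" ".join(texts[: i + 1]))   # highlighted so far
        rest = escape(" ".join(texts[i + 1:]))    # not yet highlighted
        body = f'<span style="{style}">{sung}</span>'
        if rest:
            body += " " + rest
        paragraphs.append(f'<p begin="{b:g}s" end="{e:g}s">\n{body}\n</p>')
    return "\n".join(paragraphs)


if __name__ == "__main__":
    print(duplicated_paragraphs(
        [(10, "This is"), (12, "one line of"), (15, "a song lyrics")], end=20))
```

For the three segments of the example, the full lyric line is emitted three times and each intermediate time appears twice, which is the repetition noted in the drawbacks.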

Approach 2: use of animation elements

Another way to solve the color highlight scenario above is the following:

<p begin="10s" end="20s">
<span><set begin="0s" tts:color="yellow"/>This is</span>
<span><set begin="2s" tts:color="yellow"/>one line of</span>
<span><set begin="5s" tts:color="yellow"/>a song lyrics</span>
</p>

This approach has the following advantages:

more compact (no text repeated)

but has the following drawbacks:

animations today do not allow changing the style attribute. One has to animate all styles separately by using multiple attributes on the set element.
the exact styles to be animated have to be explicit and cannot be left implementation-specific. There are no semantics indicating that the highlight is a karaoke one, and that it can be replaced by an implementation or even deactivated by the user.
transitioning a mark from the last word of an event to the first word of the next event cannot (at least not easily) be represented, because the position of the last word of the previous event is a) not explicit (it depends on layout) and b) gone by the time the ISD for the next event is processed (some state would have to be preserved across ISDs)
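In the Approach 2 example, the begin values on the set elements are offsets rather than absolute times. The Python sketch below resolves them to media times, assuming the default parallel time containment, in which a child's begin is relative to its parent's begin, and assuming times written as plain seconds. The helper names are hypothetical, and the tts namespace declaration is added only so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

TTS = "http://www.w3.org/ns/ttml#styling"

# The Approach 2 fragment, with a namespace declaration added for parsing.
SAMPLE = f"""<p xmlns:tts="{TTS}" begin="10s" end="20s">
<span><set begin="0s" tts:color="yellow"/>This is</span>
<span><set begin="2s" tts:color="yellow"/>one line of</span>
<span><set begin="5s" tts:color="yellow"/>a song lyrics</span>
</p>"""


def seconds(value):
    """Parse a time of the form '<number>s'; other TTML time syntaxes are not handled."""
    if not value.endswith("s"):
        raise ValueError(f"unsupported time expression: {value!r}")
    return float(value[:-1])


def color_changes(p_xml):
    """Return (absolute_time, color, text) for each set element under the p."""
    p = ET.fromstring(p_xml)
    p_begin = seconds(p.get("begin", "0s"))
    changes = []
    for span in p.iter("span"):
        # Assumption: a span without a begin attribute starts with its parent.
        span_begin = p_begin + seconds(span.get("begin", "0s"))
        for set_el in span.findall("set"):
            when = span_begin + seconds(set_el.get("begin", "0s"))
            color = set_el.get(f"{{{TTS}}}color")
            text = (set_el.tail or "").strip()
            changes.append((when, color, text))
    return changes


if __name__ == "__main__":
    for change in color_changes(SAMPLE):
        print(change)
```

Under these assumptions the color changes resolve to 10s, 12s and 15s, the same instants used in Approach 1.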

Approach 3: use of semantic markers

The following approach is based on the use of a new element (but possibly a metadata element could be used).

<p begin="10s" end="20s">
<marker time="10s" type="karaoke"/>
This is <marker time="12s" type="karaoke"/>
one line of <marker time="15s" type="karaoke"/>
a song lyrics</p>

Different types of transitions can be represented by different type values, whose semantics will be defined.
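As an illustration of how an implementation might consume such markers, the Python sketch below parses the fragment above and splits the line into already-highlighted and upcoming text at a given time. It assumes one possible semantics that this proposal has not defined: each marker applies to the character content that follows it, up to the next marker. The function names are hypothetical, and only times written as plain seconds are handled.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<p begin="10s" end="20s">
<marker time="10s" type="karaoke"/>
This is <marker time="12s" type="karaoke"/>
one line of <marker time="15s" type="karaoke"/>
a song lyrics</p>"""


def seconds(value):
    """Parse a time of the form '<number>s'; other TTML time syntaxes are not handled."""
    if not value.endswith("s"):
        raise ValueError(f"unsupported time expression: {value!r}")
    return float(value[:-1])


def marker_runs(p_xml):
    """Return (time, text) pairs: each marker with the text that follows it."""
    p = ET.fromstring(p_xml)
    runs = []
    for marker in p.iter("marker"):
        # ElementTree stores the text following an element in its .tail;
        # whitespace is collapsed for readability.
        text = " ".join((marker.tail or "").split())
        runs.append((seconds(marker.get("time")), text))
    return runs


def split_at(runs, t):
    """Split the line into (highlighted, upcoming) text at media time t."""
    highlighted = [text for start, text in runs if start <= t]
    upcoming = [text for start, text in runs if start > t]
    return " ".join(highlighted), " ".join(upcoming)


if __name__ == "__main__":
    print(split_at(marker_runs(SAMPLE), 13.0))
```

Because the split is derived only from the marker times and the current time, no styling is involved, and an implementation or user remains free to choose how the highlighted part is rendered.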

This approach has the following advantages:

The text content is not repeated
Semantic information is conveyed. Styles are not associated explicitly, but could be if needed. Users can override or disable styles.
Implementations not understanding this element can simply ignore it.

but has the following drawbacks:

it assumes that each marker type defines the content with which it interacts (e.g. all character content until or from the previous marker)
could be verbose depending on how many markers are needed. For example, if a precise control of the transition of the mark compared to the highlight is needed, it could look like this:

<p begin="10s" end="20s">
<marker type="karaoke-start" time="10s"/>
<marker type="karaoke-fromoffscreen-start" time="10s"/>
<marker type="karaoke-fromoffscreen-end" time="11s"/>
<marker type="karaoke-highlight" time="11s"/>
This is
<marker type="karaoke-next-start" time="11.5s"/>
<marker type="karaoke-highlight" time="12s"/>
<marker type="karaoke-next-end" time="12.1s"/>
one line of
<marker type="karaoke-next-start" time="14.5s"/>
<marker type="karaoke-highlight" time="16s"/>
<marker type="karaoke-next-end" time="16.1s"/>
a song lyrics
<marker type="karaoke-next-start" time="19.5s"/>
</p>
<p begin="20s" end="30s">
<marker type="karaoke-next-end" time="20.1s"/>
...

In the above approach, the "karaoke-start" marker would not work if the user seeks into the middle of the karaoke.

Approach 4: transitions

This approach proposes a model inspired by CSS Transitions, where transitions can be assigned to paragraphs and/or spans.

<div ttp:transition-mode="karaoke">
<p begin="10s" end="20s" ttp:transition-type="fromoffscreen" ttp:transition-begin="10s" ttp:transition-dur="1s">
<span>This is </span>
<span ttp:transition-type="jumptonextword" ttp:transition-begin="11.5s" ttp:transition-dur="0.6s">been</span>
<span ttp:transition-type="jumptonextword" ttp:transition-begin="14.5s" ttp:transition-dur="0.6s">one line of </span>
<span ttp:transition-type="jumptonextword" ttp:transition-begin="19.5s" ttp:transition-dur="0.6s">a song lyrics</span>
</p>
</div>

In the above, only one transition can be assigned per element, which could be problematic. In the variant below, a transition element is introduced (like the animation/set elements):

<div ttp:transition-mode="karaoke">
<p begin="10s" end="20s">
<transition type="fromoffscreen" begin="10s" dur="1s"/>
<span><transition type="nextword" begin="11.5s" dur="0.6s"/>This is </span>
<span><transition type="nextword" begin="14.5s" dur="0.6s"/>one line of </span>
<span><transition type="nextword" begin="19.5s" dur="0.6s"/>a song lyrics</span>
</p>
<p begin="20s" end="30s">
<transition type="tooffscreen" begin="29s" dur="1s"/>
<span><transition type="nextword" begin="22s" dur="0.6s"/>This is </span>
<span><transition type="nextword" begin="25s" dur="0.6s"/>one line of </span>
<span>a song lyrics</span>
</p>
</div>
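The requirement that transitions can span text event boundaries shows up in the first paragraph of this variant. The Python sketch below lists the transitions whose end falls after the end of their paragraph. It assumes that begin values on transition elements are absolute media times (as the sample values suggest, although the proposal does not say so) and that times are written as plain seconds. The function names are hypothetical.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

# First paragraph of the transition-element variant.
SAMPLE = """<p begin="10s" end="20s">
<transition type="fromoffscreen" begin="10s" dur="1s"/>
<span><transition type="nextword" begin="11.5s" dur="0.6s"/>This is </span>
<span><transition type="nextword" begin="14.5s" dur="0.6s"/>one line of </span>
<span><transition type="nextword" begin="19.5s" dur="0.6s"/>a song lyrics</span>
</p>"""


def seconds(value):
    """Parse '<number>s' exactly; other TTML time syntaxes are not handled."""
    if not value.endswith("s"):
        raise ValueError(f"unsupported time expression: {value!r}")
    return Decimal(value[:-1])


def overrunning_transitions(p_xml):
    """Return (type, end_time) for transitions that end after their paragraph."""
    p = ET.fromstring(p_xml)
    p_end = seconds(p.get("end"))
    result = []
    for transition in p.iter("transition"):
        # Assumption: begin is an absolute media time, not an offset.
        end = seconds(transition.get("begin")) + seconds(transition.get("dur"))
        if end > p_end:
            result.append((transition.get("type"), end))
    return result


if __name__ == "__main__":
    print(overrunning_transitions(SAMPLE))
```

Here the last transition begins at 19.5s and lasts 0.6s, so it ends at 20.1s, after the paragraph has ended at 20s. An implementation would therefore need to carry the transition over into the next paragraph.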

References

Proposed Highlight API

Acknowledgements

The members of the TTWG, Tess O'Connor.
