HTML5 video accessibility and the WebVTT file format

Black: So welcome all.
Thank you for coming. My name is Naomi Black, and I’m a member of the
Accessibility Engineering Group here at Google. And today,
I’m very pleased to invite our speaker,
Dr. Silvia Pfeiffer. Silvia is a member
and an invited expert on the W3C HTML
Accessibility Task Force. And she’s also
the author of “The Definitive Guide
to HTML5 Video.” So Silvia’s here
to talk to us– come on up. Silvia’s here to talk to us
today about WebVTT, which is one of the standards
for timed text which is under consideration
by the W3C. Thank you, Silvia.
Pfeiffer: Thank you. Thanks for inviting me
to come and speak about this
important topic today. We know there are a lot
of discussions going on about formats
for captions, and we want to standardize them
for the web. But standardizing it
for the web has a much larger
impact these days than just on web browsers, just on the web itself. It goes into many
different devices. So we’re very interested
and very keen to give a broad coverage
of available technology, and this is what we’re trying
to do here today. So I’ll be talking mostly about
WebVTT, the file format. But I’ll also be talking
a little bit about how to plug that into
the web browser, into HTML, so that in the future we have a very simple way
of displaying captions in web browsers on videos without having to do
much more than authoring a file and giving a link to
the web browser for that file. So it will be very simple
for people in the future to create more captions. All right, let’s dig right in. As we were looking
at requirements of such a text format, a web text format
for video, we looked at the different
types of content that can be time-aligned
with video. And captions and subtitles
are the obvious ones, but text video descriptions
are also an important use. These are for blind users and can be read out
by screen readers in parallel to the playback
of the video. This may well not be
the most usable way of doing
audio descriptions, but it is a much easier
way to publish audio descriptions
for blind users. And, in fact,
for a lot of blind users, it may well be
all they need, because they already have
their screen readers set up, and it works really well
for some people. Further to that, we’re also talking about
navigation or chapters, which is also
very important for blind and,
in fact, any user. If you want to go through
a video quickly and find out
what’s in there, you want to jump to what we now
know as chapter markers. We can call them
navigation markers. This can be also covered
with the same kind of format. And more generally,
metadata. This is something
that archives are particularly
interested in, to attach metadata
to sections in the video. It can also be done with such
a time-aligned text format. So what we have
discussed for browsers is a very simple format. It’s called WebVTT,
Video Text Tracks. WebVTT. This is one
of the very simple files that we can think about. Just a marker
at the beginning of the file that identifies
the file format. The captions or subtitles–
let’s call them cues– then have an individual
identifier. In this case, it’s
the number one and number two. Could be any string, however. And then we’ve got start times
and end times on each one of these cues, and a piece of text
in there. It turns out on screen– as we all know how captions
are displayed– as something like this if it’s automatically
created by the browser. That was the very simplest way
of doing subtitles. Now, we want to do more
than just the simple captions. In particular,
if we want to achieve all the functionality of, for example,
the CEA-608 captions, then we need to do a bit more
than just text. We also want to have
some formatting in there. Here is an example
on how to do bold. I’ll point to it. There’s a bold tag in here,
so that will be bold text. Here is some italic text. And up here
we’ve got a general way to associate style or a class
to a piece of text and give it a meaning. In this situation, we’ve turned a piece of text
into red text and capitalized it. Of course, if we’re using
this format also for subtitles, we need to be careful to cover
internationalization issues. WebVTT is very clear here. It requires UTF-8
character encoding. It has a ruby tag, which supports
Asian languages in particular. It also does vertical and
horizontal rendering of text. Again, possibly the most
important case is Asian languages, and I think there are
a few other languages that are also
rendered vertically. And we need to make sure
we get the alignment right. Sometimes text is read
from the right to the left, so therefore it needs
to be aligned on the right rather than on the left. Now, positioning
is another requirement and, again, something
that traditional captions, TV captions,
are able to do. It’s possible to position cues
anywhere in WebVTT. There are basically
three important ways to position text. There are line positions. So the concept of display lines
exists in WebVTT. So the line position
allows people to directly address
a specific line. It can be done with
a line number or a percentage. Then we have
the text position. This means we’re placing
the text either on the left, in the middle,
or on the right. No, hold on.
That’s the alignment, sorry. Alignment is
left, middle, and right. And the text position is… so when we have text
like this, it’s in the middle,
and it’s centered. So we can also do a centering, and we can do
a left alignment and a right alignment. But we can also move
that whole text elsewhere. So the text position
is where we move the text and the alignment is
where we align it at. Sorry for the confusion. We also have speaker semantics
included into WebVTT, which is interesting
because it allows us to put some semantic information
into our markup. Here, for example,
we have two people speaking. We know their position
on the left and on the right. And the speaker markup can tell us where we want
to position it and can also, for example,
help us always use the same styling
for the same speaker. So, for example, we want
to use the same font, the same font color, maybe a specific outline
or something for a speaker. We can define that and then apply that always
for that speaker. Now, so much for captions. Now we move on
to a little example on text descriptions. Here is one
that I’ve used previously, and we’ve got that
as an example on the site. I’m not gonna go there; I just want to mention it, because we want to focus
on captions today. But what happens here is,
we’ve got text that’s aligned with a start and end time
as well. And for a typical word rate
of a screen reader, it will fit into that space, and it will be read back
by the screen reader during that time. And here is
the navigation example. As I mentioned, WebVTT can also
be used for navigation. Here we have three chapters, and we can directly jump
from chapter to chapter. There need to be extra controls
on videos to support this, but this is something
we’re working towards. Now, of course,
as I’m saying, controls and input
into web pages and automatic rendering, we need to know
how we’re going to do that. And there is markup
in HTML5 for associating captions
and formats like this with videos. In this example,
I’ve got all of the VTT files that we’ve used before. I’ve included them here. And what we’re using for it
is called a track element, and this track element
is included underneath the video element
in the HTML5 markup. It links through to
the VTT file. And there is some description
possible for the type of file it is,
so we have a label. In this case,
it’s an English caption. We have a kind,
which gives us a means to group all the tracks
of the same type together. And we identify
the language. Because, of course, when we have
user settings in browsers, we want to automatically
make certain tracks available to the user if the user has,
for example, said that they always
want captions or they always want subtitles
in their language being shown. So the browser can
look through this markup and identify which ones
it has to turn on by default. Now, in this case,
I’ve used only WebVTT files. The track layout, the way that
we’ve defined track in HTML5, is actually generic. It can be used for other types
of files as well, TTML or SRT
or any other formats that will be implemented. But the generic way that track
works is in this way. Now, once we’ve got it
in the browser, we can actually support more
than what is directly possible as markup
in the WebVTT file, because now we’ve got the text
in the browser, and we can make use of all the functionality
of the browser, which has styling and the cascading style sheet
functionality available to it. So this kind of styling
is also available if used in a browser,
to these cues. And the way in which
this is being done is that there’s
a pseudo-element in CSS called ::cue. And with that pseudo-element, you can address,
for example, classes in the cue markup. And you can override
the formatting that by default
would be given. You can, for example– well, in this case,
it’s been turned red, uppercase,
a different font family, and a lighter weight. Now, we’ve spoken a lot. We want to see a little bit
of a demo here, and I’ve made a little bit
of a demo which shows that we can
do more than what’s typically being used for captions
right now. Most captions that are
being used are pop-on captions, which are captions
that don’t overlap in time. There’s one piece of caption, one cue shown; it disappears, and the next cue
is brought up. That’s pop-on. And that is the default way
of rendering it. But we may have
a very different style of providing captions
as well, which has traditionally
been used mostly in live captioning. It’s called roll-up. So the cues will actually
be added at the bottom and roll up as the– and the old ones
will disappear. So I’ve made a little example that shows how that
can be done as well. Let’s hope this works.
Man: I heard about
this Arduino Project, and I saw it online,
and I said, “Wow, a lot of people are starting
to talk about this. I should check it out.”
Second Man: ‘Cause we wanted
to make a tool for our students
that was more modern than what was available
on the market at the moment.
Third Man: For me,
it was a case that this is a tool
that I could see using myself, and therefore
I could believe in actually helping to get it
out to a wider world. Fourth Man:
[speaking foreign language] Pfeiffer: So you could see that the captions
were being pushed up as they were being displayed. This is a very simple way of doing
this kind of roll-up. So as we’re moving on, the next caption
gets added to the bottom. This can be improved. This is just
a very crude demo. But this can be improved with a bit more CSS
in the browser. We can, for example, transition
the text more slowly, and then it would be
more readable, rather than it
jumping there directly. There’s a whole swag
of CSS functionality available to us
in the browser to make this look very nice. And the functionality is there
and possible to be used. I’ve mentioned pop-on
and roll-up captions. I want to briefly also mention
paint-on captions, even though that’s a bit more
of an exotic use case. But it’s possible to be used
in CEA-608 captions, so we need to make sure
that it’s also possible to be represented in WebVTT. And what we’ve introduced
for this kind of application is cue timestamps. These cue timestamps
are basically just a timestamp that is being included
into the text and says when the text
that comes afterwards will be activated. Here I’ve done it– [coughs]
pardon me– at the word level. I’ve put cue timestamps in
for every word, so every word would come up
one after the other on the screen. However, the resolution
is arbitrary. We could do that
on every character if necessary. Interestingly enough, that can also be used
for styling through CSS. There are the past and future
pseudo-selectors. And these selectors allow us,
for example, to do something like
paint the old text in yellow, the new text in white, with a text shadow, and as it goes over,
everything goes yellow. And we know this kind
of application, obviously, from karaoke, which bridges into
these applications as well, into more modern
time-aligned text applications. We can cover
all of these use cases with the same approach. So that brings me
to the end of the presentation. We regard WebVTT as a bridge
between broadcast and the web of the future. We can support
all of the CEA-608 captions, all of the features, possibly also
some of the 708 features– I think most of them;
I haven’t analyzed in detail. But most of the 708 features
will be supported as well. It’s the simplicity
of editing that we like
about the WebVTT format. It’s readable.
You can read it here on screen. There’s not too much busyness
as you’re looking at it. And that means also
that it’s easy to edit and to create. We have the ability
to apply web styling through the track mechanism that has been included
into HTML5. And this is an open
and freely available format. If you’re looking
for references, I’ve put the references
on this last slide to all the specs. They’re available for free. Thank you very much. [applause]
Black: So thank you, Silvia. We have a mic up if people
want to ask any questions. Maybe you could introduce
yourself briefly. We’re recording this, and we’re gonna be
posting it to YouTube, so hopefully
you won’t be shy. But please, if you have
any questions for Silvia about WebVTT,
please step up to the mic.
Pfeiffer: I was going
very quickly, so if somebody wants
to go back and explore any of the features
in more detail, this is probably
the opportunity.
Steinberg: Hi,
I’m Daniel Steinberg at Google. A couple questions. When you had the line number
specification, how does that apply
when you have vertical text?
Pfeiffer: Let me just find that. Here, this one?
Steinberg: Yeah.
Pfeiffer: So the line numbers
are basically done in– for horizontal text, obviously from the top
to the bottom. And for vertical text,
they are turned around. So they apply in the same way. Steinberg: Okay, and you have
a lot of the underpinnings for interactive text,
but not the actual– I didn’t see anything
actually there. So the ability to say
this particular tag is live, and clicking on it might give
a link or something like that. Have you considered
interactivity?
Pfeiffer: Interactivity– are you particularly talking
about hyperlinks in this case?
Steinberg: In this case, yes.
Pfeiffer: Yes, so hyperlinks are something of
a controversial issue. We’ve discussed this. It is obviously something
that can easily be added. We’ve got the markup
in HTML5. We could easily put an “a” tag
in there and a hyperlink. It’s not something
that’s currently part of the specification simply
because people don’t believe that it’s a very good
experience. When you’re watching captions,
they stay on the screen only for a very short
amount of time. By the time you’ve decided
that you want to follow a link, it’s already gone. That’s the reason. So I’m obviously not fully
subscribed to that reason. I would actually like to have
hyperlinks in it as well. What I look at in this format
is that it’s easy to extend it, and if somebody
was to support it, then it is not a problem
to put that in as well.
Steinberg: Yeah, something
to consider with the hyperlinks is that, because
you have the ability to author
different kinds of files, you could have a description
that had a longer time and had a longer duration. It wouldn’t
have to disappear when a caption line
disappeared.
Pfeiffer: In fact,
I should actually add something. We’ve got, as I mentioned,
we’ve got kinds of text that we’re expecting
in a WebVTT file. I’ve talked about
captions and subtitles, I’ve talked about
text descriptions, and I’ve talked about
navigation. I’ve only mentioned metadata, but metadata actually
solves that problem. Metadata means that you’re allowed to put anything
into a cue: any markup you want, any non-markup you want, any text, anything at all. It just means that the browser
can’t do anything with it. It decides–it sees that
it’s of the kind “metadata” and goes,
“I’m being hands-off. I’m just gonna hand it on
to the JavaScript, and the JavaScript can do
with it whatever it likes.” So this would be one way
to have interactivity in it. You could just
grab it through JavaScript and then put it into a div
on your page, and then there would be
a hyperlink. And anything else
that you can come up with that’s time-aligned would
work in a similar way as well.
Steinberg: And one last question. You showed
the different formats– captions and descriptions
and whatnot– as essentially the same
VTT format, distinguished only
by the track tag. And I wonder if there’s
enough semantic difference that you’d want to be able to
distinguish it in another way. Like, for instance,
would you build a– could you build a file that had captions
and descriptions in it? Or might you want to have
some identifier in the file say, “This is a description file,” rather than count on
the track tag?
Pfeiffer: Mixing content
in one track is of course possible. Like, you could concatenate,
for example, a caption file with
an audio description file. Then you would basically
have two tracks available through one file. It’s not a very easy way
to deal with, and it would require a lot
of additional implementation. So, for example, if that concatenated case
was to be handled, the browsers
would need to find out where the second file starts
and so on. We actually like to keep
the semantics separate. And HTML markup is built
on keeping semantics separate. So this is why we introduced
the “kind” attribute. And so therefore,
in one file, you will only find captions
of one type. Thank you.
Foliot: Hi, Silvia. John Foliot
from Stanford University. When you gave
the code example of the CSS, it’s not clear where
the CSS actually lives. Is it embedded
in the WebVTT file? Can you have an external file
and link it? Can you just maybe get into that
a little bit further?
Pfeiffer: Yeah, let me
just find it. Sorry. So I’ve deliberately just
put that there as a snippet, because it actually
doesn’t matter where it lives. At the moment,
because we’re doing the CSS through the HTML page, that includes
this file up there… that HTML page could
either have this CSS piece directly in the HTML page
and address it– so an in-band CSS. Or it could be in
an external CSS file and be pulled into
the HTML page together with
the WebVTT file.
Foliot: So it always sits
outside of the VTT file.
Pfeiffer: It can.
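The two placements just described, cue styling written directly in the HTML page or pulled in from an external style sheet, can be sketched in Python by generating the page markup as a string. The file names (talk.webm, captions.vtt, cues.css), the class name "shout" and the helper names are hypothetical. The track attributes kind, src, srclang and label are the ones named in the talk. The CSS is limited to colour and font properties, on the assumption that only a restricted set of properties applies to cues.

```python
from html import escape

# Hypothetical cue styling: targets text marked up as <c.shout>...</c> in a cue.
CUE_CSS = """::cue(.shout) {
  color: red;
  font-family: sans-serif;
  font-weight: lighter;
}"""


def track_tag(kind, src, srclang, label):
    """Build one <track> element with the attributes named in the talk."""
    return '  <track kind="{}" src="{}" srclang="{}" label="{}">'.format(
        escape(kind), escape(src), escape(srclang), escape(label)
    )


def build_page(video_src, tracks, inline_css=None, stylesheet_href=None):
    """Return HTML for a video with text tracks and optional cue styling.

    inline_css places the rules in a <style> element in the page itself;
    stylesheet_href links an external CSS file instead.
    """
    parts = []
    if inline_css is not None:
        parts.append("<style>\n" + inline_css + "\n</style>")
    if stylesheet_href is not None:
        parts.append('<link rel="stylesheet" href="{}">'.format(escape(stylesheet_href)))
    parts.append('<video src="{}" controls>'.format(escape(video_src)))
    parts.extend(track_tag(*track) for track in tracks)
    parts.append("</video>")
    return "\n".join(parts)


if __name__ == "__main__":
    tracks = [("captions", "captions.vtt", "en", "English")]
    print(build_page("talk.webm", tracks, inline_css=CUE_CSS))
    print(build_page("talk.webm", tracks, stylesheet_href="cues.css"))
```

Either way the WebVTT file itself stays free of styling, so the same captions can be paired with a different style sheet on another page.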
Yeah, well… So it’s currently
under discussion whether this
is a functionality that we’d want to add
to WebVTT as well. So whether you want
a WebVTT file that links
to a style sheet. We’re careful about that
because we– this comes from
a very web point of view. We don’t really want to pollute
the WebVTT file with this web functionality, because there are applications
outside the web browser that do not want to have
to implement, for example,
all of CSS style sheets in order to display
captions properly. So it probably
makes more sense to have this
as a separate file. And if people do want to have
this additional functionality, they can use
the WebVTT file and the CSS file together and parse them, and use them
in their style sheet engine to come up
with the proper display.
Black: I actually,
I have a practical example of why you would want
the CSS. Is this on?
Can you hear me?
Pfeiffer: Yes.
Black: I have
a practical example of why you would want
the CSS to be outside the VTT. I work currently with people
who are producing caption files for the UK market and who are then redistributing
that same video here in the US, and they have to basically
redo the entire caption file because UK audiences
are expecting, for instance, to see a particular speaker
marked up in a particular color. Here in the US,
we’re not expecting that. So you could imagine
if you had one format where you marked up
the content semantically that, depending on whether
you showed it in a player here for the US
or there for the UK with the same caption file, you could display it
differently according to different users’
regional preferences.
Foliot: I would also add
that, with user style sheets, the user could actually
increase the font size to their
specific requirements.
Pfeiffer: Yes.
Foliot: A second,
really minor question. On the alignment, you said it could be
left, right, or– was it
middle or center? Can you actually
have it justified as well?
Pfeiffer: No. I don’t think we have
a means to justify the text. But I also think that,
actually– so from all the quality captions
literature that I’ve read, how to do quality captions, I think justification
has never been proposed as a readable way
of doing it because it changes
the spacing between characters and so therefore makes the text
actually harder to read.
Foliot: I agree.
Pfeiffer: I don’t think we
actually need that feature. But it is
a good point, yeah.
Foliot: Well, it just wasn’t
specifically declared one way or the other. I agree with you.
Pfeiffer: Yeah, okay, fine. Excellent. Have we got
any more questions? Well, thank you very much. I suppose if anyone
has any more questions, we’re always here
to answer more questions. There’s the Accessibility Group
at Google, through which Naomi
can be reached. And I’m very active
at the W3C. Feedback can be sent also
to the WHAT Working Group. There’s plenty of ways
to get to us. I should have probably put a contact slide in there
as well. I’m also on Twitter
and Facebook and so on. But just Google my name,
Silvia Pfeiffer. You’ll find me. Thank you very much. [applause]
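
The cue timestamps described in the talk for paint-on captions can be sketched in Python as follows. This is an illustration under stated assumptions, not how a browser implements it: the sample cue text, the names INLINE_STAMP, split_on_timestamps and painted_text are hypothetical, and only hh:mm:ss.mmm timestamps in angle brackets are recognised. The sketch splits a cue's text at each inline timestamp and then works out which part of the text has been activated at a given playback time.

```python
import re

# Matches an inline cue timestamp such as <00:00:01.500>.
INLINE_STAMP = re.compile(r"<(\d{2}):(\d{2}):(\d{2})\.(\d{3})>")


def split_on_timestamps(cue_start, payload):
    """Split cue text into (activation_time_in_seconds, text) segments.

    Text before the first inline timestamp is active from the cue start.
    """
    segments = []
    current_time = cue_start
    position = 0
    for match in INLINE_STAMP.finditer(payload):
        text = payload[position:match.start()]
        if text:
            segments.append((current_time, text))
        hours, minutes, seconds, millis = (int(g) for g in match.groups())
        current_time = hours * 3600 + minutes * 60 + seconds + millis / 1000
        position = match.end()
    tail = payload[position:]
    if tail:
        segments.append((current_time, tail))
    return segments


def painted_text(segments, now):
    """Return the text whose activation time has been reached at `now`."""
    return "".join(text for start, text in segments if start <= now)


# Hypothetical cue starting at 1.0 s, with a timestamp before each later word.
SEGMENTS = split_on_timestamps(
    1.0, "One <00:00:01.500>word <00:00:02.000>at <00:00:02.500>a time"
)

if __name__ == "__main__":
    for now in (1.0, 1.7, 3.0):
        print(now, repr(painted_text(SEGMENTS, now)))
```

The same split into already-activated and not-yet-activated text is what the past and future selectors mentioned in the talk expose to CSS, which is how the karaoke-style colouring is achieved.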

7 thoughts on “HTML5 video accessibility and the WebVTT file format”

  1. Excellent presentation. I must admit I'd love to see hyperlinks officially included within the spec (but I can understand the reasons given as to why they're not). I also think it's a great idea to keep style and structure separate – caption formats which mix both are ugly and difficult to read, in my opinion!!!

  2. Is this already available in browsers like Chrome?
    Also, does the WebVTT format depend on the format of the video as well?

  3. you can shove html5 video up your ass, its only here to keep apple fanboys happy, as apple products are not in the 21st century and dont play flash. html5 video wont properly go in full screen either, god i hate apple

  4. HTML5 goes full screen much more easily than Flash does in modern web browsers, Synthematix. What browser are you using?
