Don’t call me DOM

12 February 2009

Diving in transcription


So, my exploration of the world of Web video started in the land of transcription.

As I mentioned previously, one of the requirements for us on the W3C Staff to be allowed to publish media content is to make sure it meets some minimal level of accessibility, and in the current (draft) state of affairs, this means providing a transcription of its content. (Oct 8 2009: that policy is now publicly available)

My first reaction to that policy was one of slight annoyance: I was afraid it would set too high a barrier to our publishing multimedia content, which in this day and age seems to be a fairly important means of expression.

But the argument was made that we don’t publish non-valid HTML documents, that we try to maintain some minimum level of accessibility across the W3C site, and that multimedia content had no reason to be treated differently – and I was convinced.

Convinced, but still preoccupied by the potential barriers it creates; and if it creates barriers for us, in an organization of technophiles with a strong interest in creating accessible content, how great would these barriers be for most of the rest of the world, who primarily want to share their content, without much direct interest in the technologies behind it?

I figured the best way to evaluate that was to go through the exercise myself; I should note that our policy should be accompanied by a budget so that we don’t have to do it all ourselves, but it still seemed important to get my hands dirty in the transcription process to understand it better.

While we’re only required to provide a transcription, I was more interested in providing captions for the video of my ParisWeb presentation – given its one-hour duration, I was pretty sure I would get bored by the transcription process if I didn’t get the nice result of synchronized captions at the end.

So I started looking for tools to create these captions; the first one I found on my Gnome desktop was Subtitle Editor, which offers a pleasant user interface, showing the video with the subtitles as you write them, and also displaying a waveform (i.e. a visual representation) of the audio track to help locate pauses in it.

But it appeared pretty quickly that the tool was more geared towards editing existing subtitles than creating new ones from scratch: the user interactions to adjust the timing of the subtitles were really overly complex, and not suited to setting what would turn out to be some 950 synchronization points!

Looking for another tool, I found Transcriber, available on Windows, Mac and Linux.

While the rather ugly Tcl/Tk interface of the tool and its lack of integration with the audio system on my Gnome desktop were a bit annoying, the tool itself and its well-defined keyboard shortcuts made it a much better fit for the captioning process:

  • you load the audio file, from which a waveform is built and displayed in the lower part of the window;
  • you can then use the tab key to start and pause the audio track, and jump back and forth in the audio track using the arrow keys;
  • meanwhile, you can type the transcription and end each line of captions with the enter key.

I’ve made a screencast (my latest whim, on which I’ll come back later) showing the rough process of transcribing the first few seconds of the video:

Screencast of Transcriber in action

(also available as Ogg/Theora, with a Timed Text transcription)

Transcriber is already pretty good as is, but there are two improvements I could see to such a tool that would make it even better:

  • provide auto-completion of words, both from existing dictionaries and from the words already used in the transcription (probably giving priority to the latter) – see the sketch after this list;
  • work directly with video, and show the video with subtitles live, the same way Subtitle Editor does – this provides nice instantaneous feedback that is quite useful in this rather laborious effort.
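
For what it’s worth, here is a minimal sketch of that first suggestion – plain Python, not Transcriber code, with an invented Completer class – showing a completer that prefers the vocabulary already used in the transcript over a general dictionary:

    from collections import Counter

    class Completer:
        """Toy word completer: transcript vocabulary first, dictionary second."""

        def __init__(self, dictionary_words):
            self.dictionary = set(dictionary_words)   # e.g. loaded from a word list
            self.used = Counter()                     # words already typed in this transcript

        def record(self, caption_line):
            """Call on each finished caption line to learn its vocabulary."""
            for word in caption_line.lower().split():
                self.used[word.strip(".,;:!?")] += 1

        def complete(self, prefix, limit=5):
            prefix = prefix.lower()
            # transcript words first, most frequent first...
            from_transcript = [w for w, _ in self.used.most_common() if w.startswith(prefix)]
            # ...then dictionary words that haven't been used yet
            from_dictionary = sorted(w for w in self.dictionary
                                     if w.startswith(prefix) and w not in self.used)
            return (from_transcript + from_dictionary)[:limit]

    # Example: the transcript's own vocabulary wins over the plain dictionary
    completer = Completer(["transcription", "transport", "transe"])
    completer.record("la transcription de la vidéo")
    print(completer.complete("trans"))   # ['transcription', 'transe', 'transport']

The ordering matters because the vocabulary of a given talk (names, jargon, recurring themes) comes back far more often than a general dictionary would predict.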

But even with a nice tool, transcription is very tedious work; the pure transcription part (i.e. typing what you’re hearing) is probably the worst – the synchronizing part could almost be made fun by turning it into a video game of some sort…

Clearly, having a speech recognition engine do a first pass at the transcription would be incredibly useful; I didn’t find any ready-to-use (and open source) package to do so for a French track, but in the process of looking for one, I stumbled on a few interesting projects:

  • VoxForge, which collects free voice samples to create voice profiles that can then be used by open source speech recognition projects; I contributed a couple of French samples, although the Java applet they’re using kept malfunctioning, making the process much less easy than it should be;
  • and related to it, LibriVox, which sets out to create a free and community-based library of audio books; quite appealing, although I found the process for getting started a bit too intimidating – I think an even more open system, where anyone could contribute any reading from a very large list of possible texts, would bring in much more content (in classic Cathedral and Bazaar fashion).

Back to transcription – I didn’t measure how long it took me to transcribe the one-hour-long track, but it was at least 3 or 4 times that duration; clearly this is an exercise I’d only do again myself for much shorter content; but then, people are much more likely to watch much shorter content as well, so that’s probably a reasonable assumption.

Beyond accessibility, I found the following interesting benefits in transcribing my own presentation:

  • first, I re-heard the entirety of it, and it made me look at what had changed since then, what still holds and what doesn’t any more;
  • it gave me a chance to analyse my own public speaking; I don’t know if that’s really a benefit, because it really highlighted a bunch of annoying speech tics, some twisted sentences, etc. But even if it hurts my ego somewhat, I’m hoping it can help me reduce some of these bad habits…
  • I got a chance to correct in the transcript some rather obvious mistakes I made while talking (using “computers” where I meant “phones” for instance); I’m sure there are conventions on how to express this – I used “[phones]”;
  • generally speaking, having a text transcript of the video opens up a bunch of creative uses of that new text for further manipulation (a small sketch of one follows below this list); e.g. here is the Wordle of the transcript of the video:
    Cloud of words taken from the audio track of the video.
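
The Wordle above is essentially a rendering of word frequencies; as an illustration of that kind of reuse, here is a small Python sketch that extracts those frequencies from the SRT captions (the stop-word list is deliberately truncated, and the file name at the end is just a placeholder):

    import re
    from collections import Counter

    # Very small French stop-word list, deliberately truncated for the example.
    STOP_WORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et",
                  "que", "qui", "est", "dans", "pour", "sur", "pas", "ce"}

    def word_frequencies(srt_text):
        """Count the words appearing in the caption lines of an SRT document."""
        counts = Counter()
        for line in srt_text.splitlines():
            # skip cue numbers and "00:00:01,000 --> 00:00:04,000" timing lines
            if line.strip().isdigit() or "-->" in line:
                continue
            for word in re.findall(r"[\w']+", line.lower()):
                if word not in STOP_WORDS and len(word) > 2:
                    counts[word] += 1
        return counts

    # e.g. the 30 most frequent words, ready to feed to a tag-cloud renderer
    # (the file name is a placeholder):
    # print(word_frequencies(open("parisweb-talk.srt", encoding="utf-8").read()).most_common(30))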

And more importantly, this was a trigger for me to look more into the video / text synchronization questions, which will be the topic of my next blog entry in this series.

6 Responses to “Diving in transcription”

  1. Eric Says:

    I can’t for the life of me figure out how to skip back a set number of seconds after pausing. I press tab to pause, and I want to press a button that starts playing from 2 seconds before the pause point. Did you figure out how to do this?

  2. Susan Says:

    Thank you for letting me know about these interesting sites and utilities!

    I mean, merci!

    Susan

  3. Dom Says:

    Hi Eric,

    I think you can get that effect using the setting in “Options/Audio File…/Go back before playing” (but then it will apply each time you pause and play back, rather than being triggered by a specific key).

  4. Dom Says:

    The policy I alluded to that requires W3C Staff to only publish accessible media content has now been released publicly: http://www.w3.org/2008/06/video-notes

  5. Dom Says:

    If anyone needs it, I have built a simple XSLT that turns Transcriber’s XML format into SRT and made it available at:
    http://www.w3.org/2009/02/presentation-viewer/transcriber2src.xsl
    (yes, the URI is 2src rather than 2srt — my fingers cheated me!)
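
    For anyone who would rather not run an XSLT processor, a rough Python equivalent could look like the sketch below; it assumes the usual Transcriber (.trs) layout – <Turn> elements carrying startTime/endTime attributes and containing <Sync time="…"/> markers followed by the transcribed text – and writes numbered SRT cues. Treat it as an illustration rather than a drop-in replacement for the stylesheet:

        import sys
        import xml.etree.ElementTree as ET

        def fmt(seconds):
            """Format a time in seconds as the HH:MM:SS,mmm timestamp SRT expects."""
            ms = int(round(seconds * 1000))
            h, ms = divmod(ms, 3600000)
            m, ms = divmod(ms, 60000)
            s, ms = divmod(ms, 1000)
            return "%02d:%02d:%02d,%03d" % (h, m, s, ms)

        def trs_to_srt(trs_path, out):
            tree = ET.parse(trs_path)
            index = 1
            for turn in tree.iter("Turn"):
                end_of_turn = float(turn.get("endTime"))
                syncs = list(turn.iter("Sync"))
                for i, sync in enumerate(syncs):
                    start = float(sync.get("time"))
                    # a caption runs until the next Sync point, or the end of the Turn
                    end = float(syncs[i + 1].get("time")) if i + 1 < len(syncs) else end_of_turn
                    text = (sync.tail or "").strip()   # the transcribed text follows the Sync marker
                    if not text:
                        continue
                    out.write("%d\n%s --> %s\n%s\n\n" % (index, fmt(start), fmt(end), text))
                    index += 1

        if __name__ == "__main__":
            trs_to_srt(sys.argv[1], sys.stdout)

    (Usage would be something like “python trs2srt.py talk.trs > talk.srt”, where the script and file names are of course made up.)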

  6. Xavier Says:

    I always use the FTW Transcriber (which you can get here: http://www.theftwtranscriber.com/). It’s also free, and has various features Transcriber doesn’t have; for example, it can play remotely hosted files. Also, it plays a much wider range of file types – the writers say it plays whatever Windows Media Player can play. It also displays the visuals in videos much better than Transcriber does.

Dominique Hazaël-Massieux (dom@w3.org) is part of the World Wide Web Consortium (W3C) Staff; his interests cover a number of Web technologies, as well as the usage of open source software in a distributed work environment.