So, my exploration of the world of Web video started in the land of transcription.
As I mentioned previously, one of the requirements for us on the W3C Staff to be allowed to publish media content is to make sure it meets some minimal level of accessibility, and in the current (draft) state of affairs, this means providing a transcription of its content.
My first reaction to that policy was one of slight annoyance: I was afraid this would create too high a barrier to our publishing multimedia content, which in this day and age seems to be a fairly important expression mechanism.
But the argument was made that we don’t publish non-valid HTML documents, that we try to maintain some minimum level of accessibility across the W3C site, and that multimedia content had no reason to be treated differently – and I was convinced.
Convinced, but still concerned about the potential barriers it creates; and if it creates barriers for us, in an organization of technophiles with a strong interest in creating accessible content, how great would these barriers be for most of the rest of the world, who primarily want to share their content, without much direct interest in the technologies behind it?
I figured the best way to evaluate that was to go through the exercise myself; I should note that our policy should be accompanied by a budget so that we don’t have to do it ourselves, but it still seemed important to get my hands dirty in the transcription process to understand it better.
While we’re only required to provide a transcription, I was more interested in providing captions of the video of my ParisWeb presentation – given its duration of one hour, I was pretty sure I would get bored by the transcription process if I didn’t get the nice results of synchronized captions at the end.
So I started looking for tools to create these captions; the first one I found on my Gnome desktop was Subtitle Editor, which offers a pleasant user interface, showing the video with the subtitles as you write them, and also showing a waveform (i.e. a visual representation) of the audio track to help locate pauses in it.
But it appeared pretty quickly that the tool is geared more towards editing existing subtitles than creating new ones from scratch: the user interactions to adjust the timing of the subtitles were really overly complex, and not adapted to the setting of what would turn out to be 950 synchronization points!
Looking for another tool, I found Transcriber, available on Windows, Mac and Linux.
While the rather ugly Tcl/Tk interface of the tool and its lack of integration with the audio system in place on my Gnome desktop were a bit annoying, the tool itself and its well-defined keyboard shortcuts made it a much better tool for that captioning process:
- you load the audio file, from which a waveform is built and displayed in the lower part of the window;
- you can then use the tab key to start and pause the audio track, and jump back and forth in the audio track using the arrow keys;
- meanwhile, you can type the transcription and end each line of captions with the enter key.
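Whatever the tool, the end product of this workflow boils down to a list of timed text cues. As a minimal sketch (the function names are my own, and Transcriber itself uses its own file format), here is how such (start, end, text) cues could be serialized into the widely supported SubRip (SRT) subtitle format in Python:

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(cues):
    """cues: list of (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for i, (start, end, text) in enumerate(cues, 1):
        # Each SRT block: a 1-based index, a timing line, the caption text,
        # and a blank line separating it from the next block.
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)

print(to_srt([(0.0, 2.5, "Bonjour à tous."), (2.5, 6.0, "Merci d'être venus.")]))
```

Each tab press and enter key in the workflow above is effectively setting one of those timestamps.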
I’ve made a screencast (my latest whim on which I’ll come back later) showing that rough process of the transcription of the first few seconds of the video:
Transcriber is already pretty good as is, but there are two improvements I could see to such a tool that would make it even better:
- provide auto-completion of words, both from existing dictionaries, and from the words already used in the transcription (probably with a priority to the latter);
- work directly with video, and show video with subtitles live, the same way subtitle editor does – this provides a nice instantaneous feedback that is quite useful in this rather laborious effort;
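The first suggestion is easy to prototype; here is a rough sketch (hypothetical helper names, not part of any of the tools mentioned) of a completer that prefers words already used in the transcript, ranked by frequency, over a general dictionary:

```python
from collections import Counter

def make_completer(transcript, dictionary=()):
    # Words already used in the transcript get priority over the dictionary
    used = Counter(w.lower() for w in transcript.split())
    def complete(prefix):
        prefix = prefix.lower()
        # Transcript matches first, most frequent first
        from_transcript = sorted(
            (w for w in used if w.startswith(prefix)),
            key=lambda w: -used[w])
        # Then dictionary matches not already suggested
        from_dict = sorted(w for w in dictionary
                           if w.startswith(prefix) and w not in used)
        return from_transcript + from_dict
    return complete

complete = make_completer("the web the weather", dictionary=["weave", "web"])
print(complete("we"))  # transcript words first, then dictionary words
```

In a real captioning tool the completer would of course be re-fed incrementally as the transcript grows.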
But even with a nice tool, transcription is very tedious work; the pure transcription part (i.e. typing what you’re hearing) is probably the worst – the synchronizing part could almost be made fun by turning it into a video game of some sort…
Clearly, having a speech recognition engine do a first pass at the transcription would be incredibly useful; I didn’t find any ready-to-use (and open source) package to do so for a French track, but in the process of looking for one, I stumbled on a few interesting projects:
- VoxForge, which collects free voice samples to create voice profiles that can then be used by open source speech recognition projects; I contributed a couple of French samples, although the Java applet they’re using kept glitching, making the process much harder than it should be;
- and related to it, LibriVox, which aims to create a free and community-based library of audio books; quite appealing, although I found the process for getting started a bit too intimidating – I think an even more open system, where anyone could contribute any reading from a very large list of possible texts, would bring much more content (in a classical Cathedral and Bazaar fashion).
Back to transcription – I didn’t measure how long it took me to transcribe the one-hour-long track, but it was at least 3 or 4 times that duration; clearly this is an exercise I’d do myself again only for much shorter content; but then, people are much more likely to watch much shorter content as well, so that’s probably a reasonable assumption.
Beyond accessibility, I found the following interesting benefits in transcribing my own presentation:
- first, I re-heard the entirety of it, and it made me look at what had changed since then, what still holds and what doesn’t any more;
- it gave me a chance to analyse my own public speaking; I don’t know if that’s really a benefit, because it really highlighted a bunch of annoying speech tics, some twisted sentences, etc. But even if it hurts my ego somewhat, I’m hoping it can help me reduce some of these bad habits…
- I got a chance to correct in the transcript some rather obvious mistakes I made while talking (using “computers” where I meant “phones” for instance); I’m sure there are conventions on how to express this – I used “[phones]”;
- generally speaking, having a text transcript of the video offers a bunch of creative usage of that new text for further manipulation; e.g. here is the Wordle of the transcript of the video:
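(A Wordle is, underneath, just a rendering of word frequencies; as an illustration of the kind of further manipulation a transcript allows, here is a small sketch – the stopword list is purely illustrative – that computes those frequencies for a French text:)

```python
import re
from collections import Counter

def word_frequencies(text, stopwords=frozenset(), top=10):
    # Tokenize on runs of letters, including accented ones for French
    words = re.findall(r"[a-zà-ÿ]+", text.lower())
    return Counter(w for w in words if w not in stopwords).most_common(top)

transcript = "le web le web sémantique et les standards du web"
print(word_frequencies(transcript, stopwords={"le", "les", "et", "du"}))
```

Feeding the full transcript (minus common French stopwords) through something like this is essentially what the Wordle generator does before laying the words out.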
And more importantly, this was a trigger for me to look more into the video / text synchronization questions, which will be the topic of my next blog entry in this series.