Yohan Chalier

Subtitles Aligner

A Python module that automatically syncs an external .SRT file with a video. Here is how it works:

  1. Extract the audio of the video
  2. Use speech-to-text to get a partial transcript of the video
  3. Translate the subtitles into the video language
  4. Match the two together, and shift the subtitles adequately

Speech-to-Text

The first step consisted of a survey of available software for the speech-to-text and translation steps. I was looking for free, preferably open-source, offline software with multi-language support.

The first search result is of course the Google Speech-to-Text API. And of course, it works well: transcripts are good and a lot of languages are supported. However, it requires online access to the Google API, and the free tier is very limited. I therefore decided not to go for Google. Instead, I focused on the CMUSphinx Project. It also supports a lot of languages, and it is free and works offline. But it is not as easy to use as the Google API.

First, large model files must be downloaded for each language to be parsed. Then, I faced some encoding issues. My first attempts on clear audio recordings yielded gibberish transcripts. I soon suspected that the issue came from the audio encoding, but I had some trouble finding out exactly which input format was needed. I finally found it on the CMUSphinx FAQ page:

CMUSphinx decoders do not include format converters, they have no idea how to decode encoded audio. Before processing audio must be converted to PCM format. Recommended format is 16khz 16bit little-endian mono. If you are decoding telephone quality audio you can also decode 8khz 16bit little-endian mono, but you usually need to reconfigure the decoder to input 8khz audio. For example, pocketsphinx has -samprate 8000 option in configuration.

The command I ended up using to extract the audio from video files is:

ffmpeg -i <input> -ar 16000 -ac 1 -f wav -bitexact -acodec pcm_s16le <output>
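To call this command from Python, a thin wrapper around subprocess is enough. This is only a minimal sketch: the function names build_ffmpeg_command and extract_audio are illustrative, and it assumes that ffmpeg is available on the PATH.

```python
import subprocess


def build_ffmpeg_command(video_path, wav_path):
    """Return the ffmpeg argument list for 16 kHz, 16-bit, mono PCM WAV output."""
    return [
        "ffmpeg",
        "-i", video_path,
        "-ar", "16000",          # 16 kHz sampling rate
        "-ac", "1",              # one channel (mono)
        "-f", "wav",             # WAVE container
        "-bitexact",             # avoid encoder-specific metadata in the output
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        wav_path,
    ]


def extract_audio(video_path, wav_path):
    """Run ffmpeg; raises subprocess.CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.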

The exact format of the WAVE header matters here, since I rely on it to time the extracted fragments correctly. Timing was my second big issue with CMUSphinx. The model provides extracted words as tuples of the form:

("hello", 100, 200)
("my", 200, 220)
("name", 230, 300)
("is", 300, 320)
("jonathan", 320, 440)

where the two integers are the starting and ending frames of the word. Those frame indices are bound to the sampling frequency of the raw audio data and its internal representation by CMUSphinx. Subtitle files use actual timings, hence I have to convert those frames to real timings. I thought I could do that by simply dividing by the sampling frequency, but I was wrong. No matter what I did, I always had issues with strange offsets being introduced, in a way that seemed rather random. After quite some troubleshooting, I finally found the error, thanks to a visualization comparing the exact transcription with what the model decoded, and at what time.
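To make the frame-to-time conversion concrete, here is a minimal sketch. It assumes a decoder frame rate of 100 frames per second, which is my understanding of the Pocketsphinx default and should be checked against the actual configuration. The function name frames_to_seconds and the fragment_start parameter are illustrative. The result is only correct if the decoder's frame counter really starts at the beginning of the fragment.

```python
def frames_to_seconds(words, fragment_start=0.0, frames_per_second=100):
    """Convert (word, start_frame, end_frame) tuples to absolute timings in seconds.

    frames_per_second=100 is an assumption about the decoder configuration.
    fragment_start is the position of the fragment in the full recording.
    """
    return [
        (
            word,
            fragment_start + start / frames_per_second,
            fragment_start + end / frames_per_second,
        )
        for word, start, end in words
    ]


words = [("hello", 100, 200), ("my", 200, 220), ("name", 230, 300)]
# Words decoded from a fragment that starts one minute into the recording.
timed = frames_to_seconds(words, fragment_start=60.0)
```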

Long video files would take too long to process as a whole, and I only need some parts of the recording to match the subtitles correctly. I first used the SpeechRecognition Python module, as it provides a record method for AudioFile that extracts small parts of the recording. Each fragment was then transcribed independently of the others, and the start time of the fragment was added as an offset to the timings of all the words in that fragment.

I noticed that the timing errors were not consistent across fragments: when a fragment started with a lot of silence, the timing error increased as well. That was the mistake: the CMUSphinx model has a remove_silence option, set to True by default, which makes the transcription start when the speech starts, not when the fragment begins. Unfortunately, I was not able to change this configuration from the SpeechRecognition module, so I had to use the Pocketsphinx model directly and split the raw audio data myself. For that I use the audio sampling frequency and the data size of the WAVE file, two numbers present in the file header.
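The manual split can be sketched with Python's standard wave module, which reads the sampling frequency and the number of frames from the header. This is an illustration under assumptions rather than the module's actual code: read_fragment is a hypothetical helper, and it expects the PCM WAV file produced by the ffmpeg command above.

```python
import wave


def read_fragment(wav_path, start_seconds, duration_seconds):
    """Return (raw_pcm_bytes, actual_start_seconds) for a slice of a PCM WAV file."""
    with wave.open(wav_path, "rb") as wav:
        rate = wav.getframerate()  # sampling frequency, read from the header
        total = wav.getnframes()   # number of frames, derived from the data size
        # Clamp the requested window to the length of the recording.
        first = min(int(start_seconds * rate), total)
        count = min(int(duration_seconds * rate), total - first)
        wav.setpos(first)
        return wav.readframes(count), first / rate
```

The returned start time is the offset to add to every word timing decoded from that fragment.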

Once those issues were solved, the audio transcription seemed to work fine. Its precision is certainly not perfect, but it is usable.

Translation

In order to match the subtitles with the audio transcript, the two have to be in the same language. Finding good machine translation software was not as easy as I thought it would be.

As with speech-to-text, the first result that shows up when looking for an efficient and easy-to-use way to translate text between a variety of languages is Google Cloud Translation. It is indeed good, but it suffers from the same limitations as the speech-to-text API. Finding a substitute was actually quite hard. The DeepL and Bing APIs are similar to Google's. I first had a look at Wikipedia's comparison of machine translation applications, and had high hopes for Apertium. Unfortunately, its support for the French/English pair is unofficial, and what I tried was really defective. I tried other platforms such as Moses and Apache Joshua, but those were not really suited to my purpose and, most importantly, were far too cumbersome to use.

After some more research, I stumbled upon Word2word, a method developed by researchers that uses large, automatically built dictionaries for word-to-word translation. This turned out to suit my purpose: lots of languages are supported, and it is really easy to use. The quality is not so good though. For instance, a confidence score for each translation would be really helpful for sorting the results.
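Word-to-word translation of a subtitle line can be sketched as follows. The lookup argument stands for any word-to-word translator. Its interface (a callable that returns a list of translations and raises KeyError for unknown words) is an assumption made for this example, and the toy dictionary is made up.

```python
def candidate_translations(words, lookup, max_candidates=3):
    """Map each source word to a list of candidate translations.

    `lookup` is assumed to be a callable returning a list of translations
    for a word and raising KeyError when the word is unknown.
    """
    result = []
    for word in words:
        key = word.lower()
        try:
            candidates = list(lookup(key))[:max_candidates]
        except KeyError:
            candidates = []
        # Fall back to the word itself: proper nouns often need no translation.
        result.append(candidates or [key])
    return result


# Toy French-to-English dictionary standing in for a real translator.
toy_fr_en = {"bonjour": ["hello", "hi"], "nom": ["name", "surname"]}
translated = candidate_translations(["Bonjour", "Jonathan"], toy_fr_en.__getitem__)
```

Keeping several candidates per word, rather than a single translation, leaves the choice to the matching step.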

Sequence Alignment

Both the subtitles and the transcript are converted into a WordSequence object, which consists of a list of time windows containing one or more words. The objective is to find the timing offset that minimizes the difference between the two. My approach is rather simple:

  1. split the two sequences into buckets of fixed duration
  2. try various bucket (discrete) offsets
  3. pick the best offset regarding some similarity function
  4. repeat with a smaller bucket duration
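The four steps above can be sketched as a coarse-to-fine search. This is a simplified illustration, not the module's actual implementation. It represents each sequence as a plain list of (time, word) pairs instead of a WordSequence, uses the number of shared words per bucket as the similarity function, and the bucket durations and search range are arbitrary choices.

```python
from collections import defaultdict


def bucketize(words, duration, shift=0.0):
    """Group (time, word) pairs into buckets of fixed duration."""
    buckets = defaultdict(set)
    for time, word in words:
        buckets[int((time + shift) // duration)].add(word)
    return buckets


def similarity(buckets_a, buckets_b):
    """Count the words that fall into the same bucket in both sequences."""
    return sum(
        len(words & buckets_b[index])
        for index, words in buckets_a.items()
        if index in buckets_b
    )


def find_offset(subtitles, transcript, durations=(8.0, 4.0, 2.0, 1.0, 0.5), max_steps=10):
    """Estimate the shift (in seconds) to add to the subtitle timings."""
    offset = 0.0
    for duration in durations:
        reference = bucketize(transcript, duration)
        best_score, best_offset = -1, offset
        for step in range(-max_steps, max_steps + 1):
            candidate = offset + step * duration
            score = similarity(bucketize(subtitles, duration, candidate), reference)
            # On ties, prefer the candidate closest to the current estimate.
            closer = abs(candidate - offset) < abs(best_offset - offset)
            if score > best_score or (score == best_score and closer):
                best_score, best_offset = score, candidate
        offset = best_offset
    return offset
```

Each pass refines the estimate of the previous one, so the final precision is bounded by the smallest bucket duration.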

Using duration buckets allows for a discrete exploration of the candidate offsets, while repeating the search with smaller and smaller buckets yields an increasingly precise estimate. In theory this works well, but major issues have to be dealt with:

  • errors in speech recognition
  • errors in subtitle translation
  • subtitles do not provide precise word-by-word timing

Errors are to be expected. Even with perfect recognition and a perfect translator, translation is not deterministic: the same sentence can be rendered in several valid ways. To compensate for the inevitable errors, words are lemmatized and several translations of each word are considered during the matching.
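A tolerant word comparison along those lines could look like the sketch below. The normalize function is a deliberately crude stand-in for real lemmatization (it only lowercases, strips punctuation and removes a plural "s"), and both function names are illustrative.

```python
import string


def normalize(word):
    """Crude stand-in for lemmatization, for illustration only."""
    word = word.lower().strip(string.punctuation)
    # Naive plural handling: "names" becomes "name", "glass" stays as is.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word


def words_match(transcript_word, candidates):
    """True if the transcribed word equals any candidate translation once normalized."""
    target = normalize(transcript_word)
    return any(normalize(candidate) == target for candidate in candidates)
```

A transcribed word then counts as a match as soon as one of the candidate translations of a subtitle word agrees with it.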

Regarding the time precision of the subtitles, I tried different approaches, such as splitting the duration of each subtitle into parts proportional to the length of each word, to try to isolate the exact moment a word is pronounced, but the results were still not convincing. So far, I do not have an acceptable solution to demonstrate.
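The proportional split mentioned above can be sketched in a few lines. The function name split_subtitle is illustrative, and word length is measured in characters, which is one possible reading of "length".

```python
def split_subtitle(text, start, end):
    """Give each word a time window proportional to its character length."""
    words = text.split()
    total = sum(len(word) for word in words)
    windows, cursor = [], start
    for word in words:
        duration = (end - start) * len(word) / total
        windows.append((word, cursor, cursor + duration))
        cursor += duration
    return windows


# A 7-second subtitle: "hello" gets 5 seconds and "my" gets 2 seconds.
windows = split_subtitle("hello my", 0.0, 7.0)
```

This assumes a constant speaking rate and no pauses within a subtitle, which may be part of why the results were not convincing.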

Conclusion

In its current state, the aligner is functional and performs rather well on non-fictional content, such as YouTube podcasts, which I used extensively as testing material. However, when tried on longer content such as an actual TV show, with ambient music and such, the results were quite bad. Contributions are welcome!

Here are some future work ideas:

  • use keyword detection to target specific words from the subtitles in the audio recording (strongly relies on the translation quality, and Pocketsphinx natively does not support sentence search)
  • look for silent parts in the audio and in the subtitles and try to match them together (would not require translation nor transcription, which avoids errors)
  • use embeddings similarity to compare words
  • target proper nouns for the alignment (avoiding the need for translation)
  • use a TF/IDF measure of word importance in the matching process
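As an illustration of the last idea, the sketch below computes only the IDF part of such a measure, treating each subtitle as a document, and uses it to weight shared words. Nothing here is implemented in the aligner. The function names and the smoothing formula are choices made for this example.

```python
import math
from collections import Counter


def inverse_document_frequency(subtitles):
    """Compute a smoothed IDF weight per word, one subtitle being one document."""
    document_count = len(subtitles)
    frequency = Counter(
        word for text in subtitles for word in set(text.lower().split())
    )
    return {
        word: math.log((1 + document_count) / (1 + count)) + 1.0
        for word, count in frequency.items()
    }


def weighted_overlap(words_a, words_b, idf):
    """Sum the IDF weights of the words shared by two collections of words."""
    return sum(idf.get(word, 0.0) for word in set(words_a) & set(words_b))


idf = inverse_document_frequency(["the cat sleeps", "the dog barks", "the end"])
```

With such weights, a match on a rare word counts for more than a match on a frequent word like "the".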