@kratsg
Last active July 19, 2021 20:34
Aeneas Alignment

Aligning Text with Aeneas

I've installed aeneas and I want to use it to align a transcript with a video. Here's what I did. Taking the raw transcript from my captioner, I convert it to plain text (on macOS, using textutil):

textutil 2021-01-28\ Machine\ Learning.rtf -convert txt

This gives me "2021-01-28 Machine Learning.txt", which I then process with split.py (listed at the end; requires nltk):

python split.py

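which gives me new_parsed.txt. Why split/fragment the text? aeneas assumes the text you're working with is already fragmented: with is_text_type=plain, each line of the input file is treated as one fragment to align. The raw transcript comes as long paragraphs, so we need to do the fragmenting ourselves (see split.py). I was inspired by readbeyond/aeneas#242.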
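To make the idea concrete, here is a minimal sketch of fixed-size fragmenting. The sample sentence is made up, and `fragment` is only an illustrative helper name, not part of aeneas or of split.py:

```python
def fragment(text, n=10):
    """Split text into fragments of at most n whitespace-separated words."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]


# Hypothetical transcript line (17 words), used only for illustration.
sample = ("so today we are going to talk about machine learning "
          "and how it is used in physics")

for line in fragment(sample, n=10):
    print(line)
# so today we are going to talk about machine learning
# and how it is used in physics
```

Each printed line would become one subtitle-sized fragment in the plain-text input.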

I can then run aeneas on the video and the fragmented text to get an srt file that I can encode into the video with HandBrake:

$ python -m aeneas.tools.execute_task zoom_0.mp4 new_parsed.txt "task_language=eng|os_task_file_format=srt|is_text_type=plain" map.srt
[INFO] Validating config string (specify --skip-validator to bypass)...
[INFO] Validating config string... done
[INFO] Creating task...
[INFO] Creating task... done
[INFO] Executing task...
[INFO] Executing task... done
[INFO] Creating output sync map file...
[INFO] Creating output sync map file... done
[INFO] Created file 'map.srt'

and I'm done!
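Before encoding, it can be worth a quick look at how long each fragment stays on screen. The following is a rough sketch of a parser for SRT text, assuming the usual layout of index line, timing line, then caption text, with blank lines between cues. The sample cues and the `parse_srt` name are invented for illustration and are not real aeneas output:

```python
import re
from datetime import timedelta

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,500".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_srt(text):
    """Return a list of (start, end, caption) tuples from SRT text."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for idx, line in enumerate(lines):
            m = TIMING.search(line)
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = timedelta(hours=h1, minutes=m1, seconds=s1, milliseconds=ms1)
                end = timedelta(hours=h2, minutes=m2, seconds=s2, milliseconds=ms2)
                # Everything after the timing line is the caption text.
                cues.append((start, end, " ".join(lines[idx + 1:])))
                break
    return cues


# Made-up sample in SRT layout, for illustration only.
sample_srt = """1
00:00:00,000 --> 00:00:03,200
so today we are going to talk about machine learning

2
00:00:03,200 --> 00:00:05,960
and how it is used in physics
"""

cues = parse_srt(sample_srt)
for start, end, caption in cues:
    print(f"{(end - start).total_seconds():6.2f}s  {caption}")
```

To check the real output, the same function could be fed the contents of map.srt, e.g. `parse_srt(open("map.srt").read())`. Fragments with very short or very long durations are a hint that the alignment or the fragment size needs adjusting.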

$ cat split.py
# Imported by the original script but not used; splitting is done with str.split().
from nltk.tokenize import word_tokenize

path = '2021-01-28 Machine Learning.txt'
n = 10  # maximum number of words in each fragment

# Open the output with "w" so that rerunning the script does not append duplicates.
with open(path, 'r') as f, open("new_parsed.txt", "w") as output:
    for line in f:
        words = line.split()
        # Blank lines produce no words and are skipped.
        # Write the words in chunks of at most n, one fragment per output line.
        for start in range(0, len(words), n):
            output.write(' '.join(words[start:start + n]) + '\n')