brew install youtube-dl
pip install pysrt beautifulsoup4 lxml
pip install --pre ttconv
Download the subtitles in ttml format and rename the file to subtitles.ttml.
youtube-dl --write-subs https://www.bbc.com/news/world-us-canada-65452940
Convert the subtitles to srt format [1].
tt convert -i subtitles.ttml -o subtitles.srt
Read the subtitles from the srt file, remove all formatting (e.g. font tags), and save the result as plain text.
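import pysrt
from bs4 import BeautifulSoup

# Load all cues and join their text, one cue per line.
subs = pysrt.open("subtitles.srt")
html_text = "\n".join(sub.text for sub in subs)

# Parse the text as HTML so that tags such as <font> are dropped.
# The 'lxml' parser needs the lxml package (pip install lxml).
soup = BeautifulSoup(html_text, 'lxml')
plain_text = soup.get_text()

with open("subtitles.txt", "w", encoding="utf-8") as text_file:
    text_file.write(plain_text)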
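To see what the tag-stripping step does to a single cue, here is a minimal sketch that uses only the standard library's html.parser instead of BeautifulSoup. The names TagStripper, strip_tags and sample_cue are made up for this illustration, and the sample text is invented, not taken from the BBC subtitles.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only text nodes, discarding tags such as <font> or <i>."""

    def __init__(self):
        # convert_charrefs=True turns entities like &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(text):
    """Return text with all markup removed (hypothetical helper)."""
    parser = TagStripper()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Invented sample cue, for illustration only.
sample_cue = '<font color="#ffff00">Hello,</font> <i>world</i>'
print(strip_tags(sample_cue))
```

This is a stand-in for the BeautifulSoup call above, not a replacement for it; it may treat malformed markup differently.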
1. youtube-dl provides --convert-subs, which could be used to extract the subtitles in srt format directly, but ttconv automatically removes unnecessary line breaks.
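If the srt file does come from --convert-subs, the line breaks inside each cue could be collapsed by hand. The sketch below is one simple way to do that, assuming that every run of whitespace inside a cue should become a single space; it is not a reproduction of what ttconv does internally, and collapse_line_breaks is a hypothetical helper name.

```python
def collapse_line_breaks(cue_text):
    """Join the lines of one cue with single spaces (hypothetical helper).

    str.split() with no arguments splits on any whitespace run, so
    newlines, repeated spaces and leading/trailing blanks all disappear.
    """
    return " ".join(cue_text.split())


# Invented two-line cue, for illustration only.
print(collapse_line_breaks("This sentence was\nsplit over two lines"))
```

With pysrt, this could be applied to each cue before joining, for example by setting sub.text = collapse_line_breaks(sub.text) in a loop over subs.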