r"""
Convert YouTube subtitles (vtt) to human-readable text.

Download only the subtitles from YouTube with youtube-dl:
youtube-dl --skip-download --convert-subs vtt <video_url>
(add --write-sub or --write-auto-sub if no subtitle file is written)

Note that the default subtitle format provided by YouTube is ass, which is hard
to process with simple regex. Luckily youtube-dl can convert ass to vtt, which
is easier to process.

To convert all vtt files inside a directory:
find . -name "*.vtt" -exec python vtt2text.py {} \;
"""

import sys
import re


def remove_tags(text):
    """
    Remove vtt markup tags.
    """
    tags = [
        r'</c>',
        r'<c(\.color\w+)?>',
        r'<\d{2}:\d{2}:\d{2}\.\d{3}>',
    ]

    for pat in tags:
        text = re.sub(pat, '', text)

    # Reduce each cue timing line to its first two fields (HH:MM)
    text = re.sub(
        r'(\d{2}:\d{2}):\d{2}\.\d{3} --> .* align:start position:0%',
        r'\g<1>',
        text
    )

    # Blank out whitespace-only lines
    text = re.sub(r'^\s+$', '', text, flags=re.MULTILINE)
    return text


def remove_header(lines):
    """
    Remove vtt file header.
    """
    pos = -1
    # Keep the position of the last header marker that is present
    for mark in ('##', 'Language: en',):
        if mark in lines:
            pos = lines.index(mark)
    lines = lines[pos+1:]
    return lines


def merge_duplicates(lines):
    """
    Remove duplicated subtitles. Duplicates are always adjacent.
    """
    last_timestamp = ''
    last_cap = ''
    for line in lines:
        if line == "":
            continue
        if re.match(r'^\d{2}:\d{2}$', line):
            # Timestamp line: emit only when it differs from the previous one
            if line != last_timestamp:
                yield line
                last_timestamp = line
        else:
            # Caption line: emit only when it differs from the previous one
            if line != last_cap:
                yield line
                last_cap = line
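

def merge_short_lines(lines):
    """
    Join consecutive caption lines into lines of roughly 80 characters.
    """
    buffer = ''
    for line in lines:
        if line == "" or re.match(r'^\d{2}:\d{2}$', line):
            # Flush pending text first so it stays under its own timestamp
            if buffer:
                yield buffer.strip()
                buffer = ''
            yield '\n' + line
            continue

        if len(line+buffer) < 80:
            buffer += ' ' + line
        else:
            yield buffer.strip()
            buffer = line
    if buffer:
        yield buffer.strip()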


def main():
    vtt_file_name = sys.argv[1]
    # Escape the dot so only a literal ".vtt" suffix is replaced
    txt_name = re.sub(r'\.vtt$', '.txt', vtt_file_name)
    with open(vtt_file_name) as f:
        text = f.read()
    text = remove_tags(text)
    lines = text.splitlines()
    lines = remove_header(lines)
    lines = merge_duplicates(lines)
    lines = list(lines)
    lines = merge_short_lines(lines)
    lines = list(lines)

    with open(txt_name, 'w') as f:
        for line in lines:
            f.write(line)
            f.write("\n")


if __name__ == "__main__":
    main()

find . -iname "*.vtt" -exec python vtt2text.py '{}' \;
How do I run this? Sorry, I'm still learning; I feel like a script kiddie.
Well, you know what a script kiddie is, so you're halfway there! Not sure this is the place to have this conversation, so hit me up on Discord operat0r#1379 or 404.647.4250 -RMcCurdy.com
@claudchereji It's a script for a Linux terminal. It's also not hard to modify the Python script so that it handles multiple files.
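A minimal sketch of the multi-file change mentioned above. It is not the gist author's code: `txt_path_for`, `convert_all`, and the `convert_one(src, dst)` callback are hypothetical names, and the callback is assumed to wrap whatever the script's `main()` does for a single file.

```python
from pathlib import Path


def txt_path_for(vtt_path):
    """Return the output path: same location and name, .vtt swapped for .txt."""
    return Path(vtt_path).with_suffix('.txt')


def convert_all(vtt_paths, convert_one):
    """Call convert_one(src, dst) for every input path and return the outputs.

    convert_one is a hypothetical callback standing in for the per-file
    conversion (read, clean up, write) done by the script.
    """
    outputs = []
    for src in map(Path, vtt_paths):
        dst = txt_path_for(src)
        convert_one(src, dst)
        outputs.append(dst)
    return outputs
```

With something like this, `main()` could pass `sys.argv[1:]` instead of `sys.argv[1]`, so that `python vtt2text.py *.vtt` converts every file the shell expands.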
I had trouble with international characters when running this script under Python 3 (it works with Python 2). It seems YouTube doesn't use UTF-8 for everything. Passing encoding='iso-8859-1' when opening the vtt file, to preserve the bytes, fixed this for me. I plan to fork the gist.
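A rough sketch of that idea, written as a variation rather than the commenter's exact fix: it tries UTF-8 first and only falls back to ISO-8859-1 when decoding fails. The helper name `read_vtt` and the order of encodings are assumptions made for illustration.

```python
def read_vtt(path, encodings=('utf-8', 'iso-8859-1')):
    """Return the file's text, trying each encoding in order."""
    with open(path, 'rb') as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reached if every listed encoding failed; iso-8859-1 accepts
    # any byte sequence, so with the defaults this is never reached.
    return raw.decode('utf-8', errors='replace')
```

In `main()`, `text = read_vtt(vtt_file_name)` would then replace the `open(...)`/`f.read()` pair. Note that ISO-8859-1 never raises, so a file in some other legacy encoding would be decoded without error but with the wrong characters.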
My fork is at https://gist.github.com/xloem/f7ecb8668c14ef07718b4d3447ebe9a2 . This fork handles unexpected encodings and multiple vtt files (@claudchereji). If people work on this further, I request that somebody make a git repository for it to track the work.
Kudos for the awesome work. Just a question: how do I make it remove the timestamps altogether? I don't even want the HH:MM.
Thanks
It looks like the timestamp output is produced by line 66 in this file (the yield line that follows the match on a time format), but I'm not sure.
I am also seeking a way to remove the timestamp. I'm very new to python so I am struggling to follow where I can tweak the code without breaking it. But I think it's falling off somewhere because it's removing duplicates. I tried making another def later on with re.sub but no dice.
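One possible way to drop the timestamps without touching the duplicate handling, offered as a sketch rather than the author's method: filter out the bare `HH:MM` lines in a separate step. `drop_timestamps` and `TIMESTAMP` are hypothetical names; the pattern is the same one the script already uses to recognise timestamp lines.

```python
import re

# Same shape as the timestamp lines produced by remove_tags (two fields, NN:NN)
TIMESTAMP = re.compile(r'^\d{2}:\d{2}$')


def drop_timestamps(lines):
    """Yield only caption lines, skipping bare timestamp lines."""
    for line in lines:
        if not TIMESTAMP.match(line):
            yield line
```

In `main()` this would go between the two merge steps, for example `lines = drop_timestamps(lines)` right after `lines = merge_duplicates(lines)`, so that `merge_short_lines` only ever sees caption text.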
Alternative is https://github.com/vuslatx/vtt-to-plain-text
Working great.
This looks like what I want but I am not sure of how to use it.
If you want to join me on a stream, we can walk through it and record a podcast/video for HackerPublicRadio.org! Just hit me up sometime: freeload01____yahoo.com
Thanks a lot for the script @glasslion.
I just found this script after I made this one:
https://gist.github.com/arturmartins/1c78de3e8c21ffce81a17dc2f2181de4
Might be of help to some.
Would a command-line tool with interface below be welcome?
yt-text bZ6pA--F3D4 > subtitles.txt
or better with full URL?
yt-text https://youtu.be/bZ6pA--F3D4 > subtitles.txt
Yes, it would be 😁
EDIT: For anyone interested, https://gist.github.com/epogrebnyak/ba87ba52f779f7ebd93b04b2af1059aa
Hi everyone, wrapped this script here: https://github.com/epogrebnyak/justsubs
Sample usage:
from justsubs import Video
subs = Video("KzWS7gJX5Z8").subtitles(language="en-uYU-mmqFLq8")
subs.download()
print(subs.get_text_blocks()[:10])
print(subs.get_plain_text()[:550])
It seems that simply "en" does not work; you need "en-uYU-mmqFLq8".
Also, pip install justsubs should work.
For YouTube subtitles, there were some timestamps and metadata remaining while using the script.
I've fixed it here:
https://gist.github.com/florentroques/c08bbe54fba42ec56c9d48229ed9c49b
Use a for loop, or:
find . -iname "*.vtt" -exec python vtt2text.py '{}' \;
Reference: https://github.com/freeload101/SCRIPTS/blob/master/Bash/Stream_to_Text_with_Keywords.sh