@glasslion
Last active March 9, 2025 02:54
This script converts YouTube subtitle files (vtt) to plain text.
r"""
Convert YouTube subtitles (vtt) to human-readable text.

Download only the subtitles from YouTube with youtube-dl:

    youtube-dl --skip-download --convert-subs vtt <video_url>

Note that the default subtitle format provided by YouTube is ass, which is hard
to process with simple regex. Luckily youtube-dl can convert ass to vtt, which
is easier to process.

To convert all vtt files inside a directory:

    find . -name "*.vtt" -exec python vtt2text.py {} \;
"""
import sys
import re


def remove_tags(text):
    """
    Remove vtt markup tags
    """
    tags = [
        r'</c>',
        r'<c(\.color\w+)?>',
        r'<\d{2}:\d{2}:\d{2}\.\d{3}>',
    ]

    for pat in tags:
        text = re.sub(pat, '', text)

    # extract timestamp, only keep HH:MM
    text = re.sub(
        r'(\d{2}:\d{2}):\d{2}\.\d{3} --> .* align:start position:0%',
        r'\g<1>',
        text
    )

    # blank out lines that contain only whitespace
    text = re.sub(r'^\s+$', '', text, flags=re.MULTILINE)
    return text


def remove_header(lines):
    """
    Remove vtt file header
    """
    pos = -1
    for mark in ('##', 'Language: en',):
        if mark in lines:
            pos = lines.index(mark)
    lines = lines[pos+1:]
    return lines


def merge_duplicates(lines):
    """
    Remove duplicated subtitles. Duplicates are always adjacent.
    """
    last_timestamp = ''
    last_cap = ''
    for line in lines:
        if line == "":
            continue
        if re.match(r'^\d{2}:\d{2}$', line):
            if line != last_timestamp:
                yield line
                last_timestamp = line
        else:
            if line != last_cap:
                yield line
                last_cap = line
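

def merge_short_lines(lines):
    buffer = ''
    for line in lines:
        if line == "" or re.match(r'^\d{2}:\d{2}$', line):
            # flush pending caption text so it stays ahead of the timestamp
            if buffer:
                yield buffer.strip()
                buffer = ''
            yield '\n' + line
            continue

        if len(line + buffer) < 80:
            buffer += ' ' + line
        else:
            yield buffer.strip()
            buffer = line
    yield buffer.strip()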


def main():
    vtt_file_name = sys.argv[1]
    # escape the dot so only a literal ".vtt" suffix is replaced
    txt_name = re.sub(r'\.vtt$', '.txt', vtt_file_name)
    with open(vtt_file_name) as f:
        text = f.read()
    text = remove_tags(text)
    lines = text.splitlines()
    lines = remove_header(lines)
    lines = merge_duplicates(lines)
    lines = list(lines)
    lines = merge_short_lines(lines)
    lines = list(lines)

    with open(txt_name, 'w') as f:
        for line in lines:
            f.write(line)
            f.write("\n")


if __name__ == "__main__":
    main()
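
To make the tag-stripping step concrete, here is a small self-contained sketch that applies the same three patterns used in `remove_tags` to a single cue line. The helper name `strip_inline_tags` and the sample cue text are made up for illustration; real YouTube captions may contain markup this does not cover.

```python
import re

# The same three patterns that remove_tags uses for inline markup.
TAG_PATTERNS = [
    r'</c>',                          # closing class tag
    r'<c(\.color\w+)?>',              # opening class tag, optionally with a color class
    r'<\d{2}:\d{2}:\d{2}\.\d{3}>',    # word-level timestamp
]


def strip_inline_tags(text):
    """Remove inline <c> tags and word-level timestamps from caption text."""
    for pat in TAG_PATTERNS:
        text = re.sub(pat, '', text)
    return text


# Hypothetical cue line in the style of YouTube auto-captions.
sample = 'hello<00:00:01.200><c> world</c><00:00:01.500><c.colorE5E5E5> again</c>'
cleaned = strip_inline_tags(sample)
print(cleaned)
```

Only the inline markup is removed here; the cue timing lines (`HH:MM:SS.mmm --> ...`) are handled by the separate substitution in `remove_tags`.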
@freeload101

freeload101 commented Nov 9, 2021

when i run this with the asterisk, the program only converts one file. not all of them.

use a for loop ? or

find . -iname "*.vtt" -exec python vtt2text.py '{}' \;

Reference: https://github.com/freeload101/SCRIPTS/blob/master/Bash/Stream_to_Text_with_Keywords.sh
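
If you would rather do the looping in Python than in the shell, something along these lines might work. It is only a sketch: `find_vtt_files` and `convert_all` are hypothetical helpers, not part of the gist, and `convert` stands in for whatever function you wrap around the script's per-file logic.

```python
from pathlib import Path


def find_vtt_files(root):
    """Return all .vtt files under root, matching the extension case-insensitively."""
    return sorted(
        p for p in Path(root).rglob('*')
        if p.is_file() and p.suffix.lower() == '.vtt'
    )


def convert_all(root, convert):
    """Call convert(path) for every .vtt file under root; return the paths processed."""
    paths = find_vtt_files(root)
    for path in paths:
        convert(path)
    return paths
```

For example, `convert_all('.', print)` would just list the files it finds, which is a cheap way to check the matching before converting anything.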

@claudchereji

find . -iname "*.vtt" -exec python vtt2text.py '{}' \;

how do I run this? sorry I'm still learning, I feel like a script kiddie

@freeload101

find . -iname "*.vtt" -exec python vtt2text.py '{}' \;

how do I run this? sorry I'm still learning, I feel like a script kiddie

Well, you know what a script kiddie is, so you're halfway there! Not sure this is the place to have this conversation so hit me up on Discord operat0r#1379 or 404.647.4250 -RMcCurdy.com

@xloem

xloem commented Dec 1, 2021

@claudchereji it's a script for a linux terminal. it's also not hard to modify the python script so as to handle multiple files.

I had trouble with international characters using this script with python3 (works with python2). seems youtube doesn't use utf-8 for everything. passing encoding='iso-8859-1' to preserve bytes when opening the vtt file fixed this for me. i plan to fork the gist.
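
A rough sketch of that idea, in case it helps: try UTF-8 first and fall back to iso-8859-1 only when decoding fails. The function name `read_text_with_fallback` and the default encoding list are assumptions for illustration, not what the gist or the fork actually does.

```python
def read_text_with_fallback(path, encodings=('utf-8', 'iso-8859-1')):
    """Return the file's text using the first encoding that decodes without error."""
    last_error = None
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError as err:
            # Remember the failure and try the next candidate encoding.
            last_error = err
    if last_error is None:
        raise ValueError('no encodings were given')
    raise last_error
```

Note that iso-8859-1 accepts any byte sequence, so the fallback never fails; it preserves the bytes but the characters may not be the ones intended if the file was really in some other encoding.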

@xloem

xloem commented Dec 1, 2021

My fork is at https://gist.github.com/xloem/f7ecb8668c14ef07718b4d3447ebe9a2 . This fork handles unexpected encodings and multiple vtt files (@claudchereji ). If people work on this further I request somebody make a git repository for it to track the work.

@ashutoshdubey133

Kudos for the awesome work. Just a question, how do I make it such that it removes the time stamp altogether. I don't even want the HH:MM.
Thanks

@xloem

xloem commented Dec 16, 2021

It looks like timestamp output is produced by line 66 in this file (yield line after matching a time format), not sure.
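
One possible way to do it, as an untested-against-real-files sketch: filter out the bare HH:MM lines before the text is written. `drop_timestamps` is a hypothetical helper, and it assumes timestamps show up as lines containing nothing but HH:MM, which is what `remove_tags` appears to leave behind.

```python
import re

# Matches a line that is nothing but an HH:MM timestamp.
TIMESTAMP_RE = re.compile(r'^\d{2}:\d{2}$')


def drop_timestamps(lines):
    """Yield only caption lines, skipping bare HH:MM timestamp lines."""
    for line in lines:
        # strip() also handles the leading newline that merge_short_lines adds.
        if TIMESTAMP_RE.match(line.strip()):
            continue
        yield line
```

In `main()` this could presumably be slotted in right after `merge_duplicates`, e.g. `lines = drop_timestamps(lines)`, so the later merging step only ever sees caption text.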

@Arkohub

Arkohub commented Jun 28, 2022

I am also seeking a way to remove the timestamp. I'm very new to python so I am struggling to follow where I can tweak the code without breaking it. But I think it's falling off somewhere because it's removing duplicates. I tried making another def later on with re.sub but no dice.

@vuslatx

vuslatx commented Jul 25, 2022

@haazy

haazy commented Nov 9, 2022

Alternative is https://github.com/vuslatx/vtt-to-plain-text

Working great.

This looks like what I want but I am not sure of how to use it.

@freeload101

Alternative is https://github.com/vuslatx/vtt-to-plain-text
Working great.

This looks like what I want but I am not sure of how to use it.

if you want to join me on a stream we can walk through it and record a podcast/video for HackerPublicRadio.org! just hit me up sometime freeload01____yahoo.com

@gala8y

gala8y commented Jan 11, 2023

Thanks a lot for the script @glasslion.

@arturmartins

Just found this script after I made this one:
https://gist.github.com/arturmartins/1c78de3e8c21ffce81a17dc2f2181de4

Might be of help to some.

@epogrebnyak

Would a command-line tool with interface below be welcome?

yt-text bZ6pA--F3D4 > subtitles.txt

or better with full URL?

yt-text https://youtu.be/bZ6pA--F3D4 > subtitles.txt

@ibrahimkettaneh

ibrahimkettaneh commented Jan 26, 2024

Would a command-line tool with interface below be welcome?

yt-text bZ6pA--F3D4 > subtitles.txt

or better with full URL?

yt-text https://youtu.be/bZ6pA--F3D4 > subtitles.txt

Yes, it would be 😁

EDIT: For anyone interested, https://gist.github.com/epogrebnyak/ba87ba52f779f7ebd93b04b2af1059aa

@epogrebnyak

Hi everyone, wrapped this script here: https://github.com/epogrebnyak/justsubs

Sample usage:

from justsubs import Video

subs = Video("KzWS7gJX5Z8").subtitles(language="en-uYU-mmqFLq8")
subs.download()
print(subs.get_text_blocks()[:10])
print(subs.get_plain_text()[:550])

It seems simply "en" does not work, need "en-uYU-mmqFLq8".

@epogrebnyak

Also pip install justsubs should work

@florentroques

For YouTube subtitles, there were some timestamps and metadata remaining while using the script.

I've fixed it here:
https://gist.github.com/florentroques/c08bbe54fba42ec56c9d48229ed9c49b
