@davidcortesortuno
Created August 30, 2020 12:04
Remove duplicated lines from a .vtt file generated by youtube-dl when downloading auto-generated YouTube subtitles
# Remove duplicated lines from a .vtt file generated by youtube-dl when
# downloading auto-subs from a Youtube video using the --write-auto-sub option
# This script only prints the lines so save the edited subs as:
#
# python this_script.py original_sub.vtt > new_sub.vtt
import re
import sys

f = open(sys.argv[1])
patt = re.compile(r'^\d\d:\d\d:\d\d', re.M)
dup_line = ''
for line in f:
    # line = f.readline()
    # Find a line starting with a time stamp: 00:13:23 ...
    res = re.findall(patt, line)
    if res:
        # If so, print this line, then read the next line and store it in
        # dup_line.
        # In a later iteration, when we find another section starting with a
        # timestamp, dup_line is compared with the line below that timestamp.
        # If they are equal, just skip it and do not print the duplicated line.
        # Otherwise, store the new line as the next candidate duplicate.
        print(line, end='')
        next_line = f.readline()
        if dup_line and next_line == dup_line:
            dup_line = ''
            res = []
            continue
        else:
            dup_line = next_line
            print(dup_line, end='')
            res = []
    else:
        print(line, end='')
f.close()
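
The same approach can be sketched as a function that works on a list of lines, which makes it easier to try on a small sample without a file. This is only an illustration of the logic above, not a replacement for the script: the names `dedupe_vtt_lines` and `TIMESTAMP` are hypothetical, and the sample cues are invented rather than taken from a real youtube-dl download.

```python
import re

# Hypothetical name: matches a line that starts with a timestamp such as 00:13:23
TIMESTAMP = re.compile(r'^\d\d:\d\d:\d\d')


def dedupe_vtt_lines(lines):
    """Return the lines with a repeated first cue line dropped.

    Mirrors the script above: the line right after a timestamp is
    remembered, and if the line right after the next timestamp is
    identical, that second copy is skipped.
    """
    out = []
    dup_line = ''
    it = iter(lines)
    for line in it:
        out.append(line)
        if not TIMESTAMP.match(line):
            continue
        # Consume the line that follows the timestamp ('' at end of input)
        next_line = next(it, '')
        if dup_line and next_line == dup_line:
            # Same text as under the previous timestamp: skip it
            dup_line = ''
        else:
            dup_line = next_line
            if next_line:
                out.append(next_line)
    return out


# Invented sample: the second cue repeats the first cue's text
sample = [
    "WEBVTT\n",
    "\n",
    "00:00:00.000 --> 00:00:02.000\n",
    "hello world\n",
    "\n",
    "00:00:02.000 --> 00:00:04.000\n",
    "hello world\n",
    "next line\n",
    "\n",
]
cleaned = dedupe_vtt_lines(sample)
print(''.join(cleaned), end='')
```

Because the function takes any iterable of lines, an open file object could also be passed to it directly.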
@volehuy1998
Hello, is it OK?
