Created August 30, 2020 12:04
Remove duplicated lines from a .vtt file generated by youtube-dl when downloading auto-generated YouTube subtitles
# Remove duplicated lines from a .vtt file generated by youtube-dl when
# downloading auto-subs from a YouTube video using the --write-auto-sub option.
# This script only prints the lines, so save the edited subs as:
#
#   python this_script.py original_sub.vtt > new_sub.vtt

import re
import sys

# Matches a line starting with a timestamp such as 00:13:23
patt = re.compile(r'^\d\d:\d\d:\d\d')

dup_line = ''
with open(sys.argv[1]) as f:
    for line in f:
        if patt.match(line):
            # Print the timestamp line, then read the line right below it.
            # That line is kept in dup_line. When the next timestamp section
            # starts with the same line, it is a duplicate: skip it and reset
            # dup_line. Otherwise print it and remember it for the next match.
            print(line, end='')
            next_line = f.readline()
            if dup_line and next_line == dup_line:
                dup_line = ''
            else:
                dup_line = next_line
                print(dup_line, end='')
        else:
            print(line, end='')
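For anyone who wants to reuse or test the logic without going through stdin/stdout, here is a minimal sketch of the same idea as a function over a list of lines. The names `TIMESTAMP` and `drop_repeated_cue_lines` are hypothetical (they are not part of the script above), and the sketch keeps the script's behavior as-is, including the reset of the remembered line after each skipped duplicate.

```python
import re

# Matches a line starting with a timestamp such as 00:13:23
TIMESTAMP = re.compile(r'^\d\d:\d\d:\d\d')


def drop_repeated_cue_lines(lines):
    """Hypothetical helper: same logic as the script, but returns a list."""
    out = []
    dup_line = ''
    it = iter(lines)
    for line in it:
        out.append(line)
        if not TIMESTAMP.match(line):
            continue
        # The line right below a timestamp is the candidate duplicate.
        next_line = next(it, None)
        if next_line is None:
            break  # the file ended right after a timestamp line
        if dup_line and next_line == dup_line:
            dup_line = ''  # duplicate found: skip it and reset
        else:
            dup_line = next_line
            out.append(next_line)
    return out
```

Assuming the subtitles are in `original_sub.vtt`, it could be used as `new_lines = drop_repeated_cue_lines(open('original_sub.vtt').readlines())` and the result written back with `writelines`.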
Were you able to find a fix? This is becoming a problem for us. Attached is a .vtt file from YouTube that needs duplicates fixed: https://drive.google.com/file/d/163Y-rg2qouJOQ2rjeudQ3dAFrE7TF_2M/view?usp=sharing
Hello, is it OK?
This is the video that generated dup lines for me:
https://www.youtube.com/watch?v=ubOqOCukR40&t=941s&ab_channel=GabrielHerrera