Created August 30, 2020 12:04
Remove duplicated lines from a .vtt file generated by youtube-dl when downloading auto-generated YouTube subtitles
# Remove duplicated lines from a .vtt file generated by youtube-dl when
# downloading auto-subs from a YouTube video using the --write-auto-sub option.
# This script only prints the lines, so save the edited subs as:
#
#     python this_script.py original_sub.vtt > new_sub.vtt

import re
import sys

f = open(sys.argv[1])

patt = re.compile(r'^\d\d:\d\d:\d\d', re.M)
dup_line = ''
for line in f:
    # line = f.readline()
    # Find a line starting with a time stamp: 00:13:23 ...
    res = re.findall(patt, line)
    if res:
        # If so, print this line and read the next line, which we store
        # in dup_line.
        # On a later iteration, when we find another section starting with a
        # timestamp, the line below it is compared with dup_line. If they are
        # equal, skip it and do not print the duplicated line.
        # Otherwise, store the new line as the next candidate duplicate.
        print(line, end='')
        next_line = f.readline()
        if dup_line and next_line == dup_line:
            dup_line = ''
            res = []
            continue
        else:
            dup_line = next_line
            print(dup_line, end='')
            res = []
    else:
        print(line, end='')

f.close()
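To see what the loop above does on a small input, the same logic can be wrapped in a function that takes a list of lines instead of a file. This is only a sketch: the name `drop_repeated_first_lines` and the sample cue text are made up for illustration, and Python 3 is assumed.

```python
import re

TIMESTAMP = re.compile(r'^\d\d:\d\d:\d\d')


def drop_repeated_first_lines(lines):
    """Mirror the script's loop on a list of lines (hypothetical helper)."""
    out = []
    dup_line = ''
    it = iter(lines)
    for line in it:
        out.append(line)  # every line from the outer loop is kept
        if not TIMESTAMP.match(line):
            continue
        # The line right after a timestamp is the first text line of the cue.
        next_line = next(it, '')
        if dup_line and next_line == dup_line:
            # Same as the first text line stored from an earlier cue: skip it.
            dup_line = ''
        else:
            dup_line = next_line
            if next_line:
                out.append(next_line)
    return out


# Made-up sample: two cues carrying the same text.
sample = [
    "00:00:01.000 --> 00:00:03.000\n",
    "hello world\n",
    "\n",
    "00:00:03.000 --> 00:00:03.010\n",
    "hello world\n",
    "\n",
]
print("".join(drop_repeated_first_lines(sample)), end='')
```

On this sample the second "hello world" line is dropped, but the timestamp line of the second cue is still printed, so the output contains a cue with no text. Only the line directly below a timestamp is ever compared.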
It is not working for me either.
1021
00:20:18,630 --> 00:20:20,540
aquí desde medellín me despido nos vemos
1022
00:20:20,540 --> 00:20:20,550
aquí desde medellín me despido nos vemos
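For input shaped like the sample above, one possible approach is to drop the whole repeated cue (index, timing, and text) rather than only the text line. The following is a rough sketch, not a tested fix for the script: it assumes the cues are separated by blank lines, as in a regular .srt file, and the function name `drop_repeated_cues` is made up here.

```python
def drop_repeated_cues(srt_text):
    """Drop a cue when its text matches the previously kept cue (sketch)."""
    kept = []
    prev_text = None
    for block in srt_text.strip().split("\n\n"):
        lines = block.splitlines()
        # The cue text is everything after the line holding the "-->" arrow.
        arrow = next((i for i, l in enumerate(lines) if "-->" in l), None)
        if arrow is None:
            kept.append(block)  # header or note: keep untouched
            continue
        text = "\n".join(lines[arrow + 1:]).strip()
        if text and text == prev_text:
            continue  # same text as the previously kept cue: drop it
        prev_text = text
        kept.append(block)
    return "\n\n".join(kept) + "\n"
```

Note that this also removes a cue when the speaker really does say the same sentence twice in a row, and it does not renumber the remaining cues.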
I am interested in solving this issue, but I need your help. Could you please share the URLs of the videos whose .vtt files have duplicated lines?
Yes, please share a video that can be tested to update the code :D
This is the video that generated dup lines for me:
https://www.youtube.com/watch?v=ubOqOCukR40&t=941s&ab_channel=GabrielHerrera
Were you able to find a fix? This is becoming a problem for us. Attached is a .vtt file from YouTube that needs duplicates fixed: https://drive.google.com/file/d/163Y-rg2qouJOQ2rjeudQ3dAFrE7TF_2M/view?usp=sharing
Hello, is it okay?
It is not working for me. I have this:
03:15.654 --> 03:20.325
granting Kyonan University,
a mere private university,
03:15.654 --> 03:20.325
granting Kyonan University,
a mere private university,
It does not remove the duplicates.
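The timing lines in this sample have the form MM:SS.mmm, without an hours field, so the script's pattern `^\d\d:\d\d:\d\d` never matches them and every line is printed unchanged. A possible workaround, sketched here under the assumption that any line containing `-->` starts a new cue, is to group the lines per cue and skip a cue that repeats the previous one exactly. The name `drop_identical_cues` is made up for this example.

```python
def drop_identical_cues(lines):
    """Remove a cue that repeats the previous cue exactly (sketch)."""
    groups = [[]]  # groups[0] holds any header lines before the first cue
    for line in lines:
        if "-->" in line:
            groups.append([])  # a timing line opens a new cue
        groups[-1].append(line)

    out = list(groups[0])
    prev = None
    for cue in groups[1:]:
        # Compare timing and text, ignoring blank lines and trailing spaces.
        key = [l.strip() for l in cue if l.strip()]
        if key == prev:
            continue
        prev = key
        out.extend(cue)
    return out
```

It could be used as `sys.stdout.writelines(drop_identical_cues(open(sys.argv[1])))`. A numeric index line placed before the timing line, as in .srt files, would end up in the previous group, so this is only meant for input shaped like the sample above.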