Hello, I personally was looking for a simple minimal script that performed just this function: parsing vtt, discarding timecodes, merging chronologically close lines into a larger block, and outputting the result in a human-readable txt file. Just wanted to say that in my use case I prefer the way it merges multiple lines into a less-fine-grained time code.
@glasslion, thanks a lot for sharing this script!
vtt2text.py is a nice little script by glasslion that I just found. It seems to do what I am looking for: it converts subtitle files, even closed-captioning "roll-up" style WebVTT files like the ones I have, into a human-friendly full-page transcript.
Here are some usage notes:
# install youtube-dl & clone glasslion's vtt2text.py script
$ git clone https://gist.github.com/glasslion/b2fcad16bc8a9630dbd7a945ab5ebf5e caps2txt
Cloning into 'caps2txt'..
$ cd ./caps2txt
$ youtube-dl -o ytdl-subs --skip-download --write-sub --sub-format vtt "https://www.youtube.com/watch?v=KzWS7gJX5Z8"
[youtube] KzWS7gJX5Z8: Downloading webpage
[info] Writing video subtitles to: ytdl-subs.en.vtt
# 'l' is alias for 'tree --dirsfirst -aFCNL 1'
$ l
.
├── .git/
├── ytdl-subs.en.vtt
└── vtt2text.py
# convert...
$ python3 vtt2text.py ytdl-subs.en.vtt
$ l
.
├── .git/
├── vtt2text.py
├── ytdl-subs.en.txt
└── ytdl-subs.en.vtt
1 directory, 3 files
$ head -n 40 ytdl-subs.en.vtt ytdl-subs.en.txt
==> ytdl-subs.en.vtt <==
WEBVTT
Kind: captions
Language: en
00:03:54.333 --> 00:03:55.201 align:start position:0%
TH<00:03:54.366><c>E </c><00:03:54.399><c>SE</c><00:03:54.433><c>RG</c><00:03:54.466><c>EA</c><00:03:54.500><c>NT</c><00:03:54.533><c> A</c><00:03:54.566><c>T </c><00:03:54.600><c>AR</c><00:03:54.633><c>MS</c><00:03:54.666><c>: </c><00:03:54.700><c>MA</c><00:03:54.733><c>DA</c><00:03:54.766><c>M</c><00:03:55.101><c> </c>
00:03:55.201 --> 00:03:55.334 align:start position:0%
THE SERGEANT AT ARMS: MADAM
00:03:55.334 --> 00:03:57.236 align:start position:0%
THE SERGEANT AT ARMS: MADAM
SP<00:03:55.367><c>EA</c><00:03:55.401><c>KE</c><00:03:55.434><c>R,</c><00:03:55.468><c> T</c><00:03:55.501><c>HE</c><00:03:56.102><c> V</c><00:03:56.135><c>IC</c><00:03:56.168><c>E </c><00:03:56.202><c>PR</c><00:03:56.235><c>ES</c><00:03:56.268><c>ID</c><00:03:56.302><c>EN</c><00:03:56.335><c>T </c><00:03:56.368><c>AN</c><00:03:56.402><c>D</c><00:03:57.103><c> </c>
00:03:57.236 --> 00:03:57.369 align:start position:0%
SPEAKER, THE VICE PRESIDENT AND
00:03:57.369 --> 00:07:49.535 align:start position:0%
SPEAKER, THE VICE PRESIDENT AND
TH<00:03:57.403><c>E </c><00:03:57.436><c>UN</c><00:03:57.470><c>IT</c><00:03:57.503><c>ED</c><00:03:57.536><c> S</c><00:03:57.570><c>TA</c><00:03:57.603><c>TE</c><00:03:57.636><c>S </c><00:03:57.670><c>SE</c><00:03:57.703><c>NA</c><00:03:57.736><c>TE</c><00:03:57.770><c>.</c>
00:07:49.535 --> 00:07:50.603 align:start position:0%
TH<00:07:49.568><c>E </c><00:07:49.601><c>SP</c><00:07:49.635><c>EA</c><00:07:49.668><c>KE</c><00:07:49.702><c>R:</c><00:07:49.735><c> T</c><00:07:49.768><c>HE</c><00:07:50.303><c> H</c><00:07:50.336><c>OU</c><00:07:50.369><c>SE</c><00:07:50.403><c> C</c><00:07:50.436><c>OM</c><00:07:50.469><c>ES</c><00:07:50.503><c> </c>
00:07:50.603 --> 00:07:50.736 align:start position:0%
THE SPEAKER: THE HOUSE COMES
00:07:50.736 --> 00:07:54.773 align:start position:0%
THE SPEAKER: THE HOUSE COMES
TO<00:07:50.770><c>UR</c><00:07:50.803><c>ED</c><00:07:50.836><c> F</c><00:07:50.870><c>OR</c><00:07:51.304><c> T</c><00:07:51.337><c>HI</c><00:07:51.370><c>S</c><00:07:54.506><c> I</c><00:07:54.540><c>MP</c><00:07:54.573><c>OR</c><00:07:54.606><c>TA</c><00:07:54.640><c>NT</c><00:07:54.673><c>, </c>
00:07:54.773 --> 00:07:54.907 align:start position:0%
TOURED FOR THIS IMPORTANT,
==> ytdl-subs.en.txt <==
00:03
THE SERGEANT AT ARMS: MADAM SPEAKER, THE VICE PRESIDENT AND
00:07
THE UNITED STATES SENATE. THE SPEAKER: THE HOUSE COMES
TOURED FOR THIS IMPORTANT, HISTORIC MEETING. LET US REMIND THAT EACH SIDE,
00:08
HOUSE AND SENATE, DEMOCRATS AND REPUBLICANS, EACH HAVE 11
MEMBERS ALLOWED TO BE PRESENT ON THE FLOOR. OTHERS MAY BE IN THE GALLERY.
THIS IS AT THE GUIDANCE OF THE OFFICIATING -- ATTENDING
PHYSICIAN AND THE SERGEANT AT ARMS. THE GENTLEMAN ON THE REPUBLICAN
SIDE OF THE AISLE WILL PLEASE OBSERVE THE SOCIAL DISTANCING
AND AGREE TO WHAT WE HAVE, 11 MEMBERS ON EACH SIDE, SO THAT --
RESPONSIBILITIES TO THIS CHAMBER, TO THIS RESPONSIBILITY, AND TO THIS HOUSE OF
REPRESENTATIVES. PLEASE EXIT THE FLOOR IF YOU DO
NOT HAVE AN ASSIGNED ROLE FROM YOUR LEADERSHIP.
YOU CAN SHARE WITH YOUR STAFF IF YOU WANT TO HAVE A FEW MORE, BUT
00:09
YOU CANNOT BE TOGETHER ON THE FLOOR OF THE HOUSE WITH THAT
MANY PEOPLE IN HERE. I'LL THANK THE SENATE AND THOSE -- LET'S GO.
LET'S JUST START. >> MADAM SPEAKER. VICE PRESIDENT PENCE: MADAM
SPEAKER, MEMBERS OF CONGRESS, PURSUANT TO THE CONSTITUTION AND
THE LAWS OF THE UNITED STATES, THE SENATE AND HOUSE OF
REPRESENTATIVES ARE MEETING IN JOINT SESSION TO VERIFY THE
CERTIFICATES AND COUNT THE VOTES OF THE ELECTORS IN THE SEVERAL
STATES FOR PRESIDENT AND VICE PRESIDENT OF THE UNITED STATES.
AFTER ASCERTAINMENT HAS BEEN HAD, CORRECT IN FORM, THE
TELLERS WILL COUNT AND MAKE A LIST OF THE VOTES CAST BY THE
00:10
ELECTORS OF THE SEVERAL STATES. THE TELLERS ON THE PART OF THE
TWO HOUSES HAVE TAKEN THEIR PLACES AT THE CLERK'S DESK.
WITHOUT OBJECTION, THE TELLERS WILL DISPENSED WITH THE READING
OF THE FORMAL PORTIONS OF THE CERTIFICATES. AFTER ASCERTAINING THAT THE
CERTIFICATES ARE REGULAR IN FORM AND AUTHENTIC, THE TELLERS WILL
ANNOUNCE THE VOTES CAST BY THE ELECTORS FOR EACH STATE,
BEGINNING WITH ALABAMA. WHICH THE PARLIAMENTARIANS ADVISE ME IS THE ONLY
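
For anyone curious what such a conversion boils down to, here is a minimal Python sketch. It is not glasslion's code and it does not reproduce the minute markers; it only strips inline tags, drops timing lines, and skips a line when it repeats the previous one. The function name `vtt_to_text` and the abbreviated `SAMPLE` are my own, modelled on the listing above, and the timing regex assumes timestamps always include the hours field.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")  # inline tags such as <c> or <00:03:55.367>
TIMING_RE = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> ")  # cue timing line (assumes hh:mm:ss.ttt)


def vtt_to_text(vtt):
    """Return the caption text as one string, without tags, timecodes, or consecutive repeats."""
    kept = []
    in_cues = False  # everything before the first timing line is treated as header
    for raw in vtt.splitlines():
        if TIMING_RE.match(raw):
            in_cues = True
            continue
        if not in_cues:
            continue
        text = TAG_RE.sub("", raw).strip()
        # roll-up captions repeat the previous line, so keep a line only when it changes
        if text and (not kept or kept[-1] != text):
            kept.append(text)
    return " ".join(kept)


# Abbreviated sample (hypothetical, shortened from the listing above)
SAMPLE = "\n".join([
    "WEBVTT",
    "Kind: captions",
    "Language: en",
    "",
    "00:03:55.201 --> 00:03:55.334 align:start position:0%",
    "THE SERGEANT AT ARMS: MADAM",
    "",
    "00:03:55.334 --> 00:03:57.236 align:start position:0%",
    "THE SERGEANT AT ARMS: MADAM",
    "SP<00:03:55.367><c>EA</c><00:03:55.401><c>KE</c><00:03:55.434><c>R,</c>",
])

print(vtt_to_text(SAMPLE))
```

Comparing only against the previous kept line is a deliberate simplification: it is enough for the roll-up pattern shown here, but it would also collapse a line that a speaker genuinely says twice in a row.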
@Crowdscriber/caption-parser - scala vtt parser that dedupes cues in roll-up style captions
Check out the implementation of how de-duping works: a state machine plus a regex matcher that discriminates roll-up cues from finished ones.
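
To make the idea concrete, here is a rough Python sketch of a state machine plus regex in that spirit. It is my own guess at the approach, not a port of the Scala code in caption-parser: a cue whose last line still carries inline `<hh:mm:ss.ttt>` timestamps is treated as "rolling", and one without them as "finished". The names `State`, `INLINE_TS`, and `finished_lines` are hypothetical.

```python
import re
from enum import Enum

INLINE_TS = re.compile(r"<\d{2}:\d{2}:\d{2}\.\d{3}>")  # word-level timestamp inside a cue
TAG = re.compile(r"<[^>]*>")                            # any inline tag


class State(Enum):
    IDLE = "idle"        # last cue seen was a finished line (or nothing yet)
    ROLLING = "rolling"  # last cue seen was still being typed out


def finished_lines(cue_texts):
    """Return each caption line once, in order, from a list of raw cue texts."""
    state = State.IDLE
    pending = None  # cleaned text of the line that is currently rolling up
    out = []
    for text in cue_texts:
        # ignore lines that are empty once tags are removed (e.g. a lone "</c>")
        lines = [ln for ln in text.split("\n") if TAG.sub("", ln).strip()]
        if not lines:
            continue
        last = lines[-1]
        clean = TAG.sub("", last).strip()
        rolling = bool(INLINE_TS.search(last))
        if rolling:
            if state is State.ROLLING:
                out.append(pending)  # previous line never got a finished cue: flush it
            state, pending = State.ROLLING, clean
        else:
            if not out or out[-1] != clean:
                out.append(clean)
            state, pending = State.IDLE, None
    if state is State.ROLLING:
        out.append(pending)
    return out


cues = [
    "TOURED FOR THIS IMPORTANT, ",
    "TOURED FOR THIS IMPORTANT, \nHI<00:07:54.940><c>ST</c><00:07:54.973><c>OR</c><00:07:55.007><c>IC</c><00:07:55.475><c> M</c><00:07:55.508><c>EE</c><00:07:55.541><c>TI</c><00:07:55.575><c>NG</c><00:07:55.608><c>.</c>",
    "HISTORIC MEETING.",
]
print(finished_lines(cues))
```

Only the last line of each cue is inspected, on the assumption that earlier lines are carry-over from the previous cue, as in the YouTube samples in this thread.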
@bausano and others, if you want more control over the parsing and the structure of the output format, check out the webvtt-py Python package. I learned about it from a blog post written by William Morgan.
He wrote a tutorial showing how to programmatically fetch vtt caption files from Google/YouTube in bulk, then use webvtt and a pandas DataFrame in Python to parse and extract the caption content, including formatting it into tidy CSV files to use as a downstream NLP corpus. Sounds like just what you are looking for...
Creating an NLP data set from YouTube subtitles. William Morgan Mar 8, 2019·12 min read
This project started out just like most data science projects do: collecting data. In my case I needed subtitles from videos on YouTube. Not just any videos, but videos of math lectures. The idea was to process the subtitles using NLP techniques and build a classifier that could differentiate subjects in mathematics. In this article I will show you both of the ways I like to “scrape” subtitles from YouTube videos: Manually downloading and cleaning the subtitles. Programmatically obtaining the subtitles using the API and youtube-dl.
from https://medium.com/@morga046/creating-an-nlp-data-set-from-youtube-subtitles-fb59c0955c2
# code details from W Morgan (python):
import os

import pandas as pd
import webvtt

# First, we need a list of the .vtt files:
filenames_vtt = [os.fsdecode(file) for file in os.listdir(os.getcwd()) if os.fsdecode(file).endswith(".vtt")]
# Check file names
filenames_vtt[:2]

# Then, we write a function to extract the information and store it.
def convert_vtt(filenames):
    # create an assets folder if one does not yet exist
    if not os.path.isdir('{}/assets'.format(os.getcwd())):
        os.makedirs('assets')
    # extract the text and times from each vtt file
    for file in filenames:
        captions = webvtt.read(file)
        text_time = pd.DataFrame()
        text_time['text'] = [caption.text for caption in captions]
        text_time['start'] = [caption.start for caption in captions]
        text_time['stop'] = [caption.end for caption in captions]
        text_time.to_csv('assets/{}.csv'.format(file[:-4]), index=False)  # -4 to remove '.vtt'
        # remove the converted file from the local drive
        os.remove(file)
another option: node-webvtt
nice coding style but no attempt to deal with duplicate cues.
here's a browser code sandbox to play around in: https://frontarm.com/demoboard/?id=344821fa-577d-42ed-939c-8d6468d7685c
another option: @plussub/srt-vtt-parser
It is well written in TypeScript with minimal dependencies, but without diving into the source it is not obvious how to implement de-duplication. Will look at other libs in the meantime.
var srtVttParser = require("@plussub/srt-vtt-parser")
/*
* note, the webvtt files I've been working with begin with the required WEBVTT line but also have two lines of metadata.
* see https://github.com/osk/node-webvtt#metadata for background and code that works properly with it. srt-vtt-parser
* chokes if these are present.
*
* typical header of file from youtube-dl:
*
* WEBVTT
* Kind: captions
* Language: en
*
* <timecode>...
*/
let input = `WEBVTT
00:03:54.333 --> 00:03:55.201 align:start position:0%
TH<00:03:54.366><c>E </c><00:03:54.399><c>SE</c><00:03:54.433><c>RG</c><00:03:54.466><c>EA</c><00:03:54.500><c>NT</c><00:03:54.533><c> A</c><00:03:54.566><c>T </c><00:03:54.600><c>AR</c><00:03:54.633><c>MS</c><00:03:54.666><c>: </c><00:03:54.700><c>MA</c><00:03:54.733><c>DA</c><00:03:54.766><c>M</c><00:03:55.101><c> </c>
00:03:55.201 --> 00:03:55.334 align:start position:0%
THE SERGEANT AT ARMS: MADAM
00:03:55.334 --> 00:03:57.236 align:start position:0%
THE SERGEANT AT ARMS: MADAM
SP<00:03:55.367><c>EA</c><00:03:55.401><c>KE</c><00:03:55.434><c>R,</c><00:03:55.468><c> T</c><00:03:55.501><c>HE</c><00:03:56.102><c> V</c><00:03:56.135><c>IC</c><00:03:56.168><c>E </c><00:03:56.202><c>PR</c><00:03:56.235><c>ES</c><00:03:56.268><c>ID</c><00:03:56.302><c>EN</c><00:03:56.335><c>T </c><00:03:56.368><c>AN</c><00:03:56.402><c>D</c><00:03:57.103><c> </c>
00:03:57.236 --> 00:03:57.369 align:start position:0%
SPEAKER, THE VICE PRESIDENT AND
00:03:57.369 --> 00:07:49.535 align:start position:0%
SPEAKER, THE VICE PRESIDENT AND
TH<00:03:57.403><c>E </c><00:03:57.436><c>UN</c><00:03:57.470><c>IT</c><00:03:57.503><c>ED</c><00:03:57.536><c> S</c><00:03:57.570><c>TA</c><00:03:57.603><c>TE</c><00:03:57.636><c>S </c><00:03:57.670><c>SE</c><00:03:57.703><c>NA</c><00:03:57.736><c>TE</c><00:03:57.770><c>.</c>
00:07:49.535 --> 00:07:50.603 align:start position:0%
TH<00:07:49.568><c>E </c><00:07:49.601><c>SP</c><00:07:49.635><c>EA</c><00:07:49.668><c>KE</c><00:07:49.702><c>R:</c><00:07:49.735><c> T</c><00:07:49.768><c>HE</c><00:07:50.303><c> H</c><00:07:50.336><c>OU</c><00:07:50.369><c>SE</c><00:07:50.403><c> C</c><00:07:50.436><c>OM</c><00:07:50.469><c>ES</c><00:07:50.503><c> </c>
00:07:50.603 --> 00:07:50.736 align:start position:0%
THE SPEAKER: THE HOUSE COMES
00:07:50.736 --> 00:07:54.773 align:start position:0%
THE SPEAKER: THE HOUSE COMES
TO<00:07:50.770><c>UR</c><00:07:50.803><c>ED</c><00:07:50.836><c> F</c><00:07:50.870><c>OR</c><00:07:51.304><c> T</c><00:07:51.337><c>HI</c><00:07:51.370><c>S</c><00:07:54.506><c> I</c><00:07:54.540><c>MP</c><00:07:54.573><c>OR</c><00:07:54.606><c>TA</c><00:07:54.640><c>NT</c><00:07:54.673><c>, </c>
00:07:54.773 --> 00:07:54.907 align:start position:0%
TOURED FOR THIS IMPORTANT,
00:07:54.907 --> 00:07:55.708 align:start position:0%
TOURED FOR THIS IMPORTANT,
HI<00:07:54.940><c>ST</c><00:07:54.973><c>OR</c><00:07:55.007><c>IC</c><00:07:55.475><c> M</c><00:07:55.508><c>EE</c><00:07:55.541><c>TI</c><00:07:55.575><c>NG</c><00:07:55.608><c>.</c>
00:07:55.708 --> 00:07:55.842 align:start position:0%
HISTORIC MEETING.
`
console.log(JSON.stringify(srtVttParser.parse(input), null, 2))
result:
{
"entries": [
{
"id": "",
"from": 234333,
"to": 235201,
"text": "TH<00:03:54.366><c>E </c><00:03:54.399><c>SE</c><00:03:54.433><c>RG</c><00:03:54.466><c>EA</c><00:03:54.500><c>NT</c><00:03:54.533><c> A</c><00:03:54.566><c>T </c><00:03:54.600><c>AR</c><00:03:54.633><c>MS</c><00:03:54.666><c>: </c><00:03:54.700><c>MA</c><00:03:54.733><c>DA</c><00:03:54.766><c>M</c><00:03:55.101><c> </c>"
},
{
"id": "",
"from": 235201,
"to": 235334,
"text": "THE SERGEANT AT ARMS: MADAM "
},
{
"id": "",
"from": 235334,
"to": 237236,
"text": "THE SERGEANT AT ARMS: MADAM \nSP<00:03:55.367><c>EA</c><00:03:55.401><c>KE</c><00:03:55.434><c>R,</c><00:03:55.468><c> T</c><00:03:55.501><c>HE</c><00:03:56.102><c> V</c><00:03:56.135><c>IC</c><00:03:56.168><c>E </c><00:03:56.202><c>PR</c><00:03:56.235><c>ES</c><00:03:56.268><c>ID</c><00:03:56.302><c>EN</c><00:03:56.335><c>T </c><00:03:56.368><c>AN</c><00:03:56.402><c>D</c><00:03:57.103><c> </c>"
},
{
"id": "",
"from": 237236,
"to": 237369,
"text": "SPEAKER, THE VICE PRESIDENT AND "
},
{
"id": "",
"from": 237369,
"to": 469535,
"text": "SPEAKER, THE VICE PRESIDENT AND \nTH<00:03:57.403><c>E </c><00:03:57.436><c>UN</c><00:03:57.470><c>IT</c><00:03:57.503><c>ED</c><00:03:57.536><c> S</c><00:03:57.570><c>TA</c><00:03:57.603><c>TE</c><00:03:57.636><c>S </c><00:03:57.670><c>SE</c><00:03:57.703><c>NA</c><00:03:57.736><c>TE</c><00:03:57.770><c>.</c>"
},
{
"id": "",
"from": 469535,
"to": 470603,
"text": "TH<00:07:49.568><c>E </c><00:07:49.601><c>SP</c><00:07:49.635><c>EA</c><00:07:49.668><c>KE</c><00:07:49.702><c>R:</c><00:07:49.735><c> T</c><00:07:49.768><c>HE</c><00:07:50.303><c> H</c><00:07:50.336><c>OU</c><00:07:50.369><c>SE</c><00:07:50.403><c> C</c><00:07:50.436><c>OM</c><00:07:50.469><c>ES</c><00:07:50.503><c> </c>"
},
{
"id": "",
"from": 470603,
"to": 470736,
"text": "THE SPEAKER: THE HOUSE COMES "
},
{
"id": "",
"from": 470736,
"to": 474773,
"text": "THE SPEAKER: THE HOUSE COMES \nTO<00:07:50.770><c>UR</c><00:07:50.803><c>ED</c><00:07:50.836><c> F</c><00:07:50.870><c>OR</c><00:07:51.304><c> T</c><00:07:51.337><c>HI</c><00:07:51.370><c>S</c><00:07:54.506><c> I</c><00:07:54.540><c>MP</c><00:07:54.573><c>OR</c><00:07:54.606><c>TA</c><00:07:54.640><c>NT</c><00:07:54.673><c>, </c>"
},
{
"id": "",
"from": 474773,
"to": 474907,
"text": "TOURED FOR THIS IMPORTANT, "
},
{
"id": "",
"from": 474907,
"to": 475708,
"text": "TOURED FOR THIS IMPORTANT, \nHI<00:07:54.940><c>ST</c><00:07:54.973><c>OR</c><00:07:55.007><c>IC</c><00:07:55.475><c> M</c><00:07:55.508><c>EE</c><00:07:55.541><c>TI</c><00:07:55.575><c>NG</c><00:07:55.608><c>.</c>"
},
{
"id": "",
"from": 475708,
"to": 475842,
"text": "HISTORIC MEETING."
}
]
}
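
The `from` and `to` values in this result look like plain millisecond offsets of the cue timecodes (e.g. `00:03:54.333` becomes `234333`). Here is a small Python sketch to convert in both directions when cross-checking parser output against the source file. The helper names are my own, and reading the numbers as milliseconds is inferred from the output above rather than from the library docs.

```python
def timecode_to_ms(timecode):
    """Convert 'hh:mm:ss.ttt' (or 'mm:ss.ttt') to integer milliseconds."""
    *hours, minutes, rest = timecode.split(":")
    seconds, millis = rest.split(".")
    total_seconds = (int(hours[0]) if hours else 0) * 3600 + int(minutes) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def ms_to_timecode(ms):
    """Convert integer milliseconds back to 'hh:mm:ss.ttt'."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


print(timecode_to_ms("00:03:54.333"))  # first cue start in the sample above
print(ms_to_timecode(475842))          # last cue end in the sample above
```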
---
## misc notes and research on webvtt format and conventions
Basically, most tools and users assume the file is more like a subtitle file (polished, no duplicate lines, etc.), but
YouTube, C-SPAN, and many other video producers that are more oriented towards producing and sharing live broadcasts
produce less polished "roll-up style live captioning" files that are valid WebVTT but include a lot of repeated lines.
- https://gist.github.com/glasslion/b2fcad16bc8a9630dbd7a945ab5ebf5e
- this script works
- **https://www.reddit.com/r/youtubedl/comments/jvn6jx/how_to_convert_vtt_subtitles_into_human_readable/**
- https://github.com/glut23/webvtt-py
- https://medium.com/@morga046/creating-an-nlp-data-set-from-youtube-subtitles-fb59c0955c2
- https://python-pytube.readthedocs.io/en/latest/user/quickstart.html#subtitle-caption-tracks
- https://www.ccextractor.org/public:gsoc:subtitle_extractor_technical_docs
- https://github.com/jdepoix/youtube-transcript-api#cli
- https://github.com/TimEllis/vttprocessor
- hosted: https://www.lancaster.ac.uk/staff/ellist/vtttocaqdas.html
overlapping cue timing in webvtt
- gets messy in live captioning that builds cues incrementally
- https://github.com/w3c/webvtt/issues/318
- these older pro format conventions follow `CEA608` ?
- https://dvcs.w3.org/hg/text-tracks/raw-file/default/608toVTT/608toVTT.html
- https://github.com/Dash-Industry-Forum/cea608.js/blob/aa3d036106f3f06aaebea57c470b70f238683f11/lib/cea608-towebvtt.js
- pop-on vs paint-on vs roll-up caption modes & fed regulations & webvtt
- https://www.w3.org/community/texttracks/wiki/RollupCaptions
- *videojs/http-streaming: fix: VTTCues with identical time intervals being incorrectly removed*
- https://github.com/videojs/http-streaming/pull/1005
- > We can meet those criteria by only removing cues that have identical time intervals and identical text. This will ensure we remove any cues that overlap VTT segments, while keeping any cues that are actually intended to be displayed at the same time (which we can reasonably assume will have different text)
- **caption-parser: WebVTT De-duping**
- > "A lot of times you aren't using a caption display mechanism that supports multi-line rollup captions. In these situations, you really want to "de-duplicate" the captions by only keeping one line of captions. SubtitleUtil provides a vttToSubtitles convenience method that lets you control whether or not captions are de-duped."
- https://github.com/crowdscriber/caption-parser/#webvtt-de-duping
- SRT vs TTML vs webvtt
- https://mux.com/blog/subtitles-captions-webvtt-hls-and-those-magic-flags/
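
The criterion quoted from the videojs pull request in the notes above (drop a cue only when both its time interval and its text match one already seen) is easy to sketch in Python. This is only an illustration of that rule, not the videojs code; the `Cue` tuple and `drop_identical_cues` are hypothetical names.

```python
from typing import NamedTuple


class Cue(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


def drop_identical_cues(cues):
    """Keep the first occurrence of each (start, end, text) combination, preserving order."""
    seen = set()
    kept = []
    for cue in cues:
        key = (cue.start, cue.end, cue.text)
        if key in seen:
            continue  # same interval and same text: a true duplicate
        seen.add(key)
        kept.append(cue)
    return kept


cues = [
    Cue(1.0, 2.0, "hello"),
    Cue(1.0, 2.0, "hello"),    # duplicate from an overlapping segment
    Cue(1.0, 2.0, "[music]"),  # same interval, different text: kept
]
print(drop_identical_cues(cues))
```

Note that this rule alone does not clean up roll-up captions, because the repeated lines there sit in cues with different time intervals.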
---
https://stackoverflow.com/a/54818581
Another option is to use `youtube-dl`:
youtube-dl --skip-download --write-auto-sub $youtube_url
The default format is `vtt` and the other available format is `ttml` (`--sub-format ttml`).
--write-sub
Write subtitle file
--write-auto-sub
Write automatically generated subtitle file (YouTube only)
--all-subs
Download all the available subtitles of the video
--list-subs
List all available subtitles for the video
--sub-format FORMAT
Subtitle format, accepts formats preference, for example: "srt" or "ass/srt/best"
--sub-lang LANGS
Languages of the subtitles to download (optional) separated by commas, use --list-subs for available language tags
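
If you would rather drive this from Python than from the shell, here is a small sketch that assembles the same command line from the options listed above. The function names and keyword arguments are hypothetical; only the `youtube-dl` flags themselves come from the option list, and `download_subtitles` assumes `youtube-dl` is on your PATH.

```python
import subprocess


def build_subtitle_command(url, auto=True, sub_format="vtt", langs=None, output=None):
    """Return the youtube-dl argument list for fetching subtitles without the video."""
    cmd = ["youtube-dl", "--skip-download"]
    cmd.append("--write-auto-sub" if auto else "--write-sub")
    cmd += ["--sub-format", sub_format]
    if langs:
        cmd += ["--sub-lang", ",".join(langs)]
    if output:
        cmd += ["-o", output]
    cmd += ["--", url]  # "--" keeps a URL or id starting with "-" from being read as an option
    return cmd


def download_subtitles(url, **options):
    """Run youtube-dl with the assembled arguments; raises if it exits with an error."""
    subprocess.run(build_subtitle_command(url, **options), check=True)


print(build_subtitle_command("https://www.youtube.com/watch?v=KzWS7gJX5Z8", langs=["en"]))
```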
You can use `ffmpeg` to convert the subtitle file to another format:
ffmpeg -i input.vtt output.srt
This is what the VTT subtitles look like:
WEBVTT
Kind: captions
Language: en
00:00:01.429 --> 00:00:04.249 align:start position:0%
ladies<00:00:02.429><c> and</c><00:00:02.580><c> gentlemen</c><c.colorE5E5E5><00:00:02.879><c> I'd</c></c><c.colorCCCCCC><00:00:03.870><c> like</c></c><c.colorE5E5E5><00:00:04.020><c> to</c><00:00:04.110><c> thank</c></c>
00:00:04.249 --> 00:00:04.259 align:start position:0%
ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
</c>
00:00:04.259 --> 00:00:05.930 align:start position:0%
ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
you<00:00:04.440><c> for</c><00:00:04.620><c> coming</c><00:00:05.069><c> tonight</c><00:00:05.190><c> especially</c></c><c.colorCCCCCC><00:00:05.609><c> at</c></c>
00:00:05.930 --> 00:00:05.940 align:start position:0%
you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
</c>
00:00:05.940 --> 00:00:07.730 align:start position:0%
you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
such<00:00:06.180><c> short</c><00:00:06.690><c> notice</c></c>
00:00:07.730 --> 00:00:07.740 align:start position:0%
such short notice
00:00:07.740 --> 00:00:09.620 align:start position:0%
such short notice
I'm<00:00:08.370><c> sure</c><c.colorE5E5E5><00:00:08.580><c> mr.</c><00:00:08.820><c> Irving</c><00:00:09.000><c> will</c><00:00:09.120><c> fill</c><00:00:09.300><c> you</c><00:00:09.389><c> in</c><00:00:09.420><c> on</c></c>
00:00:09.620 --> 00:00:09.630 align:start position:0%
I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
</c>
00:00:09.630 --> 00:00:11.030 align:start position:0%
I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
the<00:00:09.750><c> circumstances</c><00:00:10.440><c> that's</c><00:00:10.620><c> brought</c><00:00:10.920><c> us</c></c>
00:00:11.030 --> 00:00:11.040 align:start position:0%
<c.colorE5E5E5>the circumstances that's brought us
</c>
Here are the same subtitles without the part at the top of the file and without tags:
00:00:01.429 --> 00:00:04.249 align:start position:0%
ladies and gentlemen I'd like to thank
00:00:04.249 --> 00:00:04.259 align:start position:0%
ladies and gentlemen I'd like to thank
00:00:04.259 --> 00:00:05.930 align:start position:0%
ladies and gentlemen I'd like to thank
you for coming tonight especially at
00:00:05.930 --> 00:00:05.940 align:start position:0%
you for coming tonight especially at
00:00:05.940 --> 00:00:07.730 align:start position:0%
you for coming tonight especially at
such short notice
00:00:07.730 --> 00:00:07.740 align:start position:0%
such short notice
00:00:07.740 --> 00:00:09.620 align:start position:0%
such short notice
I'm sure mr. Irving will fill you in on
00:00:09.620 --> 00:00:09.630 align:start position:0%
I'm sure mr. Irving will fill you in on
00:00:09.630 --> 00:00:11.030 align:start position:0%
I'm sure mr. Irving will fill you in on
the circumstances that's brought us
You can see that each subtitle text is repeated three times. There is a new subtitle text every eighth line (3rd, 11th, 19th, and 27th).
This converts the VTT subtitles to a simpler format:
sed '1,/^$/d' *.vtt| # remove the part at the top
sed 's/<[^>]*>//g'| # remove tags
awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3' # print each new subtitle text and its start time without milliseconds
This is what the output of the command above looks like:
00:00:01 ladies and gentlemen I'd like to thank
00:00:04 you for coming tonight especially at
00:00:05 such short notice
00:00:07 I'm sure mr. Irving will fill you in on
00:00:09 the circumstances that's brought us
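
For comparison, here is a rough Python equivalent of the `sed`/`awk` pipeline above. It is a sketch that leans on the same assumption as the `awk` step, namely that after the header every new subtitle text sits in the 3rd line of each 8-line group, which holds for the YouTube auto-caption layout shown here but not for WebVTT in general. The name `simplify` and the shortened `SAMPLE` are my own.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")


def simplify(vtt):
    """Return 'hh:mm:ss text' lines, one per 8-line group of the cue body."""
    lines = vtt.splitlines()
    try:
        first_blank = lines.index("")  # like sed '1,/^$/d': header ends at the first empty line
    except ValueError:
        return []
    body = [TAG_RE.sub("", line) for line in lines[first_blank + 1:]]  # like sed 's/<[^>]*>//g'
    out = []
    for i in range(0, len(body) - 2, 8):
        start = body[i].split(".")[0]  # like awk -F. '{print $1}': start time without milliseconds
        out.append(f"{start} {body[i + 2]}")
    return out


# Shortened sample (hypothetical) in the same layout as the file above
SAMPLE = "\n".join([
    "WEBVTT", "Kind: captions", "Language: en", "",
    "00:00:01.429 --> 00:00:04.249 align:start position:0%",
    " ",
    "ladies<00:00:02.429><c> and</c><00:00:02.580><c> gentlemen</c>",
    "",
    "00:00:04.249 --> 00:00:04.259 align:start position:0%",
    "ladies and gentlemen",
    " ",
    "",
    "00:00:04.259 --> 00:00:05.930 align:start position:0%",
    "ladies and gentlemen",
    "you<00:00:04.440><c> for</c><00:00:04.620><c> coming</c>",
    "",
])

print("\n".join(simplify(SAMPLE)))
```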
This prints the closed captions of a video in the simplified format:
`cap()(cd /tmp;rm -f -- *.vtt;youtube-dl --skip-download --write-auto-sub -- "$1";sed '1,/^$/d' -- *.vtt|sed 's/<[^>]*>//g'|awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3')`
The command below downloads the captions of all videos on a channel. When there is an error like `Unable to extract video data`, `-i` (`--ignore-errors`) causes `youtube-dl` to skip the video instead of exiting with an error.
`youtube-dl -i --skip-download --write-auto-sub -o '%(upload_date)s.%(title)s.%(id)s.%(ext)s' https://www.youtube.com/channel/$channelid;for f in *.vtt;do sed '1,/^$/d' "$f"|sed 's/<[^>]*>//g'|awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3'>"${f%.vtt}";done`
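
The `for f in *.vtt` loop at the end of that command can also be written in Python. This sketch only covers the per-file part: it takes any text-to-text converter you like and, like `${f%.vtt}`, writes the result next to the source file with the `.vtt` suffix dropped. `convert_all` and its parameters are hypothetical names.

```python
from pathlib import Path


def convert_all(directory, convert):
    """Apply convert(text) -> text to every .vtt file in directory; return the paths written."""
    written = []
    for vtt_path in sorted(Path(directory).glob("*.vtt")):
        target = vtt_path.with_suffix("")  # drops only the trailing ".vtt"
        text = vtt_path.read_text(encoding="utf-8")
        target.write_text(convert(text), encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_all(".", my_converter)` in the download directory would process every caption file there, where `my_converter` stands for whatever cleanup function you settle on.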