This Python script processes `.vtt` subtitle files downloaded with `yt-dlp` from YouTube or similar platforms. It merges overlapping subtitle segments and cleans the text by removing excess whitespace. The processed subtitles are written to a new text file with a timestamped filename.
- Subtitle Merging: Combines multiple subtitle entries into a single entry, considering overlaps.
- Text Cleaning: Cleans subtitle text by replacing newline characters and reducing multiple spaces to a single space.
- Output: Generates a cleaned and merged text file for each `.vtt` file in the specified directory.
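
The merging and cleaning steps described above could look roughly like the sketch below. It is not the script's actual code: the function names `clean_text` and `merge_overlapping` are hypothetical, and it assumes that "overlap" means the end of one entry is repeated at the start of the next, as often happens in auto-generated captions.

```python
import re


def clean_text(text: str) -> str:
    """Replace newlines with spaces and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.replace("\n", " ")).strip()


def merge_overlapping(entries: list[str]) -> str:
    """Join cleaned entries, dropping words repeated at each seam.

    Assumption: an overlap is a run of words that ends one entry
    and also begins the next one.
    """
    merged: list[str] = []
    for entry in entries:
        words = clean_text(entry).split()
        # Find the longest suffix of `merged` that is a prefix of `words`.
        overlap = 0
        for size in range(min(len(merged), len(words)), 0, -1):
            if merged[-size:] == words[:size]:
                overlap = size
                break
        merged.extend(words[overlap:])
    return " ".join(merged)
```

For example, the entries `"hello  world"`, `"world\nthis is"` and `"this is a test"` would merge into `"hello world this is a test"`. Matching on whole words keeps the sketch simple; a script that merges by cue timestamps instead would need the start and end times of each entry.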
- Python 3
- `webvtt-py` library for parsing `.vtt` files
- Obtain Subtitles: First, download the subtitles using `yt-dlp` with one of the following commands:
  - For automatic subtitles (machine-generated): `yt-dlp --write-auto-sub --sub-lang ru --skip-download YOUR_URL`
  - For manual subtitles (provided by the uploader): `yt-dlp --write-subs --sub-lang ru --skip-download YOUR_URL`
- Script Execution: Place the script in a directory one level above the directory containing the `.vtt` files (or modify the `path` variable as needed), then run the script.
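
As a rough illustration of how the `path` variable might be used, the sketch below collects the `.vtt` files in a directory and reads their captions with `webvtt-py`. The directory name `subtitles` and the helper names `find_vtt_files` and `read_caption_texts` are assumptions made for this example, not taken from the script.

```python
from pathlib import Path


def find_vtt_files(path):
    """Return the .vtt files directly inside `path`, sorted by name."""
    return sorted(Path(path).glob("*.vtt"))


def read_caption_texts(vtt_file):
    """Return the raw text of each caption in one .vtt file."""
    import webvtt  # third-party: provided by the webvtt-py package

    return [caption.text for caption in webvtt.read(str(vtt_file))]


if __name__ == "__main__":
    path = Path("subtitles")  # hypothetical directory name; adjust as needed
    for vtt_file in find_vtt_files(path):
        print(vtt_file.name, len(read_caption_texts(vtt_file)), "captions")
```

Because `path` is relative, it resolves against the directory the script is run from, which is why the script is expected to sit one level above the subtitle directory.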
The script writes the processed subtitles to a new file named with the pattern `content_YYYY-MM-DD-HH-MM-SS.txt`, where the timestamp corresponds to the script execution time.
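
A filename following this pattern can be built with `datetime.strftime`. The snippet below is a minimal sketch; the function name `output_filename` is hypothetical, and the script itself may construct the name differently.

```python
from datetime import datetime


def output_filename(now=None):
    """Build a name like content_YYYY-MM-DD-HH-MM-SS.txt.

    `now` defaults to the current local time; passing it explicitly
    makes the function easy to check.
    """
    if now is None:
        now = datetime.now()
    return now.strftime("content_%Y-%m-%d-%H-%M-%S.txt")
```

Calling `output_filename(datetime(2024, 1, 2, 3, 4, 5))` returns `content_2024-01-02-03-04-05.txt`.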
Errors encountered while processing `.vtt` files (such as malformed files) are logged to the console.
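
One way to get this behavior is to wrap the per-file work in a `try`/`except` so that a single malformed file does not stop the run. The sketch below assumes the standard `logging` module and a hypothetical `process_one` callable; the script may simply use `print` instead.

```python
import logging
import sys

# Send log messages to the console (stderr).
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format="%(levelname)s: %(message)s",
)
logger = logging.getLogger("vtt_processing")  # hypothetical logger name


def process_all(files, process_one):
    """Apply `process_one` to each file, logging failures instead of aborting.

    Returns a dict of successful results and a list of files that failed.
    """
    results = {}
    failed = []
    for file in files:
        try:
            results[file] = process_one(file)
        except Exception as exc:  # broad on purpose: any bad file is skipped
            logger.error("Could not process %s: %s", file, exc)
            failed.append(file)
    return results, failed
```

Returning the list of failed files alongside the results lets the caller report a summary at the end of the run.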