Last active
August 26, 2022 10:22
Hacky little script to convert plain text transcripts that have timestamps at the beginning of lines into documents that can be imported into oTranscribe (https://otranscribe.com/)
""" | |
hacky little script to convert plain text transcripts | |
that have timestamps in the beginning of lines | |
into documents that can be imported to otranscribe | |
The file format it takes looks like: | |
1:30 I mean yeah uhm | |
1:32 I dunno | |
Or, more tech-y, the following regex needs to match each line: ^(\d+):(\d+)(.*)$ | |
where (\d+) is minutes (\d+) is seconds and (.*)$ is the rest of the line. | |
""" | |
import re
import sys

# read the transcript given as the first command line argument
with open(sys.argv[1], 'r') as f:
    lines = f.readlines()

def convertLineToOTR(line):
    # groups: minutes, seconds, rest of the line
    timeMatch = re.search(r"^(\d+):(\d+)(.*)$", line)
    minutes = int(timeMatch.group(1))
    seconds = int(timeMatch.group(2))
    timestring = f"{minutes}:{seconds:02d}"  # zero-pad seconds so 1:05 does not become 1:5
    text = timeMatch.group(3)
    allSeconds = (minutes*60)+seconds
    # double escape because a bare " would break the JSON, so we need \" in the output
    return f'<p><span class=\\"timestamp\\" data-timestamp=\\"{allSeconds}\\">{timestring}</span>{text}<br/></p>'

new_lines = list(map(convertLineToOTR, lines))  # list() because map returns an iterator, not a list

# how to build a JSON the terrible way
header = ['{"text":"']
footer = ['","media":"please reload file","media-time":"000"}']
fullList = header + new_lines + footer
newFile = "".join(fullList)
sys.stdout.write(newFile)
If your files start somewhat differently, e.g. with "(12:34) text text…" instead of "12:34 text text…", you need to change the regex (^(\d+):(\d+)(.*)$). Use https://www.regexpal.com/ or some other tool to test the new regex, so you do not need to run the script again and again.
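For the "(12:34) text text…" case mentioned above, one possible adjustment is to make the parentheses optional. This is only a sketch: the names PAREN_PATTERN and split_line are made up for illustration and are not part of the script, and returning None for lines without a timestamp is my assumption about what you would want.

```python
import re

# Hypothetical variant of the script's regex: optional "(" before and ")" after mm:ss.
PAREN_PATTERN = re.compile(r"^\(?(\d+):(\d+)\)?(.*)$")


def split_line(line):
    """Return (minutes, seconds, rest_of_line), or None if there is no timestamp."""
    match = PAREN_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


if __name__ == "__main__":
    # Try the pattern on a few sample lines before running the whole script.
    for sample in ["(12:34) text text", "12:34 text text", "no timestamp here"]:
        print(repr(sample), "->", split_line(sample))
```

If this fits your files, the same pattern string could replace the one inside convertLineToOTR.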
If there is interest in this, I could invest an hour to make it work in JavaScript, so you could convert using a website rather than a Python script.
No, this is not a good example of how to create JSON (use https://docs.python.org/3/library/json.html instead)
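For comparison, here is a rough sketch of the same output built with the json module, which takes care of escaping quotes and backslashes in the transcript text. The function names line_to_html and build_otr are invented for this example, and skipping lines without a timestamp is an assumption (the script above would raise an error on them instead). The keys and values are the ones the script writes.

```python
import json
import re

TIMESTAMP_LINE = re.compile(r"^(\d+):(\d+)(.*)$")


def line_to_html(line):
    """Turn one 'm:ss text' line into the paragraph markup used by the script."""
    match = TIMESTAMP_LINE.search(line)
    if match is None:
        return ""  # assumption: silently skip lines without a timestamp
    minutes, seconds = int(match.group(1)), int(match.group(2))
    text = match.group(3)
    all_seconds = minutes * 60 + seconds
    # No manual escaping here: json.dumps escapes the quotes later.
    return (f'<p><span class="timestamp" data-timestamp="{all_seconds}">'
            f'{minutes}:{seconds:02d}</span>{text}<br/></p>')


def build_otr(lines):
    """Build the JSON document with the same three keys as the script."""
    document = {
        "text": "".join(line_to_html(line) for line in lines),
        "media": "please reload file",
        "media-time": "000",
    }
    return json.dumps(document)


if __name__ == "__main__":
    print(build_otr(['1:30 I mean yeah uhm\n', '1:32 I said "dunno"\n']))
```

The main gain is that a line containing a double quote no longer produces broken JSON.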