Last active
September 4, 2020 15:44
Quick and dirty regexp that we use to format video transcripts (CoffeeScript)
#!/usr/bin/env coffee
fs = require 'fs'
# Get the file name that’s passed as the first argument
nameOfFile = process.argv.slice(2)[0]
# Read the file and convert it from bytes to a string
file = fs.readFileSync(nameOfFile).toString()
# Strip double empty lines (causes problems for later replacements otherwise)
file = file.replace(/\n\n/gm, '\n')
# Replace lines starting with Speaker: with the h3 header
file = file.replace(/^([^\ \(]*:)/gm, "<h3 class='transcript--name'>$1</h3>")
# The negated character class above also matches newlines, so a blank line before a name ends up inside the <h3>. Remove that newline.
file = file.replace(/<h3 class='transcript--name'>\n/gm, "<h3 class='transcript--name'>")
# Wrap all lines in <p> tags
file = file.replace(/^(.*)$/gm, "<p>$1</p>")
# Remove the <p> tags from lines that begin with the <h3> and make it prettier by adding an empty line before the <h3>s.
file = file.replace(/^<p><h3 class='transcript--name'>(.*)<\/h3>(.*)<\/p>/gm, "\n<h3 class='transcript--name'>$1</h3>\n\n<p>$2</p>")
# Collapse double spaces after sentence-ending punctuation, keeping the original mark (the capture group stops ? and ! from turning into a period). Not necessary for HTML, but easy to do and it looks better in the source.
file = file.replace(/([.?!])  /gm, '$1 ')
# Replace ellipsis+dot pairs (found one in Cole’s)
file = file.replace(/…\./gm, '… ')
# Replace ellipses with the HTML entity code
file = file.replace(/…/gm, '&hellip;')
# Wrap (applause) in <em>s
file = file.replace(/\(applause\)/g, '<em>(applause)</em>')
# Remove empty paragraph tags if any (<p></p>)
file = file.replace(/<p><\/p>/gm, '')
# Create the new file name by replacing the .txt extension with .html (anchored so that only a trailing .txt is replaced)
newNameOfFile = nameOfFile.replace(/\.txt$/, '.html')
# Create and write the updated file string to the new file
fs.writeFileSync(newNameOfFile, file)
# Alert the user that we’re done
console.log('Formatted transcript: ' + newNameOfFile)
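
To try the replacements without touching the file system, the same sequence can be wrapped in a function and run on a short string. This is only a sketch: `formatTranscript` and the sample text are hypothetical and not part of the original script. It also makes two assumptions about intent: that a double space after `?` or `!` should keep its punctuation mark, and that ellipses are meant to become the `&hellip;` entity.

```coffeescript
# Hypothetical helper, not part of the original gist: the same replacements,
# applied to a string instead of a file.
formatTranscript = (text) ->
  # Collapse double newlines into single ones
  text = text.replace(/\n\n/g, '\n')
  # Wrap a leading "Speaker:" in an <h3>
  text = text.replace(/^([^\ \(]*:)/gm, "<h3 class='transcript--name'>$1</h3>")
  # Drop a newline captured right after the opening <h3>
  text = text.replace(/<h3 class='transcript--name'>\n/g, "<h3 class='transcript--name'>")
  # Wrap every line in <p> tags
  text = text.replace(/^(.*)$/gm, "<p>$1</p>")
  # Move each <h3> out of its paragraph and add blank lines around it
  text = text.replace(/^<p><h3 class='transcript--name'>(.*)<\/h3>(.*)<\/p>/gm, "\n<h3 class='transcript--name'>$1</h3>\n\n<p>$2</p>")
  # Assumption: keep the original punctuation mark when collapsing double spaces
  text = text.replace(/([.?!])  /g, '$1 ')
  # Replace ellipsis+dot pairs
  text = text.replace(/…\./g, '… ')
  # Assumption: the intended replacement is the &hellip; entity
  text = text.replace(/…/g, '&hellip;')
  # Wrap (applause) in <em>s
  text = text.replace(/\(applause\)/g, '<em>(applause)</em>')
  # Remove empty paragraphs and return the result
  text.replace(/<p><\/p>/g, '')

# Made-up sample transcript, for illustration only
sample = "Aral: Hello there.  How are you?  Good.\n(applause)\n\nLaura: Fine… thanks."
html = formatTranscript sample
console.log html
```

Keeping the steps in the same order as the script matters: the `<h3>` wrapping has to happen before the `<p>` wrapping, because the later step that lifts the heading out of its paragraph looks for a line starting with `<p><h3 class='transcript--name'>`.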