Created March 21, 2017 17:45
Awk script for splitting a dump of Wikipedia articles, separated by lines of five equal signs, into one file per article
# Articles look like:
# Title
# Article text...
# ....
# =====
# FS="\n": set the field separator to a newline, so every line of an article is a field. Each record
# after a separator line begins with that line's trailing newline, so $1 is empty and the title is $2.
# RS="=====": set the record separator to five equal signs. A multi-character RS needs an awk such as
# gawk or mawk. Both separators are set in BEGIN so that they also apply to the first record.
# gsub("/", "_", title): replace all forward slashes in the title with underscores, since a file name
# cannot contain a slash. This is done on a copy of $2, because changing $2 would rebuild $0 with
# spaces in place of the newlines.
# The if statement checks whether there is already a file with the title we're looking at, and skips
# the record if there is. Records without a title are skipped as well.
# The print statement redirects all the output for a particular record to <title>.txt. awk accepts any
# string expression as the redirection target, so the .txt suffix is a convenience, not a requirement.
# We also print the article title for monitoring.
cat ../out |
awk 'BEGIN {FS="\n"; RS="====="}
{
  title = $2; gsub("/", "_", title)
  if (title != "" && system("[ -e \"" title ".txt\" ]") != 0) {print > (title ".txt"); print title}
}'
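To see how the two separators carve the input into records and fields, here is a minimal sketch that only prints the file names the script would use, without writing any files. The sample titles and text are made up, and the sample deliberately starts with a separator line so that every title lands in $2; that is an assumption about the layout of ../out, not something the script itself guarantees. It also assumes an awk that accepts a multi-character RS, such as gawk or mawk.

```shell
# Made-up sample input in the layout described above (hypothetical titles).
# It starts with a separator line so that every title is field $2.
sample=$(printf '%s\n' \
  '=====' 'First Title' 'first body line' \
  '=====' 'AC/DC' 'second body line' \
  '=====')

# Same separators as the script: records split on "=====", fields on newlines.
titles=$(printf '%s\n' "$sample" | awk '
BEGIN { FS = "\n"; RS = "=====" }
$2 != "" {                   # skip the empty records around the separators
    title = $2               # work on a copy so $0 is left untouched
    gsub("/", "_", title)    # a slash cannot appear in a file name
    print title ".txt"
}')

printf '%s\n' "$titles"
```

With gawk or mawk this should print First Title.txt and AC_DC.txt, one per line. The records before the first separator and after the last one have an empty $2, which is why the pattern filters them out.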