@hanigamal
Created June 16, 2014 01:59
#!/bin/sh
filename="$1"
extension=${filename##*.}
if [ "$extension" = "doc" ]
then
# Use catdoc for the older binary .doc format
catdoc "$filename"
else
# Unzip the docx file, and grab just the text with sed
# This also replaces opening <w:p> (paragraph) tags with newlines
# The final `sed G` double-spaces the output
unzip -p "$filename" | grep '<w:r' | sed 's/<w:p[^<\/]*>/ \
/g' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$' | sed G
fi