A lot of important government documents are created and saved in Microsoft Word (*.docx). But Microsoft Word is a proprietary format, and it's not really useful for presenting documents on the web. So, I wanted to find a way to convert a .docx file into markdown.
As it turns out, there are several open-source tools that allow for conversion between file types. Pandoc is one of them, and it's powerful. In fact, pandoc's website says "If you need to convert files from one markup format into another, pandoc is your swiss-army knife." But, although pandoc can convert from markdown into .docx, it doesn't work in the other direction.
Then I found unoconv. This little tool takes advantage of OpenOffice's ability to convert a Word document into a bunch of different formats. But, unoconv too has a bit of a downside. Specifically, unoconv tries to keep a lot of the formatting that Word has embedded in a document. The output is, well, messy.
But, by using unconv and pandoc in combination, you can get a pretty clean output. And, the best part is that it retains footnotes and other key syntax (italics, etc.)
Say you have the Council Rules in a Word Document named "test.docx." (For a real-life example, visit http://github.com/vzvenyach/Council_Rules/). Now, you run the following at the command line:
unoconv -f html test.docx
pandoc -f html -t markdown -o test.md test.html
Out is a beautiful markdown file. Admittedly, there's a bit of junk at the top with the Table of Contents. I deleted this when I rendered it nicely with strapdown.js. In the end, here's my nicely rendered version of the Rules.
We are developing word documents in chapters and pushing these to Wiki. We are using word due to it's capability to track changes and add comments and users outside of our GitHub space can contribute. Here's a batch script that will convert the latest version of the document in a folder and automatically push to the Wiki. It uses pandoc and magick. It depends on a variable being set where the images are being stored in GitHub.
This can be viewed in our public Repo: department-of-veterans-affairs/ES-ASG
set aImage="https://github.com/department-of-veterans-affairs/ES-ASG/raw/master/Projects/ES%%20ASG/ES%%20ASG%%20API%%20Playbook%%20Project/Content/01.00%%20ASG_API%%20Playbook_Introduction_Section/media/"
rmdir media
rem get the last .docx file
for /R %%f in (*.docx) do set aFile=%%~nf
rem convert docx to mediawiki and extract images
pandoc --extract-media ./ -t mediawiki -o "%aFile%.mediawiki" "%aFile%.docx"
rem Fix up image URLs
powershell -Command "(gc '%aFile%.mediawiki' -encoding UTF8) -replace '.emf', '.png' | Out-File '%aFile%.mediawiki'" -encoding UTF8
powershell -Command "(gc '%aFile%.mediawiki' -encoding UTF8) -replace '.jpeg', '.png' | Out-File '%aFile%.mediawiki'" -encoding UTF8
powershell -Command "(gc '%aFile%.mediawiki' -encoding UTF8) -replace '.jpg', '.png' | Out-File '%aFile%.mediawiki'" -encoding UTF8
powershell -Command "(gc '%aFile%.mediawiki' -encoding UTF8) -replace '.gif', '.png' | Out-File '%aFile%.mediawiki'" -encoding UTF8
powershell -Command "(gc '%aFile%.mediawiki' -encoding UTF8) -replace '.tmp', '.png' | Out-File '%aFile%.mediawiki'" -encoding UTF8
powershell -Command "(gc '%aFile%.mediawiki' -encoding UTF8) -replace 'File:.//media/', '%aImage%' | Out-File '%aFile%.mediawiki'" -encoding UTF8
rem housekeeping and move Wiki for publishing
del *.bak
copy "%aFile%.mediawiki" "C:\GitHub\ES-ASG.wiki"
rem convert all image files to .png
cd media
for /R %%f in (*.emf) do (
magick %%~nf.emf %%~nf.png
)
for /R %%f in (*.jpeg) do (
magick %%~nf.jpeg %%~nf.png
)
for /R %%f in (*.jpg) do (
magick %%~nf.jpg %%~nf.png
)
for /R %%f in (*.tmp) do (
magick %%~nf.tmp %%~nf.png
)
for /R %%f in (*.gif) do (
magick %%~nf.gif %%~nf.png
)
rem push to GitHub Repo
git add -f --all
git commit -m "Publish"
git push --all
cd "C:\GitHub\ES-ASG.wiki"
git add -f --all
git commit -m "Publish"
git push --all
pause