@JockDaRock
Forked from geekbleek/docx2md.md
Last active September 12, 2024 12:11
Convert a Word Document into MD

Converting a Word Document to Markdown in One Move

The Problem

A lot of important government documents are created and saved in Microsoft Word (*.docx). But .docx is a format built around Microsoft Word, and it's not well suited to presenting documents on the web. So I wanted to find a way to convert a .docx file into Markdown.

Installing Pandoc

On a Mac, you can install pandoc with Homebrew by running the command brew install pandoc.
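To confirm the install worked, a quick check along these lines can help. This is a minimal sketch rather than part of the original gist: the helper name have_command is made up for illustration, and the only pandoc feature it relies on is the standard --version flag.

```shell
#!/bin/bash
# Hypothetical helper (not from the original gist): succeeds if the named
# command can be found on the PATH.
have_command() {
  command -v "$1" > /dev/null 2>&1
}

if have_command pandoc; then
  # --version prints version details; keep only the first line.
  pandoc --version | head -n 1
else
  echo "pandoc not found; try 'brew install pandoc' (Mac) or 'choco install pandoc' (Windows)"
fi
```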

The Solution

As it turns out, there are several open-source tools that allow for conversion between file types. Pandoc is one of them, and it's powerful. In fact, pandoc's website says "If you need to convert files from one markup format into another, pandoc is your swiss-army knife." Pandoc can convert from markdown into .docx, and it also works in the other direction.
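As a rough illustration of converting in both directions, the sketch below picks pandoc's input and output formats from the file extensions. The helper names pandoc_format_for and pandoc_convert are hypothetical, and mapping .md to the gfm format is an assumption chosen to match the script later in this gist; -f, -t, and -o are pandoc's standard from, to, and output options.

```shell
#!/bin/bash
# Hypothetical helper: map a file name to the pandoc format name used in this gist.
pandoc_format_for() {
  case "$1" in
    *.docx) echo "docx" ;;
    *.md)   echo "gfm" ;;   # assumption: treat .md as GitHub Flavored Markdown
    *)      echo "unknown format for '$1'" >&2; return 1 ;;
  esac
}

# Hypothetical helper: convert file $1 into file $2, choosing -f/-t from the extensions.
pandoc_convert() {
  from=$(pandoc_format_for "$1") || return 1
  to=$(pandoc_format_for "$2") || return 1
  pandoc -f "$from" -t "$to" -o "$2" "$1"
}
```

With these defined, pandoc_convert notes.md notes.docx would go from Markdown to Word, and pandoc_convert notes.docx notes.md would go the other way (notes.md and notes.docx are placeholder file names).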

Example

The bash script below takes an existing .docx file, converts it to Markdown, exports all media in the Word document to a subfolder, and updates the Markdown links to point to those relative paths. It also writes GitHub Flavored Markdown (pandoc's gfm output format), for use with GitHub or Cisco DevNet PubHub publishing tools.

To use:

  1. Install pandoc with brew install pandoc on a Mac, or choco install pandoc on Windows. Windows users who need help installing the Chocolatey package manager can follow this guide: https://medium.com/@JockDaRock/installing-the-chocolatey-package-manager-for-windows-3b1bdd0dbb49.
  2. Download the bash script below.
  3. Run it as ./docx2md.sh filename. Do not pass the file name extension, and make sure the .docx file is in the same folder as the script.
#!/bin/bash
#
# generate a Markdown version of a word document. Goes in separate folder, since
# images are extracted and converted as well (separate folder avoids naming clashes).
#
# REQUIREMENTS: pandoc
#
#
# with pandoc
# --extract-media=[media folder]
#
# USAGE:
#
# docx2md.sh filename #(no extension)
#
# This generates the converted file in a subfolder named after the document.
# For example, a file `contract.docx`
#
# generates:
# ```
# contract
# ├── contract.md
# └── media
#     ├── image1.png
#     ├── image2.png
#     └── image3.png
# ```
#
# Author: Jesper Rønn-Jensen 2015-11-30
# https://gist.github.com/jesperronn/ff5764274b3642bc7f2f
# Inspired by https://gist.github.com/aembleton/1eb889bc443996a508df
#
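# Abort early if pandoc is not on the PATH.
if ! command -v pandoc > /dev/null 2>&1; then
  echo "FATAL missing pandoc. You can install with 'brew install pandoc' or similar" >&2
  exit 9
fi

# The base file name (without extension) is required as the first argument.
if [ -z "$1" ]; then
  echo "Usage:"
  echo ""
  echo "  docx2md.sh [filename-no-extension]"
  exit 13
fi

if [ ! -f "$1.docx" ]; then
  echo "FATAL missing file '$1.docx'" >&2
  exit 11
fi

# Work inside a folder named after the document so extracted media cannot clash.
mkdir -p "$1"
cd "$1" || exit 1
pandoc -f docx -t gfm --extract-media="." -o "$1.md" "../$1.docx"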
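
Because the script expects the name without its extension, a small wrapper can strip .docx first, which also makes batch conversion straightforward. The sketch below is not part of the original gist: strip_docx_ext and convert_all are hypothetical names, and it assumes docx2md.sh sits in the current folder and is executable.

```shell
#!/bin/bash
# Hypothetical helper: drop a trailing .docx so the name suits docx2md.sh.
strip_docx_ext() {
  printf '%s\n' "${1%.docx}"
}

# Hypothetical helper: run docx2md.sh once for every .docx file in the current folder.
convert_all() {
  for f in ./*.docx; do
    [ -f "$f" ] || continue   # no matches: the glob stays literal, so skip it
    ./docx2md.sh "$(strip_docx_ext "${f#./}")"
  done
}
```

Calling convert_all from the folder that holds both the script and the Word files would then produce one subfolder per document.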

ashraf21c commented Oct 20, 2023

Hello, thank you, but the Pandoc link for Windows does not work.
This is the new link: https://github.com/jgm/pandoc/releases/latest
Or for information: https://pandoc.org/installing.html

In addition, there's an update: Pandoc now supports Linux too, and its version can be downloaded from the link mentioned above.