@plembo
Last active September 12, 2025 13:07
Convert docx to markdown with pandoc

Convert Word documents to markdown with pandoc

I use pandoc to convert masses of Word documents to markdown. Still working on a generic script, but for now here's the "gist" of what I type into the terminal:

$ myfilename="example"
$ pandoc \
-t markdown_strict \
--extract-media="./attachments/$myfilename" \
"$myfilename.docx" \
-o "$myfilename.md"

Pandoc's own markdown flavor is nice, but when converting Word documents it often adds odd extra markup in translation. Use markdown_strict as the output format to avoid that.

I try to organize media (images, etc.) embedded in documents under an attachments subdirectory, with one folder named for each file. This helps avoid "collisions" between media file names and makes conversion from markdown into other formats (HTML, PDF) less messy.
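As a rough sketch of how the command and the per-file attachments layout could be wrapped for reuse, the two bash functions below derive the media folder from the document name and then call pandoc with the same options. The names media_dir_for and docx2md are made up for this illustration and are not part of pandoc; the sketch also assumes the input path ends in .docx.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (media_dir_for, docx2md); not part of pandoc.

# Print the per-document media directory for a .docx path,
# e.g. "notes/example.docx" -> "./attachments/example".
media_dir_for() {
  local base
  base="$(basename "$1" .docx)"
  printf './attachments/%s\n' "$base"
}

# Convert one .docx to strict markdown, writing the .md next to the source
# file and extracting embedded media into that document's own folder.
docx2md() {
  local docx="$1"
  pandoc -t markdown_strict \
    --extract-media="$(media_dir_for "$docx")" \
    "$docx" \
    -o "${docx%.docx}.md"
}
```

After sourcing the file, `docx2md example.docx` should behave like the command above with myfilename set to example. Two documents with the same base name in different directories would share a media folder, so this only suits flat collections.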

@mattman-ps

Nice. Thanks for the tips ;-)

@Mjboothaus

Thanks

@iambumblehead

works perfectly

@mrtngrsbch

cool, nice gist !

@STrRedWolf

This helps get me 90% of the way there. I use a mix of 'markdown+bracketed_spans+backtick_code_blocks+fenced_code_attributes+fenced_divs' but I have to manually re-add the []{custom-style="foobar"} code as well as the horizontal lines... well, close...

@brucegl

brucegl commented Jun 27, 2025

beautiful!

@hochun836

thanks a lot !

@amalytix

amalytix commented Aug 25, 2025

In case someone needs to process multiple files in a directory called input-files, this helped me:

#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'

BASE_DIR="input-files"
ATTACH_ROOT="attachments"

# Ensure base attachment dir exists
mkdir -p "$ATTACH_ROOT"

# Find all .docx files under BASE_DIR (recursively), handling spaces safely
find "$BASE_DIR" -type f -name '*.docx' -print0 | while IFS= read -r -d '' docx; do
  # Relative path (without BASE_DIR/ prefix if present)
  rel="$docx"
  case "$rel" in
    "$BASE_DIR"/*) rel="${rel#"$BASE_DIR"/}" ;;
  esac

  # Strip extension
  rel_noext="${rel%.*}"

  # Build a filesystem-safe prefix from the relative path:
  # - lowercase
  # - replace any non [a-z0-9] with '-'
  # - collapse multiple '-' and trim leading/trailing '-'
  prefix="$(printf '%s' "$rel_noext" \
            | tr '[:upper:]' '[:lower:]' \
            | sed -E 's/[^a-z0-9]+/-/g; s/-+/-/g; s/^-+//; s/-+$//')"

  media_dir="$ATTACH_ROOT/$prefix"
  mkdir -p "$media_dir"

  # Output Markdown path: same folder as the .docx, same basename with .md.
  # Replace spaces in the filename only, so the directory part stays valid.
  md_dir="$(dirname "$docx")"
  md_base="$(basename "${docx%.*}")"
  md_out="$md_dir/${md_base// /-}.md"

  echo "Converting: $docx"
  echo "  -> Markdown: $md_out"
  echo "  -> Media:    $media_dir"

  pandoc -t markdown_strict \
    --extract-media="$media_dir" \
    "$docx" \
    -o "$md_out"
done

echo "Done."
  1. Save as convert-docx.sh
  2. Make executable: chmod +x convert-docx.sh
  3. Run from the directory that contains input-files: ./convert-docx.sh
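
To see what the prefix step in the script does to a given path, the same tr and sed pipeline can be pulled out into a small function and tried on its own. This is only an illustration; slugify is a made-up name and does not appear in the script.

```shell
#!/usr/bin/env bash
# slugify is a hypothetical name for this sketch; the pipeline is the one
# the script uses to build its media-folder prefix.
slugify() {
  # Lowercase, turn every run of characters outside a-z0-9 into a single '-',
  # then trim any leading or trailing '-'.
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/-+/-/g; s/^-+//; s/-+$//'
}
```

For example, `slugify 'Reports/Q1 Summary'` prints reports-q1-summary. Because distinct names can collapse to the same prefix ("a b" and "a-b" both become a-b), it may be worth checking for clashes in large collections.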

@plembo
Author

plembo commented Aug 27, 2025

Thanks for this!
