@earth3300
Forked from kkappel/odf2html.sh
Created June 5, 2019 23:18
Export all LibreOffice (.odt) or MS Word (.doc) files to HTML, using OpenOffice or LibreOffice and tidy.
#!/bin/bash
# odf2html.sh - export all odt or doc files to html with a shell script
# Inspired by Marco Fioretti,
# who wrote this Article in 2012: http://www.techrepublic.com/blog/linux-and-open-source/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/
# Klaus Kappel
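CONFIG=tidy.conf
# Usage: odf2html.sh SOURCE_DIR OUTPUT_DIR
# first clean the dir, if script runs in cron job
# rm -rf "$2"
mkdir -p "$2"
# Group the -name tests so that -type f applies to both patterns;
# -print0 with read -d '' keeps paths that contain spaces intact.
find "$1" -type f \( -name "*.doc" -o -name "*.odt" \) -print0 |
while IFS= read -r -d '' F
do
    BASE=$(basename "$F" .doc) ; BASE=$(basename "$BASE" .odt)
    libreoffice --headless --convert-to html --outdir "$2" "$F"
    # LibreOffice names its output $BASE.html (not .htm), so tidy reads that file
    # and the cleaned result goes to a temporary file before replacing it.
    tidy -q -config "$CONFIG" -f "$2/$BASE.err" -i "$2/$BASE.html" | sed 's/ class="c[0-9]*"//g' > "$2/$BASE.tmp"
    mv "$2/$BASE.tmp" "$2/$BASE.html"
done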
# tidy.conf - cleans up LibreOffice HTML output
clean: yes
drop-proprietary-attributes: yes
drop-empty-paras: yes
output-html: yes
input-encoding: utf8
output-encoding: utf8
join-classes: yes
join-styles: yes
show-body-only: yes
force-output: yes
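
The last step of the script pipes tidy's output through sed to drop the numbered class attributes (c1, c2, ...) that tidy's clean option generates. With show-body-only set, the matching style rules are not emitted, so those attributes serve no purpose. A minimal sketch of that filter on its own, assuming a hypothetical helper name strip_tidy_classes that is not part of the gist:

```shell
#!/bin/bash
# strip_tidy_classes is a hypothetical name used for illustration only.
# It removes attributes of the form class="c<digits>" and leaves
# every other class attribute untouched.
strip_tidy_classes() {
    sed 's/ class="c[0-9]*"//g'
}

# Sample input resembling tidy output with generated classes.
printf '%s\n' '<p class="c1">Hello <span class="c12">world</span></p>' | strip_tidy_classes
```

Classes that do not follow the c-plus-digits pattern, such as class="note", pass through unchanged, so the filter should only affect markup that tidy generated.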