Export all LibreOffice or MS Word files to HTML; uses OpenOffice or LibreOffice and tidy.
#!/bin/bash
# odf2html.sh - export all .odt or .doc files to HTML by shell script
# Usage: odf2html.sh SOURCE_DIR TARGET_DIR
# Inspired by Marco Fioretti,
# who wrote this article in 2012: http://www.techrepublic.com/blog/linux-and-open-source/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/
# Klaus Kappel
CONFIG=tidy.conf
# first clean the dir, if the script runs as a cron job
# rm -rf "$2"
mkdir -p "$2"
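# group the -name tests so -type f applies to both; NUL-delimited names keep paths with spaces intact
find "$1" -type f \( -name "*.doc" -o -name "*.odt" \) -print0 |
while IFS= read -r -d '' F
do
  BASE=$(basename "$F" .doc) ; BASE=$(basename "$BASE" .odt)
  libreoffice --headless --convert-to html --outdir "$2" "$F"
  # LibreOffice names its export $BASE.html; keep that raw file as $BASE.htm so tidy can read it
  mv "$2/$BASE.html" "$2/$BASE.htm"
  tidy -q -config "$CONFIG" -f "$2/$BASE.err" -i "$2/$BASE.htm" | sed 's/ class="c[0-9]*"//g' > "$2/$BASE.html"
done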
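
The file selection, base-name and class-stripping steps can be tried in isolation, without LibreOffice or tidy installed. The following is a minimal sketch on a throwaway directory; the file names (`a.doc`, `b.odt`, `notes.txt`, `dir.odt`) and the variables `WORK`, `FOUND`, `BASES` and `CLEAN` are assumptions made for illustration, and the parentheses around the `-name` tests are this sketch's choice so that `-type f` applies to both patterns.

```shell
#!/bin/bash
# Sketch only: exercises file selection, base-name stripping and the sed
# filter without LibreOffice or tidy. All paths are throwaway examples.
set -eu
WORK=$(mktemp -d)
mkdir -p "$WORK/src/sub"
touch "$WORK/src/a.doc" "$WORK/src/sub/b.odt" "$WORK/src/notes.txt"
mkdir "$WORK/src/dir.odt"   # a directory that merely looks like a document

# Parentheses group the two -name tests so -type f applies to both.
FOUND=$(find "$WORK/src" -type f \( -name "*.doc" -o -name "*.odt" \) | sort)

# Strip either extension with two basename calls.
BASES=""
for F in $FOUND
do
  BASE=$(basename "$F" .doc) ; BASE=$(basename "$BASE" .odt)
  BASES="$BASES $BASE"
done

# The sed filter removes generated class attributes such as class="c12".
CLEAN=$(printf '<p class="c12">Hi</p>\n' | sed 's/ class="c[0-9]*"//g')

echo "bases:$BASES"
echo "clean: $CLEAN"
rm -rf "$WORK"
```

Run as is, this prints `bases: a b` and `clean: <p>Hi</p>`: the text file and the directory named `dir.odt` are both skipped, and the `class` attribute is dropped while the rest of the tag is left alone.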
# tidy.conf to clean LibreOffice HTML output
clean: yes
drop-proprietary-attributes: yes
drop-empty-paras: yes
output-html: yes
input-encoding: utf8
output-encoding: utf8
join-classes: yes
join-styles: yes
show-body-only: yes
force-output: yes
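
To see what a few of these options do before running a whole conversion, tidy can be pointed at a small hand-written file. This is a hedged sketch, not part of the original gist: the sample markup, the reduced config and the `WORK` directory are assumptions, and the script skips the tidy call when tidy is not installed.

```shell
#!/bin/bash
# Sketch only: runs tidy with a reduced config on a sample document.
# The sample markup and the WORK directory are illustrative assumptions.
set -u
WORK=$(mktemp -d)

# A subset of the options from tidy.conf above.
cat > "$WORK/tidy.conf" <<'EOF'
drop-empty-paras: yes
show-body-only: yes
force-output: yes
input-encoding: utf8
output-encoding: utf8
EOF

printf '<html><head><title>t</title></head><body><p>Hello</p><p></p></body></html>\n' > "$WORK/sample.htm"

if command -v tidy >/dev/null 2>&1
then
  # tidy exits non-zero when it reports warnings, so the status is not treated as fatal
  tidy -q -config "$WORK/tidy.conf" -f "$WORK/sample.err" "$WORK/sample.htm" > "$WORK/sample.html" || true
  cat "$WORK/sample.html"
else
  echo "tidy not found, skipping"
fi
# Remove "$WORK" by hand when finished.
```

Any warnings tidy raises end up in `sample.err`, in the same way the conversion script collects them per document through `-f`.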