This document outlines some ideas for document conversion on Linux and Mac OS X platforms using command line tools. Distribute documents as plain text using UTF-8 encoding whenever possible. Everyone should embrace the mantra "plain text is beautiful".
Use the file command to obtain basic metadata for most file formats. For image files, make sure you have ImageMagick installed, then use the identify command to extract image metadata.
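As a minimal sketch, the two helpers below wrap file and identify. The function names are hypothetical, and the options assume the common libmagic-based file and ImageMagick's identify.

```shell
# Hypothetical helpers; the names are illustrative, not part of either tool.

# Print the MIME type and character encoding detected by file,
# for example "text/plain; charset=utf-8".
mime_info() {
    file --brief --mime "$1"
}

# Print width, height, and format of an image using ImageMagick.
image_info() {
    identify -format '%w x %h %m\n' "$1"
}
```

Checking the reported charset before running a conversion is a cheap way to avoid converting a file that is already UTF-8.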
Use the iconv command to convert plain text from one encoding to another. The basic usage is
$ iconv -c -f <source_encoding> -t <target_encoding> input.txt > output.txt
The -c option discards characters that cannot be converted, and the angle brackets mark placeholders for required values. For a list of supported encodings, run
$ iconv -l
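A concrete case may help: the sketch below converts a Latin-1 file to UTF-8. The helper name to_utf8 and the file names are assumptions made for illustration.

```shell
# Hypothetical helper: convert a file from a given encoding to UTF-8,
# writing the result to standard output.
#   $1 = source encoding (as listed by `iconv -l`)
#   $2 = input file
to_utf8() {
    iconv -f "$1" -t UTF-8 "$2"
}

# Example usage (latin1.txt is an assumed file name):
#   to_utf8 ISO-8859-1 latin1.txt > utf8.txt
```

The -c option is left out here on purpose, so that iconv stops with an error on an unconvertible character instead of silently dropping it.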
The Poppler library (https://poppler.freedesktop.org/), based on Xpdf, comes with a suite of PDF tools. Use the pdftotext command to extract text from a PDF file, assuming a text layer exists.
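For a whole directory of PDFs, a loop along these lines is one option. The function name pdfs_to_text is hypothetical; the sketch assumes Poppler's pdftotext is on the PATH.

```shell
# Hypothetical helper: extract the text layer of every PDF in a directory
# into a .txt file with the same base name.
#   $1 = directory to scan (defaults to the current directory)
pdfs_to_text() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory holds no PDFs.
        [ -e "$pdf" ] || continue
        # -layout keeps the physical page layout; -enc sets the output encoding.
        pdftotext -layout -enc UTF-8 "$pdf" "${pdf%.pdf}.txt"
    done
}
```

Scanned PDFs without a text layer will produce empty or nearly empty output files.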
Use the html2text command (http://www.mbayer.de/html2text/) to extract text from an HTML file.
DjVuLibre (http://djvu.sourceforge.net/), an open source DjVu library and viewer, comes with a suite of command line utilities. Use the djvutxt command to extract text from a DjVu file, assuming a text layer exists.
Use UnRTF (https://www.gnu.org/software/unrtf/) to extract text from RTF files.
Install the xml-twig-tools package, which provides the xml_grep command.
Use xml_grep to extract text from an XML document:
$ xml_grep example.xml --text_only
To extract text only from mytag elements:
$ xml_grep 'mytag' example.xml --text_only
Use the textutil command to convert plain text to rtf, rtfd, html, doc, docx, odt, wordml, and webarchive formats. The -info option extracts basic metadata from files in these formats. textutil is built on the Cocoa framework, so it is available only on Mac OS X, not on Linux.
Use the cupsfilter command to convert non-PDF formats to PDF.
Use the enscript command (http://www.linuxfromscratch.org/blfs/view/svn/pst/enscript.html) to convert text files to PostScript, HTML, and RTF. Unfortunately, enscript does not support UTF-8 encoding.
Use the paps command (http://paps.sourceforge.net/) to convert UTF-8 plain text files to PostScript. paps requires the Pango library (http://www.pango.org/).
Use the pandoc command (http://pandoc.org/) to convert among popular markup formats.
Note that pandoc supports the newer XML-based docx MS Word format but not the older OLE-based doc MS Word format.
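A typical pandoc invocation might look like the sketch below. The helper name md_to_html and the file names are assumptions; the options shown (-f, -t, -s, -o) are standard pandoc options.

```shell
# Hypothetical helper: convert a Markdown file to a standalone HTML page
# written next to the input file.
md_to_html() {
    # -f and -t name the input and output formats, -s produces a complete
    # document with header and footer, -o names the output file.
    pandoc -f markdown -t html -s -o "${1%.*}.html" "$1"
}

# A docx file can be read directly; the input format is inferred from the
# file extension (report.docx is an assumed file name):
#   pandoc report.docx -t plain -o report.txt
```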
Use the textutil command to convert among txt, rtf, rtfd, html, doc, docx, odt, wordml, and webarchive formats.
Use the cupsfilter command to convert TXT to PDF and HTML to PDF.
If you have LibreOffice installed on your system, you can run the soffice command in headless mode to convert documents:
$ soffice --headless --convert-to <TargetFileExtension>[:<NameOfFilter>] input_file.xxx
Note that the square brackets around :<NameOfFilter> mean that this part is optional. The output file will be named input_file.TargetFileExtension. On the Windows command line, the convert-to parameter uses only one dash.
Please refer to the LibreOffice documentation for details: https://help.libreoffice.org/Common/Starting_the_Software_With_Parameters
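The naming rule for soffice output can be sketched as follows. Both function names are hypothetical, and the sketch assumes soffice is on the PATH; --outdir is the LibreOffice option that selects the output directory.

```shell
# Hypothetical helper: predict the name of the converted file, which is
# the input base name plus the target extension. An optional ":FilterName"
# suffix is not part of the file name.
#   $1 = input file, $2 = target extension, optionally with :FilterName
converted_name() {
    base=$(basename "$1")
    printf '%s.%s\n' "${base%.*}" "${2%%:*}"
}

# Hypothetical helper: convert a document to PDF in headless mode.
#   $1 = input file, $2 = output directory (defaults to the current one)
to_pdf() {
    soffice --headless --convert-to pdf --outdir "${2:-.}" "$1"
}

# Example usage (report.docx is an assumed file name):
#   to_pdf report.docx out && ls "out/$(converted_name report.docx pdf)"
```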
Use the pstopdf command to convert PostScript to PDF.
Use the djvutoxml command from the DjVuLibre library (http://djvu.sourceforge.net/) to convert DjVu to XML.
Use UnRTF to convert RTF files to HTML files. UnRTF also supports LaTeX and ASCII plain text output.
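The three UnRTF output formats mentioned above can be selected as in this sketch. The wrapper name rtf_to and the file names are assumptions; --html, --text, and --latex are unrtf's own output options.

```shell
# Hypothetical helper: convert an RTF file with UnRTF, writing to
# standard output.
#   $1 = output format (html, text, or latex), $2 = input file
rtf_to() {
    case "$1" in
        html|text|latex)
            unrtf "--$1" "$2"
            ;;
        *)
            echo "unsupported format: $1" >&2
            return 2
            ;;
    esac
}

# Example usage (letter.rtf is an assumed file name):
#   rtf_to html letter.rtf > letter.html
```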