@eyecatchup
Last active May 14, 2021 20:39
Convert PDF files to Microsoft Office Word compatible doc/docx files, using LibreOffice's command line interface.
#!C:/Python27/python.exe
#
# Convert PDF files to Microsoft Office Word compatible doc/docx files,
# using LibreOffice's command line interface.
#
# http://stackoverflow.com/questions/26358281/convert-pdf-to-doc-python-bash
# http://ask.libreoffice.org/en/question/20111/converting-files-using-soffice-convert-to-with-embedded-images-html-to-doc/
# http://cgit.freedesktop.org/libreoffice/core/tree/filter/source/config/fragments/filters
#
import os
import sys
import subprocess
# pdf source file(s) and target paths
basedir = 'C:/path/to'
pdfdir = os.path.normpath(basedir + '/pdf')
docdir = os.path.normpath(basedir + '/doc')
docxdir = os.path.normpath(basedir + '/docx')
# absolute path to libre office writer application
lowriter = 'C:/Progra~2/LibreO~1/program/swriter.exe'
# output-filter for conversion
#outfilter = ':"Office Open XML Text"'
#outfilter = ':"MS Word 2003 XML"'
#outfilter = ':"MS Word 2007 XML"'
#outfilter = ':"MS Word 97"'
outfilter = ''
i = 0
for top, dirs, files in os.walk(pdfdir):
    for filename in files:
        if filename.endswith('.pdf'):
            i = i + 1
            abspath_pdf = os.path.normpath(os.path.join(top, filename))
            # convert to .doc; print() with one argument works on Python 2 and 3
            print('Converting {0} into .doc format..'.format(abspath_pdf))
            subprocess.call('{0} --invisible --convert-to doc{1} --outdir "{2}" "{3}"'
                            .format(lowriter, outfilter, docdir, abspath_pdf), shell=True)
            # convert to .docx
            print('Converting {0} into .docx format..'.format(abspath_pdf))
            subprocess.call('{0} --invisible --convert-to docx{1} --outdir "{2}" "{3}"'
                            .format(lowriter, outfilter, docxdir, abspath_pdf), shell=True)
            print('|-------------------------------------------------------|')
print('Done. Converted {0} pdf files.'.format(i))
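
The script above builds each command as a single shell string, which depends on the paths being quoted correctly. As a minimal sketch (not the author's code), the same command can be built as an argument list so that no shell quoting is needed. The helper names `build_convert_command` and `convert` are hypothetical. `--headless` is used here in place of `--invisible`, and the optional `in_filter` value (for example `writer_pdf_import`, which some users report is needed for PDF input) is an assumption to verify against your LibreOffice version.

```python
import subprocess


def build_convert_command(soffice, pdf_path, out_dir, target="docx",
                          out_filter=None, in_filter=None):
    """Return the LibreOffice command as an argument list (hypothetical helper)."""
    cmd = [soffice, "--headless"]
    if in_filter:
        # Assumption: e.g. "writer_pdf_import"; check it against your LibreOffice version.
        cmd.append("--infilter={0}".format(in_filter))
    # An output filter is appended after a colon, e.g. "doc:MS Word 97".
    convert_to = target if not out_filter else "{0}:{1}".format(target, out_filter)
    cmd += ["--convert-to", convert_to, "--outdir", out_dir, pdf_path]
    return cmd


def convert(soffice, pdf_path, out_dir, **kwargs):
    """Run the conversion without a shell and return LibreOffice's exit code."""
    return subprocess.call(build_convert_command(soffice, pdf_path, out_dir, **kwargs))
```

Because the arguments are passed as a list, paths containing spaces do not need the short `Progra~2` form or extra quotes.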
@Alf71

Alf71 commented Feb 4, 2017

using LibreOffice's command line interface???

@mrfambo

mrfambo commented Dec 22, 2017

Is anyone active here? Is there any way of doing the same thing in Node.js?

@eyecatchup
Author

@alimuqaddas The actual work here is done by LibreOffice. The Python code could be written in any other language. Try something like this:

    const exec = require('child_process').exec;
    const lowriter = 'C:/Progra~2/LibreO~1/program/swriter.exe';
    
    const outDir = 'C:/abs_outpath/';
    const srcDir = 'C:/abs_srcpath/';

    let files = ['file1.pdf', 'file2.pdf'];
    files.map(pdffile => {
        let cmd = lowriter + ' --invisible --convert-to docx --outdir "' + outDir + '" "' + srcDir + pdffile + '"';
        exec(cmd, function(error, stdout, stderr) {
            // command output is in stdout
        });
    });

@sameerCoder

Is there any other way to do this in Python, without using LibreOffice?

@jainayush007

I need the answer to this too, please. Has anyone achieved this without LibreOffice?

@arishtanemi3007

Hi, I'm trying to convert a PDF that contains both text and images into a Word document. Is this approach suitable for that?
Also, is there any alternative to using LibreOffice?

@abhishekjha7798

abhishekjha7798 commented Jun 18, 2019

Converting /home/user/PycharmProjects/Test/myfile.pdf into .doc format..
/bin/sh: 1: /usr/share/man/man1/lowriter.1.gz: Exec format error

Converting /home/user/PycharmProjects/Test/myfile.pdf into .docx format..
Done. Converted 1 pdf files.
/bin/sh: 1: /usr/share/man/man1/lowriter.1.gz: Exec format error

I was trying to convert a PDF file to doc following this post.
The error above appeared when I executed the program.

Please help me with this.
Thank you.
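
The path in the error message, `/usr/share/man/man1/lowriter.1.gz`, is a compressed manual page rather than a program, which would explain the `Exec format error`: the `lowriter` variable has to point at the LibreOffice executable itself. A minimal sketch for locating it on `PATH` with Python 3 follows. The function name `find_libreoffice` and the list of candidate command names are assumptions, since the command name varies between installations.

```python
import shutil


def find_libreoffice(candidates=("soffice", "libreoffice", "lowriter")):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)  # None when the command is not on PATH
        if path:
            return path
    return None


if __name__ == "__main__":
    lowriter = find_libreoffice()
    if lowriter is None:
        print("LibreOffice executable not found on PATH")
    else:
        print("Using {0}".format(lowriter))
```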

@PAVITHRA-CP

PAVITHRA-CP commented Oct 17, 2019

Can anyone help me do the same on Ubuntu 16.x? I want to convert PDF to docx without losing font, font size, bold, etc.
Please help with this.
Thank you.

@yashugupta786

Can anyone help me do the same? I want to convert PDF to docx without losing formatting and layout.
Please help with this.
Thank you.

@pptyasar

pptyasar commented Feb 3, 2020

@yashgupta did you find a solution?

@yashugupta786

@pptyasar No, as of now I don't have a solution. Do you have one? Basically, I have scanned invoices in PDF and want to convert them to docx as text, retaining all the fonts, formatting, and layout.

@pptyasar

pptyasar commented Feb 24, 2020

@WNT01

WNT01 commented Mar 11, 2020

Apart from running this on the command line, can I run it in Jupyter or PyCharm?

@amansani

amansani commented Jun 9, 2020

I tried converting PDF to docx and it didn't work. It works from docx to PDF and from PDF to ODG.

https://stackoverflow.com/questions/49739245/having-trouble-using-python-and-libreoffice-to-convert-pdf-to-docx-and-doc-to-do

Based on the above link, it seems that LibreOffice 6 and higher do not support converting PDF to doc/docx. Does anyone have an idea about this?
Please share any other ways to convert PDF to docx using Python.

Thank you.

@daominhthuan42

Please try the pdf2docx library.
It is easy to use.
https://dothinking.github.io/pdf2docx/quickstart.convert.html
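
A minimal sketch of what that could look like, assuming pdf2docx is installed (`pip install pdf2docx`). It uses the library's `Converter` class; the helper names `docx_path_for` and `convert_with_pdf2docx` are hypothetical and only illustrate how one PDF might be mapped to an output file.

```python
import os


def docx_path_for(pdf_path, out_dir):
    """Map e.g. 'in/file1.pdf' to '<out_dir>/file1.docx' (hypothetical helper)."""
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, base + ".docx")


def convert_with_pdf2docx(pdf_path, out_dir):
    """Convert one PDF to docx with pdf2docx and return the output path."""
    # Imported here so the path helper above works without pdf2docx installed.
    from pdf2docx import Converter

    docx_path = docx_path_for(pdf_path, out_dir)
    cv = Converter(pdf_path)
    try:
        cv.convert(docx_path)  # converts all pages by default
    finally:
        cv.close()
    return docx_path
```

This runs entirely in Python, so it does not need LibreOffice to be installed.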
