@kzinmr
Forked from etienned/extractdocx.py
Created July 16, 2019 03:52
Simple function to extract text from an MS XML Word document (.docx) without any third-party dependencies.
"""
Module that extracts text from an MS XML Word document (.docx).
(Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
"""
try:
    from xml.etree.cElementTree import XML
except ImportError:
    # cElementTree was removed in Python 3.9; fall back to ElementTree.
    from xml.etree.ElementTree import XML
import zipfile

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    # Read the main document part; the with block closes the archive even on error.
    with zipfile.ZipFile(path) as document:
        xml_content = document.read('word/document.xml')
    tree = XML(xml_content)

    paragraphs = []
    # iter() replaces getiterator(), which was removed in Python 3.9.
    for paragraph in tree.iter(PARA):
        # Join the text runs of one paragraph, skipping empty nodes.
        texts = [node.text
                 for node in paragraph.iter(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)
kzinmr commented Jul 16, 2019

If no text comes back, check the contents of tree and modify WORD_NAMESPACE to match; some files use a different namespace (like '{http://schemas.microsoft.com/office/word/2003/wordml}').
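
One way to avoid editing the constant by hand is to read the namespace off the root element of the parsed tree. The following is only a sketch, not part of the original gist: `detect_namespace` and `get_docx_text_auto_ns` are hypothetical names, and it assumes the paragraph and text elements live in the same namespace as the root element, which may not hold for every file.

```python
import zipfile
from xml.etree.ElementTree import XML


def detect_namespace(root):
    """Return the '{uri}' prefix of the root element's tag, or '' if it has none."""
    if root.tag.startswith('{'):
        return root.tag[:root.tag.index('}') + 1]
    return ''


def get_docx_text_auto_ns(path_or_file):
    """Hypothetical variant of get_docx_text that derives the namespace from the tree."""
    with zipfile.ZipFile(path_or_file) as document:
        tree = XML(document.read('word/document.xml'))

    # Assumption: 'p' and 't' elements share the root element's namespace.
    namespace = detect_namespace(tree)

    paragraphs = []
    for paragraph in tree.iter(namespace + 'p'):
        texts = [node.text
                 for node in paragraph.iter(namespace + 't')
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))
    return '\n\n'.join(paragraphs)
```

Printing `tree.tag` on a problem file is still the quickest check of what the namespace is before relying on automatic detection.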
