@conholdate-gists
Last active October 10, 2024 03:02
Extract Text from Word Documents using Java

Learn how to extract text from Word documents using Java: https://blog.conholdate.com/2021/10/13/extract-text-from-word-documents-using-java/

The following topics are covered in this article:

  1. Extract Text from Word Documents using Java
  2. Extract Text from Document Pages using Java
  3. Get Text From Specific Index using Java
  4. Extract Formatted Text from DOCX using Java
  5. Extract Text by Table of Contents using Java

// Extract formatted text (HTML) from a DOCX file
// Create an instance of Parser class
try (Parser parser = new Parser("C:\\Files\\sample.docx")) {
    // Extract formatted text into the reader
    try (TextReader reader = parser.getFormattedText(new FormattedTextOptions(FormattedTextMode.Html))) {
        // Print the formatted text from the document
        // If formatted text extraction isn't supported, the reader is null
        System.out.println(reader == null ? "Formatted text extraction isn't supported" : reader.readToEnd());
    }
}
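
Printing the extracted HTML to the console is handy for a quick check, but in practice you will often want to keep it. The following is a minimal sketch, assuming the HTML has already been read into a string (for example, from `reader.readToEnd()`). The class `SaveHtmlSketch` and the method `saveHtml` are hypothetical names used only for illustration; they are not part of the parser library and rely on the Java standard library alone.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveHtmlSketch {

    // Hypothetical helper: writes the given HTML string to the target path as UTF-8,
    // creating the file or overwriting it if it already exists.
    public static Path saveHtml(String html, Path target) throws IOException {
        return Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for a real output location
        Path out = Files.createTempFile("extracted-", ".html");
        // A literal string stands in for the text returned by the reader
        saveHtml("<p>Hello <b>world</b></p>", out);
        System.out.println("Saved to " + out);
    }
}
```

Writing the bytes explicitly as UTF-8 avoids depending on the platform default encoding, which matters when the document contains non-ASCII characters.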

// Get text from a specific index (highlight)
// Create an instance of Parser class
try (Parser parser = new Parser("C:\\Files\\sample.docx")) {
    // Extract a highlight of up to 8 characters, starting at position 0
    HighlightItem hl = parser.getHighlight(0, true, new HighlightOptions(8));
    // Check if highlight extraction is supported
    if (hl == null) {
        System.out.println("Highlight extraction isn't supported");
        return;
    }
    // Print the extracted highlight with its position
    System.out.println(String.format("At %d: %s", hl.getPosition(), hl.getText()));
}
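
To make the idea of a highlight more concrete, here is a minimal plain-Java sketch that takes a bounded slice of text forward or backward from a given position. The class `HighlightSketch` and its `highlight` method are hypothetical and exist only to illustrate the concept on an ordinary string; this is not how the parser library implements `getHighlight`, and the real method may treat boundaries differently.

```java
public class HighlightSketch {

    // Hypothetical helper: returns up to maxLength characters that start at
    // position (forward == true) or end at position (forward == false).
    // Out-of-range positions are clamped to the bounds of the text.
    public static String highlight(String text, int position, boolean forward, int maxLength) {
        if (text == null || maxLength <= 0) {
            return "";
        }
        int pos = Math.max(0, Math.min(position, text.length()));
        if (forward) {
            return text.substring(pos, Math.min(text.length(), pos + maxLength));
        }
        return text.substring(Math.max(0, pos - maxLength), pos);
    }

    public static void main(String[] args) {
        String text = "Extract text from Word documents";
        // Eight characters forward from index 0
        System.out.println("[" + highlight(text, 0, true, 8) + "]");   // prints "[Extract ]"
        // Four characters backward from index 12
        System.out.println("[" + highlight(text, 12, false, 4) + "]"); // prints "[text]"
    }
}
```

Clamping the position and length keeps the helper from throwing on short inputs, which is usually what you want when showing a snippet of surrounding text.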

// Extract raw text from a Word document
// Create an instance of Parser class; try-with-resources closes it when done
try (Parser parser = new Parser("C:\\Files\\sample.docx")) {
    // Extract raw text into the reader
    try (TextReader reader = parser.getText()) {
        // Print the text from the document
        // If text extraction isn't supported, the reader is null
        System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
    }
}

// Extract text from document pages
// Create an instance of Parser class; try-with-resources closes it when done
try (Parser parser = new Parser("C:\\Files\\sample.docx")) {
    // Check if the document supports text extraction
    if (!parser.getFeatures().isText()) {
        System.out.println("The document doesn't support text extraction.");
        return;
    }
    // Get the document info
    IDocumentInfo documentInfo = parser.getDocumentInfo();
    // Check if the document has pages
    if (documentInfo.getPageCount() == 0) {
        System.out.println("The document has zero pages.");
        return;
    }
    // Iterate over pages (page indexes are zero-based)
    for (int p = 0; p < documentInfo.getPageCount(); p++) {
        // Print the one-based page number and the total page count
        System.out.println(String.format("Page number: %d/%d", p + 1, documentInfo.getPageCount()));
        // Extract the text of the current page into the reader
        try (TextReader reader = parser.getText(p)) {
            // Print the text of the page
            // No null check is needed here because text extraction support was verified above
            System.out.println(reader.readToEnd());
        }
    }
}

// Extract text by table of contents
// Create an instance of Parser class
try (Parser parser = new Parser("C:\\Files\\sampleTOC.docx")) {
    // Get the table of contents
    Iterable<TocItem> tocItems = parser.getToc();
    // Check if table of contents extraction is supported
    if (tocItems == null) {
        System.out.println("Table of contents extraction isn't supported");
    } else {
        // Iterate over the items
        for (TocItem tocItem : tocItems) {
            // Print the text of the chapter
            try (TextReader reader = tocItem.extractText()) {
                System.out.println("----");
                System.out.println(reader.readToEnd());
            }
        }
    }
}