@prabhatkashyap
Last active January 15, 2017 07:47
Parse Documents with Apache Tika (DOC, DOCX, PDF and Many More)
Download Apache Tika Jar: http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.14.jar
Example: https://tika.apache.org/1.8/examples.html
Command line (writes XHTML, the default output format, to test.html): java -jar tika-app-1.14.jar test.docx > test.html
Java Code:
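import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Place the statements below inside a method; the imports go at the top of the file.
String target = "File Path"; // path of the document to parse
Parser parser = new AutoDetectParser(); // detects the file type and delegates to a matching parser
// ContentHandler handler = new BodyContentHandler(); // alternative: plain text of the body
ContentHandler handler = new ToXMLContentHandler(); // collects the parsed content as XHTML
Metadata metadata = new Metadata(); // filled with document properties during parsing
// try-with-resources closes the stream even if parsing fails
try (InputStream stream = new FileInputStream(target)) {
    parser.parse(stream, handler, metadata, new ParseContext());
} catch (IOException | SAXException | TikaException e) {
    // FileNotFoundException is a subclass of IOException, so it is caught here too
    e.printStackTrace();
}
// System.out.println(metadata);
System.out.println(handler.toString());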
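The command-line example near the top can also be launched from Java when adding Tika as a library dependency is not an option. The following is only a minimal sketch: the class name TikaCli and its helper methods are hypothetical (they are not part of Tika), and it assumes that `java` is on the PATH and that the jar and the document are in the working directory.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Apache Tika.
class TikaCli {
    // Builds the same command as the command-line example: java -jar <jar> <document>
    static List<String> buildCommand(String tikaJar, String document) {
        return Arrays.asList("java", "-jar", tikaJar, document);
    }

    // Runs the command and writes its standard output to outputFile, like "> test.html".
    static int convert(String tikaJar, String document, File outputFile)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(tikaJar, document));
        builder.redirectOutput(outputFile);                      // stdout goes to the file
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);  // stderr stays on the console
        Process process = builder.start();
        return process.waitFor();                                // exit code of the child JVM
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: TikaCli <tika-app jar> <document> <output file>");
            return;
        }
        int exitCode = convert(args[0], args[1], new File(args[2]));
        System.out.println("tika-app exited with code " + exitCode);
    }
}
```

For example, `java TikaCli tika-app-1.14.jar test.docx test.html` would reproduce the command-line call. Starting a separate JVM per document is slow, so the in-process parser code above is usually the better choice when many files need to be parsed.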