@amferraz
Created May 8, 2014 12:09
A sample app that converts a file to HTML with Apache Tika
package jusbrasil.test_tika;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ExpandedTitleContentHandler;
import org.xml.sax.SAXException;

import com.google.common.io.Files;

public class Main {

    public static void main(String[] args)
            throws IOException, TransformerConfigurationException, SAXException, TikaException {
        // Read the whole input document into memory (Guava helper).
        byte[] file = Files.toByteArray(new File("/path/to/my/file.doc"));

        // AutoDetectParser picks a concrete parser based on the detected file type.
        AutoDetectParser tikaParser = new AutoDetectParser();
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // Build a SAX handler that serializes the incoming events as indented UTF-8 HTML.
        SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        handler.setResult(new StreamResult(out));

        // Wrap the serializer so that an empty <title/> is written as <title></title>.
        ExpandedTitleContentHandler handler1 = new ExpandedTitleContentHandler(handler);

        // Parse the document; Tika emits XHTML SAX events into the handler chain.
        tikaParser.parse(new ByteArrayInputStream(file), handler1, new Metadata());

        System.out.println(new String(out.toByteArray(), "UTF-8"));
    }
}
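The serialization half of the listing above (the `TransformerHandler` configured with `OutputKeys.METHOD` set to `html`) can be tried on its own, without Tika on the classpath. The following is a minimal sketch under that assumption: it feeds a few hand-written SAX events into the same kind of handler, standing in for the events that `AutoDetectParser` would emit. The class name `SaxToHtmlSketch` and the method `render` are hypothetical names used only for illustration, and it relies on the JDK's default transformer implementation supporting SAX, as the listing's cast already assumes.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper: not part of the gist and not part of Tika.
public class SaxToHtmlSketch {

    // Emits one element containing only text, e.g. <p>text</p>.
    private static void textElement(TransformerHandler h, String name, String text) throws Exception {
        h.startElement("", name, name, new AttributesImpl());
        char[] chars = text.toCharArray();
        h.characters(chars, 0, chars.length);
        h.endElement("", name, name);
    }

    // Serializes a tiny hand-made SAX event stream to an HTML string.
    static String render(String title, String paragraph) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // Same handler setup as in the listing, minus indentation.
        SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        handler.setResult(new StreamResult(out));

        // These calls stand in for the events a parser would produce.
        AttributesImpl noAttrs = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "html", "html", noAttrs);
        handler.startElement("", "head", "head", noAttrs);
        textElement(handler, "title", title);
        handler.endElement("", "head", "head");
        handler.startElement("", "body", "body", noAttrs);
        textElement(handler, "p", paragraph);
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();

        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("Demo", "Hello from SAX"));
    }
}
```

Because the handler only sees SAX events, the quality of the resulting HTML depends on what the upstream parser emits, not on this serializer.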
@avni1209

avni1209 commented Jul 1, 2016

Does this work for PDF and Word?

@abhishekchaudhary996

How can we do it in Scala?

@vjvipulvj

I tried this code, but the formatting of the resulting HTML is not very good compared with the original document. I think the output is better when doing the same conversion with Apache POI, but the problem with Apache POI is that I have not found a proper solution for .docx to .html conversion.
