@algotrader-dotcom
Last active October 15, 2015 06:52
Apache Tika: extract metadata and text from PDF, XLS, and DOC files
# 1. Installation
git clone https://github.com/apache/tika.git
cd tika
export M2_HOME=/opt/apache-maven/apache-maven-2.2.1
mvn install
....
# 2. Usage
## Standalone command line
Usage: java -jar tika-app.jar /path/to/file
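If you want to drive the standalone jar from a script, a thin wrapper like the sketch below may help. It assumes `java` is on the PATH and that `tika-app.jar` sits in the working directory. The helper names (`tika_app_command`, `run_tika_app`) are made up for this example; `--text`, `--metadata`, `--json` and `--xml` are output options of the tika-app command line.

```python
import subprocess

# Output modes mapped to tika-app command-line options.
TIKA_APP_MODES = {
    "text": "--text",          # plain text content
    "metadata": "--metadata",  # metadata only
    "json": "--json",          # metadata as JSON
    "xml": "--xml",            # XHTML content
}


def tika_app_command(file_path, mode="text", jar_path="tika-app.jar"):
    """Build the argv list for one tika-app run (hypothetical helper)."""
    if mode not in TIKA_APP_MODES:
        raise ValueError("unknown mode: %r" % (mode,))
    return ["java", "-jar", jar_path, TIKA_APP_MODES[mode], file_path]


def run_tika_app(file_path, mode="text", jar_path="tika-app.jar"):
    """Run tika-app and return whatever it writes to stdout as a string."""
    result = subprocess.run(
        tika_app_command(file_path, mode, jar_path),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if java exits non-zero
    )
    return result.stdout.decode("utf-8")
```

For example, `run_tika_app("/path/to/file", mode="metadata")` would return the metadata listing for that file. Each call starts a new JVM, so for many files the server mode is likely the better fit.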
Start Tika in server mode (run it inside tmux to keep the session alive):
java -jar tika-server-1.11-SNAPSHOT.jar --host=localhost --port=12345
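Once the server is up, any HTTP client can talk to it. The sketch below uses only the Python standard library and assumes the server is listening on localhost:12345 as started above. It relies on the tika-server REST resources `/tika` (extracted text) and `/meta` (metadata), both of which accept the document body via PUT. The function names are invented for this example.

```python
import json
import urllib.request


def build_tika_request(data, endpoint, accept, host="localhost", port=12345):
    """Build a PUT request for a tika-server resource (hypothetical helper).

    data     -- raw bytes of the document to analyse
    endpoint -- "tika" for extracted text, "meta" for metadata
    accept   -- desired response format, sent as the Accept header
    """
    url = "http://{}:{}/{}".format(host, port, endpoint)
    return urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": accept}
    )


def extract_text(path, host="localhost", port=12345):
    """Send a file to /tika and return the extracted plain text."""
    with open(path, "rb") as f:
        req = build_tika_request(f.read(), "tika", "text/plain", host, port)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def extract_metadata(path, host="localhost", port=12345):
    """Send a file to /meta and return the metadata as a dict."""
    with open(path, "rb") as f:
        req = build_tika_request(f.read(), "meta", "application/json", host, port)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `extract_text("/path/to/file")` should return the document text and `extract_metadata("/path/to/file")` a dictionary of metadata fields. The request building is kept separate from the network call so it can be checked without a live server.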
# 3. Client
Python: https://github.com/chrismattmann/tika-python
PHP: ...