Created February 9, 2014 18:55
Extracts text from multiple documents using Apache Tika
import glob
import subprocess

# USAGE:
# 1. Download the Tika command-line tool (the tika-app jar) from http://tika.apache.org/download.html.
# 2. Put the files you want to process in the same directory as the jar.
# 3. Put this script in the same directory (make sure you have Python and Java installed).
# 4. On the command line, run "python dotika.py".
# If Tika can extract your files, a new file with the extension .new
# is created for each file matching the "extension" filter (see
# the code below). This script does nothing but automate the extraction
# process.
#
# Here are the default values.
# If you need a different format or encoding, change these values.
# Be sure to read this first: http://tika.apache.org/1.4/gettingstarted.html
# (especially the section "Using Tika as a command line utility").
encoding = "UTF-8"
outputformat = "--text"
extension = "*.doc"

files = glob.glob(extension)
for filename in files:
    newfile = filename + ".new"
    print(newfile)
    # Pass the arguments as a list so that file names containing spaces work,
    # and send Tika's standard output to the new file.
    with open(newfile, "wb") as out:
        subprocess.call(["java", "-jar", "tika-app-1.4.jar", outputformat,
                         "--encoding=" + encoding, filename], stdout=out)
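The usage notes say to change the default values if you need a different format or encoding. As a rough sketch of how that could be made a little more flexible, the variation below builds the Tika command in a helper function and processes several file patterns in one run. The names `build_tika_command` and `extract_all` are hypothetical and not part of the original script. The jar name is the one the script already uses, and the sketch assumes that `java` is on your PATH and that Tika returns a non-zero exit status when extraction fails.

```python
import glob
import subprocess

TIKA_JAR = "tika-app-1.4.jar"  # same jar name as in the script above


def build_tika_command(path, outputformat="--text", encoding="UTF-8", jar=TIKA_JAR):
    """Return the argument list for one Tika run (hypothetical helper)."""
    return ["java", "-jar", jar, outputformat, "--encoding=" + encoding, path]


def extract_all(patterns, outputformat="--text", encoding="UTF-8"):
    """Run Tika on every file matching any pattern.

    Writes <file>.new next to each input file and returns the list of
    input files for which the command reported a non-zero exit status.
    """
    failed = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            newfile = path + ".new"
            with open(newfile, "wb") as out:
                status = subprocess.call(
                    build_tika_command(path, outputformat, encoding), stdout=out
                )
            if status != 0:
                failed.append(path)
    return failed
```

For example, `extract_all(["*.doc", "*.pdf"])` would process both Word and PDF files in the current directory, and `extract_all(["*.doc"], outputformat="--html")` would ask Tika for HTML output instead of plain text.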
I wrote this simple script to help colleagues take advantage of the Tika functionality without having to use a programming language. Since none of us are developers (we are linguists), I wanted to keep it as simple and self-explanatory as possible.