Created
December 16, 2013 22:04
def sendtoelasticsearch(self,targeturl,docurl,pdftext,pdfhash,scrapedatetime):
Where:
targeturl = "http://henrietta.org/" # the root url that was scraped
docurl = "http://henrietta.org/mydoc.pdf" # the url of the pdf document
pdftext = <the converted text of the pdf>
pdfhash = md5(pdftext)
scrapedatetime = the datetime of when the pdf was scraped/downloaded
I would like to send the text to Elasticsearch to be indexed so I can perform queries on it from a website.
Using this example:
http://www.elasticsearch.org/blog/unleash-the-clients-ruby-python-php-perl/#python
It looks like that is exactly what I want to do ... but the es.index() function takes an id, which appears to be a number in the example. Can this be the docurl?
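For what it's worth, Elasticsearch document IDs are strings, so the ID does not have to be a number. Below is a minimal sketch of one way this could look, not a definitive implementation. It assumes a client object `es` exposing an `index()` method as in the linked example. The index name `pdfdocs`, the helpers `make_doc_id` and `build_document`, and the choice to hash the docurl (to keep the ID short and free of URL characters) are illustrative assumptions, and the exact keyword arguments accepted by `es.index()` vary between client versions.

```python
import hashlib
from datetime import datetime


def make_doc_id(docurl):
    # Hypothetical helper: derive a stable, fixed-length string ID from the
    # document URL, so re-indexing the same URL overwrites the same document.
    return hashlib.md5(docurl.encode("utf-8")).hexdigest()


def build_document(targeturl, docurl, pdftext, pdfhash, scrapedatetime):
    # Hypothetical helper: the JSON-serializable body to be indexed.
    return {
        "targeturl": targeturl,
        "docurl": docurl,
        "pdftext": pdftext,
        "pdfhash": pdfhash,
        "scrapedatetime": scrapedatetime.isoformat(),
    }


def sendtoelasticsearch(es, targeturl, docurl, pdftext, pdfhash,
                        scrapedatetime, index="pdfdocs"):
    # `es` is assumed to be an Elasticsearch client instance; "pdfdocs" is a
    # made-up index name. Keyword names (e.g. `body`) depend on the client
    # version in use, so check them against the installed client.
    doc = build_document(targeturl, docurl, pdftext, pdfhash, scrapedatetime)
    return es.index(index=index, id=make_doc_id(docurl), body=doc)
```

The original docurl is still stored in the document body, so it can be returned in search results even though the ID itself is a hash.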