Created
December 9, 2011 05:02
-
-
Save vwood/1450254 to your computer and use it in GitHub Desktop.
Short Python3 Search Engine
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/usr/bin/env python | |
# Short indexer and search engine, 2011 Vaughn Wood | |
# Written in python 3 | |
# Assumes the collection is a newline separated file called index on the current path | |
# Only uses TF (so perform queries accordingly) | |
from functools import * | |
def dict_join_add_values(d, e): | |
"""Given two dictionaries, add the items in the second to the first. | |
If the same key exists in both, the associated values (integers) are added together.""" | |
return d.update({k: e[k]+d.get(k,0) for k in e.keys()}) or d | |
def make_tf_list(doc, docno): | |
"""Create a dictionary of terms and their frequency in a given document.""" | |
return {k: {docno: v} for k,v in reduce(dict_join_add_values, [{w: 1} for w in doc], {}).items()} | |
def dict_join_documents(d, e): | |
"""Given two dictionaries, add the items in the second to the first. | |
If the same key exists in both, the associated values (dicts) are unioned together.""" | |
return d.update({k: (e[k].update(d.get(k,{})) or e[k]) for k in e.keys()}) or d | |
def index(docs): | |
"""Given a list of documents (lists of words), create a dictionary of terms to | |
postings (document ids and term frequencies)""" | |
return reduce(dict_join_documents, [make_tf_list(doc, i) for i, doc in enumerate(docs)], {}) | |
def search(terms, index): | |
"""Join the postings for each term in the query, then sort them according to their score.""" | |
return reversed(sorted(list(reduce(dict_join_add_values, [index.get(term, {}) for term in terms], {}).items()), key=lambda x: x[1])) | |
idx = index([d.split() for d in open("index").read().split('\n')]) | |
while (True): | |
[print("%s\t%s"%(d,s)) for d, s in search(input('> ').split(), idx)] |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
I resent Guido's view that reduce is bad. Neither is it true that "There aren't a whole lot of associative operators." It would be nice if I could have punned the usage of + as both integer addition for dict_join_add_values and union for dict_join_documents.
As written, these are all expressions, and as such the program could fit on a single line, if:
1, Reduce was properly supported by python
2, We removed the loop (index and then search for a single query)
3, The definition of dict_join_add_values gets repeated.
4, It would be nice if methods like d.update would return the updated dictionary. I'm aware they do this to warn that it is a destructive operation (used here for efficiency) but appending ! to the names of such functions is much better.
If anyone could add IDF calculation and make it use tf.idf, I'd be grateful.
Finally, as this is a list of transforms between kinds of structures, I would love to see this done in a declarative style. Unfortunately I'm unaware of a suitable language.