@eightysteele
Created August 6, 2011 02:28
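import simplejson  # assumed import; the gist does not show where simplejson comes from

# CORPUS_STOP_CONCEPTS and STOP_WORDS are expected to be defined elsewhere in the module.

def get_corpus_list():
    def wrapper(value, bulkload_state):
        """Returns list of unique words in the entire record.

        Arguments:
            value - the JSON encoded record
            bulkload_state - state object carrying current_dictionary
        """
        # Merge this record into the dictionary carried on the bulkload state.
        d = bulkload_state.current_dictionary
        recjson = simplejson.loads(value)
        d.update(recjson)
        bulkload_state.current_dictionary = d
        # Whole values, skipping stop concepts and stop words.
        corpus = set([x.strip().lower() for concept, x in recjson.iteritems()
                      if concept not in CORPUS_STOP_CONCEPTS and x not in STOP_WORDS])
        # Add the individual whitespace-separated tokens of every value.
        corpus.update(
            reduce(lambda x, y: x + y,
                   map(lambda x: [s.strip().lower() for s in x.split() if s],
                       recjson.values()),
                   []))  # initial [] avoids a TypeError on an empty record
        return list(corpus)
    return wrapper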
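
To make the wrapper's output concrete, here is a minimal standalone sketch of the same logic, runnable without the bulkloader. The stop lists, the `FakeBulkloadState` stub, the `corpus_for_record` name, and the sample record are all hypothetical stand-ins, not values from the original code. It uses the standard `json` module instead of `simplejson` and returns a sorted list so the result is deterministic.

```python
import json

# Hypothetical stand-ins; the gist does not show the real values.
CORPUS_STOP_CONCEPTS = {"id"}
STOP_WORDS = {"the"}


class FakeBulkloadState(object):
    """Hypothetical stub for the bulkloader state object."""

    def __init__(self):
        self.current_dictionary = {}


def corpus_for_record(value, bulkload_state):
    """Standalone restatement of the wrapper's logic, for illustration only."""
    rec = json.loads(value)
    bulkload_state.current_dictionary.update(rec)
    # Whole values, skipping stop concepts and stop words.
    corpus = set(
        v.strip().lower()
        for concept, v in rec.items()
        if concept not in CORPUS_STOP_CONCEPTS and v not in STOP_WORDS
    )
    # Individual tokens of every value (stop concepts are not filtered here).
    for v in rec.values():
        corpus.update(tok.strip().lower() for tok in v.split())
    return sorted(corpus)  # sorted only to make the output deterministic


state = FakeBulkloadState()
words = corpus_for_record(
    '{"id": "A1", "genus": "Puma", "name": "Puma concolor"}', state)
print(words)  # ['a1', 'concolor', 'puma', 'puma concolor']
```

Note that `a1` still appears in the result: the stop-concept check applies only to whole values, while the tokenizing step runs over every value in the record.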