"""Convert JSON to an immutable representation so that equality can be tested
without regard for order."""
import json

class decoder(json.JSONDecoder):
    # http://stackoverflow.com/questions/10885238/python-change-list-type-for-json-decoding
    def __init__(self, list_type=list, **kwargs):
        json.JSONDecoder.__init__(self, **kwargs)
        # Use the custom JSONArray
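
The preview above is cut off before the conversion itself, so the following is only a minimal sketch of the idea in the docstring, not the gist's implementation. It assumes that arrays should be compared as unordered collections with duplicates counted; the names `freeze` and `json_equal` are hypothetical and do not appear in the original.

```python
import json
from collections import Counter


def freeze(value):
    """Recursively convert decoded JSON into a hashable, order-insensitive form.

    Hypothetical helper: objects and arrays are tagged so that they can never
    compare equal to each other by accident.
    """
    if isinstance(value, dict):
        pairs = frozenset((key, freeze(item)) for key, item in value.items())
        return ('object', pairs)
    if isinstance(value, list):
        # Count duplicates so that [1, 1, 2] and [1, 2] stay distinct.
        counts = Counter(freeze(item) for item in value)
        return ('array', frozenset(counts.items()))
    # Strings, numbers, booleans and None are already immutable.
    return value


def json_equal(text_a, text_b):
    """Compare two JSON documents without regard for order."""
    return freeze(json.loads(text_a)) == freeze(json.loads(text_b))
```

Because the frozen form is hashable, it can also be used as a dictionary key or set member. One caveat of this sketch: Python treats `True == 1`, so JSON `true` and `1` compare equal here.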
# requires sdsl:
# git clone https://github.com/simongog/sdsl-lite.git
# cd sdsl-lite
# ./install.sh $HOME/.local
# uses pv to display progress (not essential)
# http://www.ivarch.com/programs/pv.shtml
all: fm-index indices
# -*- coding: UTF-8 -*-
"""Preprocessing of text files.
Writes one paragraph per line, and normalizes punctuation & whitespace.
No sentence or word tokenization.
Usage: preprocess.py [FILE]
  or: preprocess.py --batch FILES...
By default, produce cleaned version given a single filename to standard output.
Diagnostic information is written to standard error.
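
The preview stops inside the docstring, so the script's actual logic is not shown. As a rough illustration of "one paragraph per line" with whitespace normalization, here is a minimal sketch. It assumes paragraphs are separated by blank lines, it leaves out punctuation normalization entirely, and the function name `paragraphs_per_line` is hypothetical.

```python
import re


def paragraphs_per_line(text):
    """Return text with one paragraph per line and whitespace collapsed.

    Assumption: a paragraph break is one or more blank lines.
    """
    paragraphs = re.split(r'\n\s*\n', text)
    # str.split() without arguments drops leading/trailing whitespace and
    # treats any run of spaces, tabs or newlines as a single separator.
    cleaned = (' '.join(paragraph.split()) for paragraph in paragraphs)
    return '\n'.join(paragraph for paragraph in cleaned if paragraph)
```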
"""Extract several BOW models from a corpus of text files.
The models are stored in Matrix Market format which can be read
by gensim. The texts are read from .txt files in the directory
specified as TOPDIR. The output is written to the current directory."""
# NB: All strings are utf8 (not unicode).
import os
import glob
import nltk
import gensim
"""Benchmark of indexing of line offsets in text file.
Usage example:
>>> index = indexfile_iter('1027.txt')
>>> index[5]
115
>>> import bisect
>>> bisect.bisect(index, 115) - 1
5
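
The body of `indexfile_iter` is not part of this preview. The usage example suggests an index in which entry `n` is the byte offset where line `n` starts, with `bisect` mapping an offset back to its line number. A minimal sketch under that assumption follows; `index_line_offsets` and `line_at_offset` are hypothetical names, not the gist's functions.

```python
import bisect


def index_line_offsets(path):
    """Return a list whose n-th entry is the byte offset where line n starts."""
    offsets = []
    position = 0
    # Binary mode keeps offsets in bytes, so they are usable with seek().
    with open(path, 'rb') as stream:
        for line in stream:
            offsets.append(position)
            position += len(line)
    return offsets


def line_at_offset(offsets, offset):
    """Return the 0-based number of the line that contains the byte offset."""
    # bisect.bisect returns the insertion point to the right of any equal
    # entry, so subtracting one gives the line starting at or before offset.
    return bisect.bisect(offsets, offset) - 1
```

With the index in memory, `stream.seek(offsets[n])` followed by `stream.readline()` reads line `n` without scanning the lines before it.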