@andreasvc
andreasvc / TopicModeling.ipynb
Created October 23, 2014 20:51
Topic Modeling with gensim. Load in ipython notebook or view online: http://nbviewer.ipython.org/gist/andreasvc/66fe7547b05569c9a273
@andreasvc
andreasvc / 1027.txt.mrg.gz
Last active December 30, 2022 00:35
A tutorial on using tree fragments for text classification. http://nbviewer.ipython.org/gist/andreasvc/9467e27680d8950045b2
@andreasvc
andreasvc / jsoneq.py
Last active August 29, 2015 14:09
Unordered equality test of JSON data
"""Convert JSON to an immutable representation so that equality can be tested
without regard for order."""
import json
class decoder(json.JSONDecoder):
    # http://stackoverflow.com/questions/10885238/python-change-list-type-for-json-decoding
    def __init__(self, list_type=list, **kwargs):
        json.JSONDecoder.__init__(self, **kwargs)
        # Use the custom JSONArray
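The preview above is cut off, so the following is only a minimal sketch of the idea in the docstring, not the gist's actual implementation. It does not use a custom decoder; instead it freezes the decoded data after the fact. The names `freeze` and `json_equal` are hypothetical, and treating arrays as multisets (order ignored, duplicate counts kept) is an assumption about what "without regard for order" should mean.

```python
import json
from collections import Counter


def freeze(obj):
    """Recursively convert decoded JSON into a hashable, order-free value."""
    if isinstance(obj, dict):
        # An object becomes a frozenset of (key, frozen value) pairs.
        return ('object', frozenset(
            (key, freeze(val)) for key, val in obj.items()))
    if isinstance(obj, list):
        # Assumption: arrays are compared as multisets, so element order
        # is ignored but the number of occurrences still matters.
        return ('array', frozenset(
            Counter(freeze(item) for item in obj).items()))
    # str, int, float, bool and None are already immutable.
    # Note that Python treats True == 1, so this sketch does too.
    return obj


def json_equal(a, b):
    """Compare two JSON strings without regard for order."""
    return freeze(json.loads(a)) == freeze(json.loads(b))
```

For example, `json_equal('{"a": [1, 2], "b": 3}', '{"b": 3, "a": [2, 1]}')` holds, while `json_equal('[1, 1, 2]', '[1, 2, 2]')` does not, because the counts differ.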
# requires sdsl:
# git clone https://github.com/simongog/sdsl-lite.git
# cd sdsl-lite
# ./install.sh $HOME/.local
# uses pv to display progress (not essential)
# http://www.ivarch.com/programs/pv.shtml
all: fm-index indices
# -*- coding: UTF-8 -*-
"""Preprocessing of text files.
Writes one paragraph per line, and normalizes punctuation & whitespace.
No sentence or word tokenization.
Usage: preprocess.py [FILE]
or: preprocess.py --batch FILES...
By default, produce cleaned version given a single filename to standard output.
Diagnostic information is written to standard error.
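The rest of preprocess.py is not shown here. As a rough illustration of "one paragraph per line" with normalized punctuation and whitespace, the sketch below assumes that paragraphs are separated by blank lines and that normalization means mapping a few typographic characters to ASCII. Both the function name `clean` and the choice of characters in `PUNCT` are hypothetical; the original script may normalize differently.

```python
import re

# Assumed normalization table: curly quotes, dashes and the ellipsis
# character are mapped to plain ASCII equivalents.
PUNCT = str.maketrans({
    '\u2018': "'", '\u2019': "'",
    '\u201c': '"', '\u201d': '"',
    '\u2013': '-', '\u2014': '-',
    '\u2026': '...',
})


def clean(text):
    """Return text with one paragraph per line and collapsed whitespace."""
    result = []
    # A paragraph break is one or more blank (whitespace-only) lines.
    for par in re.split(r'\n\s*\n', text):
        # split() without arguments collapses all runs of whitespace,
        # including line breaks inside the paragraph.
        par = ' '.join(par.translate(PUNCT).split())
        if par:
            result.append(par)
    return '\n'.join(result)
```

No sentence or word tokenization happens in this sketch, in line with the docstring above.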
@andreasvc
andreasvc / bow.py
Created July 8, 2015 15:58
Extract Bag-of-Words (BOW) models from a corpus of text files.
"""Extract several BOW models from a corpus of text files.
The models are stored in Matrix Market format which can be read
by gensim. The texts are read from .txt files in the directory
specified as TOPDIR. The output is written to the current directory."""
# NB: All strings are utf8 (not unicode).
import os
import glob
import nltk
import gensim
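The body of bow.py is not shown in the preview. To make the output format concrete, here is a small sketch that builds a bag-of-words matrix and writes it in Matrix Market coordinate format by hand, using only the standard library. It is not the gist's code: the function name `bow_matrix_market` is hypothetical, the input is assumed to be already tokenized, and in practice gensim's `MmCorpus.serialize` would normally do the writing.

```python
from collections import Counter


def bow_matrix_market(docs):
    """Build a BOW matrix from token lists.

    Returns (vocab, text), where vocab maps each token to a 1-based
    column id and text is the matrix in Matrix Market coordinate format
    with one row per document."""
    vocab = {}
    entries = []
    for docid, tokens in enumerate(docs, 1):
        counts = Counter(tokens)
        for token in sorted(counts):
            # Assign the next free 1-based id to unseen tokens.
            termid = vocab.setdefault(token, len(vocab) + 1)
            entries.append((docid, termid, counts[token]))
    lines = [
        '%%MatrixMarket matrix coordinate real general',
        # Size line: number of rows, columns and non-zero entries.
        '%d %d %d' % (len(docs), len(vocab), len(entries)),
    ]
    lines.extend('%d %d %d' % entry for entry in entries)
    return vocab, '\n'.join(lines) + '\n'
```

Each data line reads "document id, term id, count", all 1-based, which is the sparse layout the Matrix Market format prescribes.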
@andreasvc
andreasvc / lineidx.py
Last active August 29, 2015 14:26
Benchmark of indexing of line offsets in text file.
"""Benchmark of indexing of line offsets in text file.
Usage example:
>>> index = indexfile_iter('1027.txt')
>>> index[5]
115
>>> import bisect
>>> bisect.bisect(index, 115) - 1
5
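The usage example above suggests that the index is a sorted list in which entry i is the offset at which line i starts, so that `bisect` can map an offset back to a line number. The sketch below shows one straightforward way to build and use such an index. It is not the benchmarked `indexfile_iter` itself; the function names are hypothetical, and it assumes byte offsets with no extra end-of-file entry.

```python
import bisect


def index_line_offsets(path):
    """Return a list where entry i is the byte offset of the start of line i."""
    offsets = []
    pos = 0
    with open(path, 'rb') as inp:
        for line in inp:
            offsets.append(pos)
            pos += len(line)  # includes the line terminator
    return offsets


def read_line(path, offsets, lineno):
    """Fetch a single line by seeking directly to its offset."""
    with open(path, 'rb') as inp:
        inp.seek(offsets[lineno])
        return inp.readline()


def line_of_offset(offsets, offset):
    """Map a byte offset to the number of the line that contains it."""
    # bisect returns the insertion point to the right of any equal entry,
    # so subtracting one gives the last line starting at or before offset.
    return bisect.bisect(offsets, offset) - 1
```

Because the offsets are sorted, the reverse lookup is a binary search and stays fast even for large files.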