Andreas van Cranenburgh andreasvc

andreasvc / TopicModeling.ipynb

Created October 23, 2014 20:51

Topic Modeling with gensim. Load in ipython notebook or view online: http://nbviewer.ipython.org/gist/andreasvc/66fe7547b05569c9a273

andreasvc / DH-crash-course-riddle.ipynb

Last active August 29, 2015 14:08

Genre Classification with a Bag-of-Words model. See http://nbviewer.ipython.org/gist/andreasvc/5d9b17fb981ee2a8b728

andreasvc / 1027.txt.mrg.gz

Last active December 30, 2022 00:35

A tutorial on using tree fragments for text classification. http://nbviewer.ipython.org/gist/andreasvc/9467e27680d8950045b2

andreasvc / jsoneq.py

Last active August 29, 2015 14:09

Unordered equality test of JSON data

	"""Convert JSON to an immutable representation so that equality can be tested
	without regard for order."""
	import json


	class decoder(json.JSONDecoder):
		# http://stackoverflow.com/questions/10885238/python-change-list-type-for-json-decoding
		def __init__(self, list_type=list, **kwargs):
			json.JSONDecoder.__init__(self, **kwargs)
			# Use the custom JSONArray
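
The preview above is cut off, so the following is only a minimal sketch of the general idea, not the gist's actual code. It assumes that "without regard for order" applies to both object keys and array elements. The names `freeze` and `json_equal` are hypothetical. Instead of a custom decoder, it converts the decoded data afterwards.

```python
import json
from collections import Counter


def freeze(obj):
    """Recursively convert decoded JSON into a hashable, order-free value."""
    if isinstance(obj, dict):
        # Tag with 'dict' so an object can never compare equal to an array.
        return ('dict', frozenset(
            (key, freeze(val)) for key, val in obj.items()))
    if isinstance(obj, list):
        # Count the frozen elements so that duplicates are preserved
        # (a multiset), while the original order is discarded.
        return ('list', frozenset(
            Counter(freeze(item) for item in obj).items()))
    # Scalars (str, int, float, bool, None) are already immutable.
    # Caveat: Python treats True == 1 and 1 == 1.0 as equal.
    return obj


def json_equal(a, b):
    """Compare two JSON strings, ignoring key order and array order."""
    return freeze(json.loads(a)) == freeze(json.loads(b))
```

With this sketch, `json_equal('{"a": [1, 2]}', '{"a": [2, 1]}')` is true, while `json_equal('[1, 1, 2]', '[1, 2, 2]')` is false because element counts differ.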

andreasvc / Makefile

Last active August 29, 2015 14:09


	# requires sdsl-lite:
	# git clone https://github.com/simongog/sdsl-lite.git
	# cd sdsl-lite
	# ./install.sh $HOME/.local

	# uses pv to display progress (not essential)
	# http://www.ivarch.com/programs/pv.shtml

	all: fm-index indices

andreasvc / treedraw.ipynb

Created December 9, 2014 12:04

andreasvc / Fragments in TSG derivations.ipynb

Last active July 4, 2018 15:12

andreasvc / preprocess.py

Last active February 8, 2022 09:28

	# -*- coding: UTF-8 -*-
	"""Preprocessing of text files.
	Writes one paragraph per line, and normalizes punctuation & whitespace.
	No sentence or word tokenization.

	Usage: preprocess.py [FILE]
	or: preprocess.py --batch FILES...

	By default, produce cleaned version given a single filename to standard output.
	Diagnostic information is written to standard error.
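
The docstring above describes the output format but the preview stops before any code. As a rough illustration of "one paragraph per line" with normalized punctuation and whitespace, here is a small sketch. It assumes that paragraphs are separated by blank lines and that punctuation normalization means mapping curly quotes to straight ones; the script itself may do more or differ. The function name `paragraphs_per_line` is hypothetical.

```python
import re

# Assumed normalization: curly quotes become plain ASCII quotes.
QUOTES = str.maketrans({
    '\u2018': "'", '\u2019': "'",
    '\u201c': '"', '\u201d': '"',
})


def paragraphs_per_line(text):
    """Return text with each paragraph on one line, whitespace collapsed."""
    text = text.translate(QUOTES)
    result = []
    # A paragraph boundary is a line break followed by an empty
    # (or whitespace-only) line.
    for para in re.split(r'\n\s*\n', text):
        # str.split() without arguments collapses all runs of whitespace,
        # including the line breaks inside a wrapped paragraph.
        para = ' '.join(para.split())
        if para:
            result.append(para)
    return '\n'.join(result)
```

No sentence or word tokenization is attempted here, in line with the description above.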

andreasvc / bow.py

Created July 8, 2015 15:58

Extract Bag-of-Words (BOW) models from a corpus of text files.

	"""Extract several BOW models from a corpus of text files.

	The models are stored in Matrix Market format which can be read
	by gensim. The texts are read from .txt files in the directory
	specified as TOPDIR. The output is written to the current directory."""
	# NB: All strings are utf8 (not unicode).
	import os
	import glob
	import nltk
	import gensim

andreasvc / lineidx.py

Last active August 29, 2015 14:26

Benchmark of indexing of line offsets in text file.

	"""Benchmark of indexing of line offsets in text file.

	Usage example:

	>>> index = indexfile_iter('1027.txt')
	>>> index[5]
	115
	>>> import bisect
	>>> bisect.bisect(index, 115) - 1
	5
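
The usage example above can be illustrated with a small self-contained sketch. This is not the gist's `indexfile_iter` (its body is not shown here); `line_offsets` is a hypothetical stand-in that records the byte offset at which each line starts, assuming the file is opened in binary mode.

```python
import bisect
import io


def line_offsets(fileobj):
    """Return a list with the byte offset at which each line starts."""
    offsets = []
    pos = 0
    for line in fileobj:
        offsets.append(pos)
        pos += len(line)  # length in bytes, since the file is binary
    return offsets


# An in-memory file stands in for a real text file.
data = io.BytesIO(b"first\nsecond line\nthird\n")
index = line_offsets(data)

# index[n] is the offset of line n (0-based). Because the list is sorted,
# bisect maps any byte offset back to the line that contains it.
lineno = bisect.bisect(index, 8) - 1
```

Looking up a line's offset is a plain list access, and the reverse lookup is a binary search, so both stay fast even for large files once the index is built.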