by Philip Herron
Birmingham: Packt Publishing, 2013, available in print and as an ebook; this review is based on the PDF, 110 pp.
Reviewed by Andreas van Cranenburgh, University of Amsterdam
// ==UserScript==
// @name Gmane vertical frames
// @namespace [email protected]
// @include http://news.gmane.org/*
// @include http://thread.gmane.org/*
// @version 1
// @grant none
// ==/UserScript==
// The default Gmane 'news' view has horizontal panes, which waste a lot of
// screen space; this script turns the layout sideways so the panes are vertical.
// (Sketch: the original script body was not preserved; this assumes the page
// is laid out with a <frameset rows="...">.)
var frameset = document.getElementsByTagName('frameset')[0];
if (frameset && frameset.rows) {
    frameset.cols = frameset.rows;
    frameset.removeAttribute('rows');
}
"""Extract metadata from Project Gutenberg RDF catalog into a Python dict. | |
Based on https://bitbucket.org/c-w/gutenberg/ | |
>>> md = readmetadata() | |
>>> md[123] | |
{'LCC': {'PS'}, | |
'author': u'Burroughs, Edgar Rice', | |
'authoryearofbirth': 1875, | |
'authoryearofdeath': 1950, |
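The general approach of walking an RDF record and pulling fields into a plain dict can be sketched schematically. This is a hedged illustration only: the XML below uses a made-up namespace and element names, not the actual Project Gutenberg catalog schema.

```python
# Schematic sketch with a hypothetical minimal RDF/XML record (invented
# namespace 'pg:'; the real Gutenberg catalog uses different predicates).
import xml.etree.ElementTree as ET

record = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:pg="http://example.org/pg/">
  <rdf:Description rdf:about="ebooks/123">
    <pg:author>Burroughs, Edgar Rice</pg:author>
    <pg:authoryearofbirth>1875</pg:authoryearofbirth>
  </rdf:Description>
</rdf:RDF>"""

ns = {'pg': 'http://example.org/pg/'}
desc = ET.fromstring(record)[0]  # the single rdf:Description element
md = {
    'author': desc.find('pg:author', ns).text,
    'authoryearofbirth': int(desc.find('pg:authoryearofbirth', ns).text),
}
print(md)
```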
"""Apply PCA to a CSV file and plot its datapoints (one per line). | |
The first column should be a category (determines the color of each datapoint), | |
the second a label (shown alongside each datapoint).""" | |
import sys | |
import pandas | |
import pylab as pl | |
from sklearn import preprocessing | |
from sklearn.decomposition import PCA |
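The fragment ends after the imports. As a hedged sketch of how such a script's body might continue, here a small synthetic DataFrame (hypothetical column names) stands in for the CSV: scale the features, project to two components, and scatter-plot with per-category colors and per-point labels.

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no display required
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing
from sklearn.decomposition import PCA

# synthetic stand-in for the CSV: category, label, then numeric features
df = pd.DataFrame({
    'category': ['a', 'a', 'b', 'b'],
    'label': ['p1', 'p2', 'p3', 'p4'],
    'x': [1.0, 2.0, 8.0, 9.0],
    'y': [1.5, 2.5, 7.5, 8.5],
})
features = preprocessing.scale(df[df.columns[2:]].to_numpy())
reduced = PCA(n_components=2).fit_transform(features)

colors = {'a': 'red', 'b': 'blue'}  # arbitrary color per category
for (px, py), cat, lab in zip(reduced, df['category'], df['label']):
    plt.scatter(px, py, color=colors[cat])
    plt.annotate(lab, (px, py))
plt.savefig('pca.png')
print(reduced.shape)
```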
These scripts produce the train-dev-test splits for the Tiger & Lassy treebanks
used in my 2013 IWPT paper. The Tiger treebank was version 2.1, namely
tiger_release_aug07.export; the Lassy treebank was version 1.1, lassy-r19749.
The reason for not simply taking the last 20% as the development & test set is
to ensure a balanced sample of sentences; a contiguous tail would have an
uneven distribution of sentence lengths & topics.
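A balanced split of that kind can be sketched as interleaving, where every tenth sentence goes to dev and the following one to test, so both sets mirror the corpus-wide distribution of lengths and topics. This is an assumption for illustration, not necessarily the paper's exact recipe:

```python
# Interleaved 80/10/10 split (illustrative sketch, not the exact recipe):
# picking dev/test sentences at regular intervals spreads them evenly over
# the corpus, unlike taking the contiguous final 20%.
sentences = list(range(100))  # stand-in for the treebank's sentences

train = [s for n, s in enumerate(sentences) if n % 10 < 8]
dev = [s for n, s in enumerate(sentences) if n % 10 == 8]
test = [s for n, s in enumerate(sentences) if n % 10 == 9]
print(len(train), len(dev), len(test))  # 80 10 10
```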
""" A simple multiprocessing example with process pools, shared data and | |
per-process initialization. """ | |
import multiprocessing | |
# global read-only data can be shared by each process | |
DATA = 11 | |
def initworker(a): | |
""" Initialize data specific to each process. """ | |
global MOREDATA |
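The fragment breaks off before the pool itself. The following self-contained sketch shows the full pattern the docstring promises: a read-only global, a per-process initializer, and a pool of workers; the worker function `work` and the initializer argument are hypothetical.

```python
import multiprocessing

# global read-only data, inherited by each worker process
DATA = 11

def initworker(a):
    """Initialize data specific to each process."""
    global MOREDATA
    MOREDATA = a

def work(x):
    # hypothetical worker combining inherited and per-process data
    return x * DATA + MOREDATA

def main():
    # each worker runs initworker(100) once before processing tasks
    with multiprocessing.Pool(2, initializer=initworker,
                              initargs=(100,)) as pool:
        return pool.map(work, [1, 2, 3])

if __name__ == '__main__':
    print(main())  # [111, 122, 133]
```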
""" Classify rows from CSV files with SVM with leave-one-out cross-validation; | |
labels taken from first column, of the form 'label_description'. """ | |
import sys | |
import pandas | |
from sklearn import svm, cross_validation, preprocessing | |
data = pandas.read_csv(sys.argv[1]) | |
xdata = data.as_matrix(data.columns[1:]) | |
#xdata = preprocessing.scale(xdata) # normalize data => mean of 0, stddev of 1 | |
ylabels = [a.split('_')[0] for a in data.icol(0)] | |
ytarget = preprocessing.LabelEncoder().fit(ylabels).transform(ylabels) |