by Philip Herron
Birmingham: Packt Publishing, 2013, available in print and as an ebook; this review is based on the PDF, 110 pp.
Reviewed by Andreas van Cranenburgh, University of Amsterdam
// ==UserScript==
// @name Gmane vertical frames
// @namespace [email protected]
// @include http://news.gmane.org/*
// @include http://thread.gmane.org/*
// @version 1
// @grant none
// ==/UserScript==
// The default Gmane 'news' view has horizontal panes, which waste a lot of
// screen space; this script turns the layout sideways so the panes are vertical.
// (Sketch: the original script body was not preserved; this assumes the page
// is laid out with a <frameset rows="...">.)
var frameset = document.getElementsByTagName('frameset')[0];
if (frameset && frameset.rows) {
    frameset.cols = frameset.rows;
    frameset.removeAttribute('rows');
}
"""Extract metadata from Project Gutenberg RDF catalog into a Python dict. | |
Based on https://bitbucket.org/c-w/gutenberg/ | |
>>> md = readmetadata() | |
>>> md[123] | |
{'LCC': {'PS'}, | |
'author': u'Burroughs, Edgar Rice', | |
'authoryearofbirth': 1875, | |
'authoryearofdeath': 1950, |
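The general approach of walking an RDF record and pulling fields into a plain dict can be sketched schematically. This is a hedged illustration only: the XML below uses a made-up namespace and element names, not the actual Project Gutenberg catalog schema.

```python
# Schematic sketch with a hypothetical minimal RDF/XML record (invented
# namespace 'pg:'; the real Gutenberg catalog uses different predicates).
import xml.etree.ElementTree as ET

record = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:pg="http://example.org/pg/">
  <rdf:Description rdf:about="ebooks/123">
    <pg:author>Burroughs, Edgar Rice</pg:author>
    <pg:authoryearofbirth>1875</pg:authoryearofbirth>
  </rdf:Description>
</rdf:RDF>"""

ns = {'pg': 'http://example.org/pg/'}
desc = ET.fromstring(record)[0]  # the single rdf:Description element
md = {
    'author': desc.find('pg:author', ns).text,
    'authoryearofbirth': int(desc.find('pg:authoryearofbirth', ns).text),
}
print(md)
```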
"""Apply PCA to a CSV file and plot its datapoints (one per line). | |
The first column should be a category (determines the color of each datapoint), | |
the second a label (shown alongside each datapoint).""" | |
import sys | |
import pandas | |
import pylab as pl | |
from sklearn import preprocessing | |
from sklearn.decomposition import PCA |
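The fragment ends after the imports. As a hedged sketch of how such a script's body might continue, here a small synthetic DataFrame (hypothetical column names) stands in for the CSV: scale the features, project to two components, and scatter-plot with per-category colors and per-point labels.

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no display required
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing
from sklearn.decomposition import PCA

# synthetic stand-in for the CSV: category, label, then numeric features
df = pd.DataFrame({
    'category': ['a', 'a', 'b', 'b'],
    'label': ['p1', 'p2', 'p3', 'p4'],
    'x': [1.0, 2.0, 8.0, 9.0],
    'y': [1.5, 2.5, 7.5, 8.5],
})
features = preprocessing.scale(df[df.columns[2:]].to_numpy())
reduced = PCA(n_components=2).fit_transform(features)

colors = {'a': 'red', 'b': 'blue'}  # arbitrary color per category
for (px, py), cat, lab in zip(reduced, df['category'], df['label']):
    plt.scatter(px, py, color=colors[cat])
    plt.annotate(lab, (px, py))
plt.savefig('pca.png')
print(reduced.shape)
```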
These scripts produce the train-dev-test splits for the Tiger & Lassy treebanks
used in my 2013 IWPT paper. The Tiger treebank was version 2.1, namely
tiger_release_aug07.export; the Lassy treebank was version 1.1, lassy-r19749.
The reason for not simply taking the last 20% as the development & test set is
to ensure a balanced sample of sentences; a contiguous tail would have an
uneven distribution of sentence lengths & topics.
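A balanced split of that kind can be sketched as interleaving, where every tenth sentence goes to dev and the following one to test, so both sets mirror the corpus-wide distribution of lengths and topics. This is an assumption for illustration, not necessarily the paper's exact recipe:

```python
# Interleaved 80/10/10 split (illustrative sketch, not the exact recipe):
# picking dev/test sentences at regular intervals spreads them evenly over
# the corpus, unlike taking the contiguous final 20%.
sentences = list(range(100))  # stand-in for the treebank's sentences

train = [s for n, s in enumerate(sentences) if n % 10 < 8]
dev = [s for n, s in enumerate(sentences) if n % 10 == 8]
test = [s for n, s in enumerate(sentences) if n % 10 == 9]
print(len(train), len(dev), len(test))  # 80 10 10
```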
""" A simple multiprocessing example with process pools, shared data and | |
per-process initialization. """ | |
import multiprocessing | |
# global read-only data can be shared by each process | |
DATA = 11 | |
def initworker(a): | |
""" Initialize data specific to each process. """ | |
global MOREDATA |
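The fragment breaks off before the pool itself. The following self-contained sketch shows the full pattern the docstring promises: a read-only global, a per-process initializer, and a pool of workers; the worker function `work` and the initializer argument are hypothetical.

```python
import multiprocessing

# global read-only data, inherited by each worker process
DATA = 11

def initworker(a):
    """Initialize data specific to each process."""
    global MOREDATA
    MOREDATA = a

def work(x):
    # hypothetical worker combining inherited and per-process data
    return x * DATA + MOREDATA

def main():
    # each worker runs initworker(100) once before processing tasks
    with multiprocessing.Pool(2, initializer=initworker,
                              initargs=(100,)) as pool:
        return pool.map(work, [1, 2, 3])

if __name__ == '__main__':
    print(main())  # [111, 122, 133]
```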
""" Classify rows from CSV files with SVM with leave-one-out cross-validation; | |
labels taken from first column, of the form 'label_description'. """ | |
import sys | |
import pandas | |
from sklearn import svm, cross_validation, preprocessing | |
data = pandas.read_csv(sys.argv[1]) | |
xdata = data.as_matrix(data.columns[1:]) | |
#xdata = preprocessing.scale(xdata) # normalize data => mean of 0, stddev of 1 | |
ylabels = [a.split('_')[0] for a in data.icol(0)] | |
ytarget = preprocessing.LabelEncoder().fit(ylabels).transform(ylabels) |