Emlyn Corrin emlyn

@emlyn
emlyn / hashtest.clj
Last active December 26, 2015 10:28
Clojure Hash Test
(def testset (for [a (range 20)
                   b (range 20)
                   c (range 20)
                   d (range 20)]
               (conj #{[a b]} [c d])))
(def testvec (for [a (range 20)
                   b (range 20)
                   c (range 20)
                   d (range 20)]
               [[a b] [c d]]))
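The preview does not show how these two collections are used. Assuming the aim is to compare the number of distinct values with the number of distinct hash codes, a rough Python analogue might look like the sketch below. The helper name `collision_report` is hypothetical, and Python's `hash` stands in for Clojure's, so the collision counts will not match what Clojure reports.

```python
from itertools import product

# Python analogue of testset: a set holding [a b] and [c d]
# (it collapses to a single element when the two pairs are equal).
testset = [frozenset({(a, b), (c, d)})
           for a, b, c, d in product(range(20), repeat=4)]

# Python analogue of testvec: an ordered pair of pairs.
testvec = [((a, b), (c, d))
           for a, b, c, d in product(range(20), repeat=4)]


def collision_report(items):
    """Return (distinct values, distinct hash codes) for a collection."""
    distinct_values = len(set(items))
    distinct_hashes = len({hash(item) for item in items})
    return distinct_values, distinct_hashes


for name, coll in (("testset", testset), ("testvec", testvec)):
    values, hashes = collision_report(coll)
    print(f"{name}: {values} distinct values, {hashes} distinct hashes")
```

The closer the second number is to the first, the fewer collisions the hash function produces for that shape of data.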
emlyn / wikioccs2stanford.py
Last active December 19, 2015 02:38
Convert Wikipedia occurrences file generated by DBpedia Spotlight ExtractOccsFromWikipedia (& ExtractCandidateMap) to a format suitable for training Stanford NER (note: occurrences file must not be uri-sorted).
#!/usr/bin/env python
import argparse
from nltk.tokenize import wordpunct_tokenize

def split_parts(fulltext, entities):
    text = fulltext
    shift = 0
    done = 0
    parts = []
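The preview cuts off before the conversion logic, so the following is only a minimal sketch of the general idea, not the gist's actual implementation. It assumes entities arrive as character spans with a label, and it emits the tab-separated token/label lines that Stanford NER accepts for training. The names `label_tokens` and `to_stanford_lines` are hypothetical, and a plain regex (the same pattern NLTK's `wordpunct_tokenize` is built on) replaces the NLTK import to keep the example dependency-free.

```python
import re

# Pattern used by NLTK's WordPunctTokenizer: runs of word characters,
# or runs of non-word, non-space characters.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def label_tokens(text, entities, default="O"):
    """Pair each token with a label.

    entities is a list of (start, end, label) character spans (an assumed
    input shape); tokens falling outside every span get the default label.
    """
    pairs = []
    for match in TOKEN_RE.finditer(text):
        label = default
        for start, end, ent_label in entities:
            if match.start() >= start and match.end() <= end:
                label = ent_label
                break
        pairs.append((match.group(), label))
    return pairs


def to_stanford_lines(pairs):
    """Render (token, label) pairs as one tab-separated pair per line."""
    return "\n".join(f"{token}\t{label}" for token, label in pairs)


sample = "Berlin is the capital of Germany."
spans = [(0, 6, "LOCATION"), (25, 32, "LOCATION")]
print(to_stanford_lines(label_tokens(sample, spans)))
```

In a real conversion the spans would come from the offset and surface-form columns of the occurrences file; the label to use for each URI is a separate design decision that this sketch leaves open.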
emlyn / extract-text.clj
Last active December 18, 2015 12:28 — forked from jashmenn/extract-text.clj
Extract the text from a web page using the Jericho HTML parser in Clojure. Run with 'lein one-off extract-text.clj filename.html'.
#_(defdeps [[net.htmlparser.jericho/jericho-html "3.1"]])
(ns foo.preprocess
  (:import [java.io File BufferedInputStream FileInputStream]
           [net.htmlparser.jericho Source TextExtractor HTMLElementName]))
(defn my-text-extractor [source]
  (proxy [TextExtractor] [source]
    (excludeElement [tag]
      (= (.getName tag) HTMLElementName/PRE))))
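The extractor above overrides `excludeElement` so that `pre` elements are left out of the extracted text. For readers without Jericho to hand, the same idea can be sketched with Python's standard-library `html.parser`. This swaps in a different parser and is not equivalent to Jericho's `TextExtractor` in other respects. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextSkippingPre(HTMLParser):
    """Collect text content, ignoring anything nested inside <pre>."""

    def __init__(self):
        super().__init__()
        self.pre_depth = 0   # how many <pre> elements are currently open
        self.chunks = []     # text fragments seen outside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth > 0:
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.pre_depth == 0:
            self.chunks.append(data)


def extract_text(html):
    """Return whitespace-normalised text from html, excluding <pre> blocks."""
    parser = TextSkippingPre()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


print(extract_text("<p>Hello <b>world</b></p><pre>code here</pre><p>bye</p>"))
```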
emlyn / gbaltgr
Last active December 16, 2015 08:28
Custom keyboard layout: UK international with AltGr dead keys
partial alphanumeric_keys
xkb_symbols "altgr-gb" {
    name[Group1]= "English (UK, international AltGr dead keys)";
    include "latin"
    key <TLDE> { [ grave, notsign, dead_grave, bar ] };
    key <AE01> { [ 1, exclam, exclamdown, onesuperior ] };
    key <AE02> { [ 2, quotedbl, dead_diaeresis, twosuperior ] };