# Written by Brendan O'Connor, [email protected], www.anyall.org
# * Originally written Aug. 2005
# * Posted to gist.github.com/16173 on Oct. 2008
# Copyright (c) 2003-2006 Open Source Applications Foundation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
name,score_skewz,score_svd,url,v1,v2,v3,v4,v5
The Politico,-0.133333333333333,-0.069840595513546,politico.com,-0.0579919888228,-0.0156533209161,-0.0118276408031,-0.000672353189093,0.00899951990495
Right Wing Nut House,0.666666666666667,0.016997861495122,rightwingnuthouse.com,-0.0114438419789,0.00923210186058,-0.000332659887795,-0.00357075698976,0.0194133595538
Chicago Tribune,0.0,0.011507686305562,chicagotribune.com,-0.00487815404818,0.0062502057793,0.00472616298604,-0.00370269426842,-0.00354255787188
City Journal,0.566666666666667,0.002719928640919,city-journal.org,-0.000318806368726,0.00147728337907,0.000218460777,-0.000500262448403,-0.00112420748062
Time,-0.1,-0.01921486123282,time.com,-0.0206799675285,-0.00430661260867,-0.00335205354211,-0.00167995286891,-0.0152016073966
National Enquirer,0.533333333333333,-0.008120760725041,nationalenquirer.com,-0.00279469690892,-0.0018201000833,-0.00761346294708,0.00713945342214,-0.00165965873961
AlterNet,-0.633333333333333,-0.029834727529704,alternet.org,-0.0066
#!/usr/bin/env python
"""
xlsx2tsv filename.xlsx [sheet number or name]
Parse a .xlsx (Excel OOXML, which is not OpenOffice) into tab-separated values.
If it has multiple sheets, need to give a sheet number or name.
Outputs honest-to-goodness tsv, no quoting or embedded \\n\\r\\t.
One reason I wrote this is because Mac Excel 2008 export to csv or tsv messes
up encodings, converting everything to something that's not utf8 (macroman
#!/usr/bin/env python
""" set operations on files as lists. symlink this as:
* setdiff [-c] <set1> <set2> - set difference
* setand [-c] <set1> <set2> - set intersection
* setor [-c] <set1> <set2> - set union
-c means: give count of the result
Output order is randomish
We don't newline chomp, so it's a bug if your file doesn't end with a newline
Dash - for stdin (e.g. cut/awk/sed/grep)
Though in zsh, =(bla bla) syntax is superior: can do 2 pipeline inputs
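
The three operations described in the docstring above can be sketched in a few lines of Python 3. This is a minimal illustration and not the gist's own code: the helper names `line_set` and `set_op` are hypothetical, and unlike the original it strips trailing newlines, so a final line without a newline still matches.

```python
def line_set(lines):
    """Turn an iterable of lines into a set, dropping trailing newlines."""
    return {line.rstrip("\r\n") for line in lines}


def set_op(op, lines1, lines2):
    """Apply a set operation, named after the symlinks in the docstring."""
    a, b = line_set(lines1), line_set(lines2)
    if op == "setdiff":
        return a - b      # lines only in the first input
    if op == "setand":
        return a & b      # lines in both inputs
    if op == "setor":
        return a | b      # lines in either input
    raise ValueError("unknown operation: %r" % op)


if __name__ == "__main__":
    first = ["a\n", "b\n", "c"]   # last line has no trailing newline
    second = ["b\n", "c\n"]
    # Sets are unordered, so sort before printing for stable output.
    print(sorted(set_op("setdiff", first, second)))
```

Any iterable of lines works as input, including an open file or `sys.stdin`, which is how the `-` convention could be supported.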
#!/usr/bin/env python
""" sorts lines (or tab-sep records) by md5. (e.g. for train/test splits).
optionally prepends with the md5 id too.
brendan o'connor - anyall.org - gist.github.com/brendano """
import hashlib,sys,optparse
p = optparse.OptionParser()
p.add_option('-k', type='int', default=False)
p.add_option('-p', action='store_true')
opts,args=p.parse_args()
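
The rest of the script is not shown here, so the following is only a sketch of the idea in Python 3, under two assumptions that the preview does not confirm: that `-k` picks a 1-based tab-separated field to hash, and that `-p` prepends the digest as a leading tab-separated column. The function name `md5_sort` is hypothetical.

```python
import hashlib


def md5_sort(lines, key_field=None, prepend=False):
    """Sort lines by the MD5 hex digest of the line or of one field.

    key_field: assumed meaning of -k, a 1-based tab-separated field.
    prepend:   assumed meaning of -p, put the digest in front of each line.
    """
    def digest(text):
        if key_field is not None:
            text = text.split("\t")[key_field - 1]
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    stripped = [line.rstrip("\n") for line in lines]
    keyed = sorted((digest(text), text) for text in stripped)
    if prepend:
        return ["%s\t%s" % (d, text) for d, text in keyed]
    return [text for _, text in keyed]


if __name__ == "__main__":
    for row in md5_sort(["b\n", "a\n", "c\n"], prepend=True):
        print(row)
```

Because the order depends only on the content of each line, the same record always lands in the same position relative to the others, which is what makes this useful for reproducible train/test splits.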
"""ajaxgoogle.py - Simple bindings to the AJAX Google Search API
(Just the JSON-over-HTTP bit of it, nothing to do with AJAX per se)
http://code.google.com/apis/ajaxsearch/documentation/reference.html#_intro_fonje
brendan o'connor - gist.github.com/28405 - anyall.org"""
try:
    import json
except ImportError:
    import simplejson as json
import urllib, urllib2
# Pipe-oriented I/O in Python. This is harder than it should be.
import os, sys
# (1) Kill stdout buffering. Makes redirects and tee easier to use.
if "<fdopen>" not in str(sys.stdout): sys.stdout = os.fdopen(1,'w',0)
# (2) Encoding madness. Note codecs.open() isn't available to us since we're using pipes.
import codecs
sys.stdout = codecs.EncodedFile(sys.stdout,'utf-8','utf-8','ignore')
# or this too .. sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
# I'm interested in safely handling potentially garbled input data, so want to protect stdin.
# You'd think this would work:
# Load the MNIST digit recognition dataset into R
# http://yann.lecun.com/exdb/mnist/
# assume you have all 4 files and gunzip'd them
# creates train$n, train$x, train$y and test$n, test$x, test$y
# e.g. train$x is a 60000 x 784 matrix, each row is one digit (28x28)
# call: show_digit(train$x[5,]) to see a digit.
# brendan o'connor - gist.github.com/39760 - anyall.org
load_mnist <- function() {
  load_image_file <- function(filename) {
CSV from PostgreSQL, at least as far as I can tell. I'm sure it messes up embedded quotes and maybe embedded commas.
psql.csv() { psql -qAF , "$@" | egrep -v '^\([0-9]+ rows?\)$'; }
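
The quoting problem mentioned above can be avoided by letting a CSV library do the escaping. The sketch below is Python 3 and makes its own assumptions: that psql is run with a tab as the unaligned field separator instead of a comma, and that no value contains a tab or newline. The function name `unaligned_to_csv` is hypothetical.

```python
import csv
import io
import re

# Footer that psql prints after the data, e.g. "(3 rows)" or "(1 row)".
ROW_COUNT = re.compile(r"^\(\d+ rows?\)$")


def unaligned_to_csv(lines, sep="\t"):
    """Re-emit separator-delimited psql output as properly quoted CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\n")
        if ROW_COUNT.match(line):
            continue  # skip the row-count footer
        writer.writerow(line.split(sep))
    return out.getvalue()


if __name__ == "__main__":
    sample = ["name\tnote\n", 'a,b\tsay "hi"\n', "(1 row)\n"]
    print(unaligned_to_csv(sample), end="")
```

Fields containing commas or double quotes come out wrapped in quotes with inner quotes doubled, which is the convention most CSV readers expect.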
#!/bin/bash
export TAB=$(echo -e "\t")
exec sort "-t$TAB" "$@"