An idea that I couldn't manage to express within Twitter's character limit:
Train two word2vec models on the same corpus, 100 dimensions apiece: one with window size 5 and one with window size 15 (say).
Now you have two 100-dimensional vector spaces with the same words in each.
That's the same as one 200-dimensional vector space: for each word, you just concatenate its two vectors end to end.
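A minimal sketch of that construction, assuming gensim's Word2Vec (parameter names as in gensim 4.x; the toy corpus and the `combined` dictionary are just stand-ins):

```python
import numpy as np
from gensim.models import Word2Vec

# Any tokenized corpus will do; this toy one is only a stand-in.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
] * 100

# Two models on the same corpus, differing only in window size.
narrow = Word2Vec(corpus, vector_size=100, window=5, min_count=1, seed=1)
wide = Word2Vec(corpus, vector_size=100, window=15, min_count=1, seed=1)

# Concatenate each word's two 100-d vectors into one 200-d vector.
vocab = [w for w in narrow.wv.index_to_key if w in wide.wv.key_to_index]
combined = {w: np.concatenate([narrow.wv[w], wide.wv[w]]) for w in vocab}

print(combined["cat"].shape)  # (200,)
```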
That vector space has all the information from each of the original models in it: you can use plain linear algebra to project it back down onto either of the original 100-dimensional subspaces.
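Concretely, that projection just keeps one block of coordinates, which is the same as multiplying by a selection matrix. Continuing the hypothetical `combined` dictionary from the sketch above:

```python
# Projecting the 200-d combined vector back to either original space
# is a linear map: keep the first 100 coordinates or the last 100.
v = combined["cat"]          # 200-d combined vector
narrow_again = v[:100]       # identical to narrow.wv["cat"]
wide_again = v[100:]         # identical to wide.wv["cat"]

# The same projection written as an explicit 100x200 matrix.
P_narrow = np.hstack([np.eye(100), np.zeros((100, 100))])
assert np.allclose(P_narrow @ v, narrow_again)
```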