mutedial gregdl

There are packages for this now!

2017-08-03: Since I wrote this in 2014, the universe, specifically Kirill Müller (https://github.com/krlmlr), has provided better solutions to this problem. I now recommend that you use one of these two packages:

rprojroot: This is the main package with functions to help you express paths in a way that will "just work" when developing interactively in an RStudio Project and when you render your file.
here: A lightweight wrapper around rprojroot that anticipates the most likely scenario: you want to write paths relative to the top-level directory, defined as an RStudio project or Git repo. TRY THIS FIRST.
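
The idea both packages share, finding the project's top-level directory and building every path from there, can be sketched in a few lines. In R the call itself would look something like `here::here("data", "raw.csv")`. The Python below is only an analogue of that idea, not how rprojroot or here are implemented; `find_project_root` and `project_path` are hypothetical names, and treating a `.git` entry or a `*.Rproj` file as the root marker is an assumption based on the description above.

```python
from pathlib import Path


def is_project_root(directory):
    """Assumed root markers: a .git entry or an RStudio *.Rproj file."""
    return (directory / ".git").exists() or any(directory.glob("*.Rproj"))


def find_project_root(start="."):
    """Walk upward from `start` until a directory with a root marker is found."""
    start = Path(start).resolve()
    for directory in (start, *start.parents):
        if is_project_root(directory):
            return directory
    raise FileNotFoundError(f"no project root found above {start}")


def project_path(*parts, start="."):
    """Build a path relative to the project root, wherever we were called from."""
    return find_project_root(start).joinpath(*parts)
```

Calling `project_path("data", "raw.csv")` from any subdirectory of the project gives the same absolute path, which is the property that makes a script behave the same interactively and when rendered.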

I love these packages so much I wrote an ode to here.

I use these packages now instead of what I describe below. I'll leave this gist up for historical interest. 😆

	import nltk

	with open('sample.txt', 'r') as f:
	    sample = f.read()

	sentences = nltk.sent_tokenize(sample)
	tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
	tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
	chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)  # NLTK 3 name for batch_ne_chunk

	# Requirements
	#sudo apt-get install libcurl4-gnutls-dev # for RCurl on linux
	#install.packages('RCurl')
	#install.packages('RJSONIO')

	library('RCurl')
	library('RJSONIO')

	query <- function(querystring) {
	  h = basicTextGatherer()

	## An implementation of Gibbs sampling for topic models for the
	## example in section 4 of Steyvers and Griffiths (2007):
	## http://cocosci.berkeley.edu/tom/papers/SteyversGriffiths.pdf
	##
	## Author: Jason Baldridge

	# Functions to parse the input data
	words.to.indices = data.frame(row.names=c("r","s","b","m","l"),1:5)
	mysplit = function(x) { strsplit(x,"")[[1]] }
	word.vector = function(x) { words.to.indices[mysplit(x),] }

	from sussex_nltk import untag_sequence, extract_by_pos

	all_tags = r".+"
	all_nouns = r"N+"
	all_verbs = r"V+"
	all_adjectives = r"J+"

	example_tagged_words = [('The', 'DT'), ('little', 'JJ'), ('badgers', 'NNS'), ('ate', 'VBP'), ('some', 'DT'), ('jam', 'NN')]

	#Decide on some patterns to match
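
The preview above stops before the patterns are used, and `sussex_nltk`'s `extract_by_pos` is not shown, so the following is only a guess at the kind of filtering those patterns are meant for. It is a plain-Python sketch: `filter_by_tag` is a hypothetical helper, not the library's function, and it assumes each pattern is matched against the start of the POS tag.

```python
import re


def filter_by_tag(tagged_words, tag_pattern):
    """Return the words whose POS tag matches `tag_pattern` at the start of the tag."""
    pattern = re.compile(tag_pattern)
    return [word for word, tag in tagged_words if pattern.match(tag)]


example_tagged_words = [('The', 'DT'), ('little', 'JJ'), ('badgers', 'NNS'),
                        ('ate', 'VBP'), ('some', 'DT'), ('jam', 'NN')]

nouns = filter_by_tag(example_tagged_words, r"N+")       # ['badgers', 'jam']
verbs = filter_by_tag(example_tagged_words, r"V+")       # ['ate']
adjectives = filter_by_tag(example_tagged_words, r"J+")  # ['little']
```

Because Penn Treebank tags share a leading letter within a word class (`NN`, `NNS`, `NNP`, and so on), a pattern on the first letter is enough to pick out a whole class.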

	#Tags parts of speech (POS) and named entities (NER, Named Entity Recognition) in a supplied text file.
	#Date: Nov 2 2012
	#Author: Hota Sobhan

	import nltk

	with open(r'C:\Python27\Test_File.txt') as f:  # raw string keeps the backslashes literal
	    data = f.readlines()

	#Parse the text file for NER with POS Tagging

	# Set working directory
	dir <- "C:\\" # adjust to suit
	setwd(dir)

	# configure variables and filenames for MALLET
	## here using MALLET's built-in example data and
	## variables from http://programminghistorian.org/lessons/topic-modeling-and-mallet

	# folder containing txt files for MALLET to work on
	importdir <- "C:\\mallet-2.0.7\\sample-data\\web\\en"

	# coding=UTF-8
	import nltk
	from nltk.corpus import brown

	# This is a fast and simple noun phrase extractor (based on NLTK)
	# Feel free to use it, just keep a link back to this post
	# http://thetokenizer.com/2013/05/09/efficient-way-to-extract-the-main-topics-of-a-sentence/
	# Created by Shlomi Babluki
	# May, 2013

	# sources:
	# http://www.jgoodwin.net/?p=1223
	# http://orgtheory.wordpress.com/2012/05/16/the-fragile-network-of-econ-soc-readings/
	# http://nealcaren.web.unc.edu/a-sociology-citation-network/
	# http://kieranhealy.org/blog/archives/2014/11/15/top-ten-by-decade/
	# http://www.jgoodwin.net/lit-cites.png


	###########################################################################
	# This first section scrapes content from the Web of Science webpage. It takes

	# Serrano, Boguna, Vespignani backbone extractor
	# from http://www.pnas.org/content/106/16/6483.abstract
	# Thanks to Michael Conover and Qian Zhang at Indiana for their help with earlier versions
	# Thanks to Clay Davis for pointing out an error

	import networkx as nx
	import numpy as np

	def extract_backbone(g, weight='weight', alpha=.05):
	    backbone_graph = nx.Graph()
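
The preview of `extract_backbone` is cut off here, so its body is not reproduced. As a rough illustration of the disparity filter from the Serrano, Boguna and Vespignani paper cited above, here is a dependency-free sketch. It is not the gist's code: `disparity_backbone` and its edge-dictionary input are hypothetical. It assumes an undirected weighted graph with sortable node labels, and it keeps an edge when it is significant for at least one of its endpoints.

```python
def disparity_backbone(edges, alpha=0.05):
    """Keep edges that pass the disparity filter for at least one endpoint.

    edges: dict mapping (u, v) tuples to positive weights (undirected).
    Returns a dict of kept edges keyed by the sorted (u, v) tuple.
    """
    # Build a weighted adjacency map from the edge list.
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w

    kept = {}
    for u, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            continue  # a degree-1 node cannot be tested from its own side
        strength = sum(nbrs.values())
        for v, w in nbrs.items():
            p = w / strength                  # share of u's strength on this edge
            alpha_uv = (1.0 - p) ** (k - 1)   # chance of a share this large under the null model
            if alpha_uv < alpha:
                kept[tuple(sorted((u, v)))] = w
    return kept
```

On a star whose hub has one edge of weight 100 and four edges of weight 1, only the heavy edge survives at `alpha=0.05`. Lowering `alpha` makes the backbone sparser.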