François Briatte (briatte)
briatte / porngram.r
Last active August 29, 2015 13:55
ggplot2 wrapper for http://porngram.sexualitics.org/ (uses elements from ngramr)
porngram <- function(x = c("hardcore", "softcore"), ..., adjust = "xxx") {
  library(ggplot2)
  library(XML)
  library(reshape)
  library(rPython)
  x = c(x, ...)
  if (length(x) > 10) {
    x <- x[1:10]
    warning("Porngram API limit: only using first 10 phrases.")
briatte / README.md
Last active August 29, 2015 13:56
aggregation functions, test #1: base, plyr, dplyr

Collapsing a 4-column data frame of real data from 500,000 rows to 91,000 rows by pasting and counting row values. Execution on a 1.8GHz Intel Core i5 shows that dplyr is about 1.5 times faster than base R.

See this Gist for a simpler test on over twice as many rows and roughly as many groups. In both tests, dplyr is as concise as plyr, as fast as data.table, and clearly more readable than base R.
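The comparison above can be sketched on toy data (the column names and sizes here are hypothetical stand-ins; the Gist itself uses real data):

```r
library(dplyr)

# toy stand-in for the real 500,000-row, 4-column data frame:
# two grouping columns with repeated values
set.seed(1)
d <- data.frame(a = sample(letters[1:3], 100, replace = TRUE),
                b = sample(letters[1:3], 100, replace = TRUE))

# base R: paste the grouping columns together, then count with table()
base_counts <- as.data.frame(table(paste(d$a, d$b)))

# dplyr: group and count in one readable call
dplyr_counts <- count(d, a, b)
```

Both approaches yield one row per observed group; the dplyr version stays readable as the number of grouping columns grows.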

briatte / README.md
Last active August 29, 2015 13:57
aggregation functions, test #2: base, dplyr, data.table

Here's a simple timing test of aggregation functions in R, using 1.3 million rows and 80,000 groups of real data on a 1.8GHz Intel Core i5. Thanks to Arun Srinivasan for helpful comments.

The fastest function to run through the data.frame benchmark is data.table, which runs twice as fast as dplyr, which in turn runs ten times faster than base R.

For a benchmark that includes plyr, see this earlier Gist for a computationally more intensive test on half a million rows, where dplyr still runs 1.5 times faster than aggregate in base R.

Both tests confirm what W. Andrew Barr blogged about dplyr:

the 2 most important improvements in dplyr are …
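The data.table side of the benchmark can be sketched on toy data (group counts and column names here are hypothetical, not the Gist's real 1.3-million-row data):

```r
library(data.table)

# toy stand-in: 1,000 rows across 5 groups
set.seed(2)
dt <- data.table(g = sample(1:5, 1000, replace = TRUE),
                 x = runif(1000))

# data.table: aggregate inside [ ], with by = for grouping
res_dt <- dt[, .(n = .N, mean_x = mean(x)), by = g]

# base R equivalent via aggregate(), typically much slower at scale
res_base <- aggregate(x ~ g, data = as.data.frame(dt), FUN = mean)
```

The `dt[i, j, by]` form is what gives data.table its speed: grouping and aggregation happen in one pass over the data, without the intermediate copies that `aggregate()` makes.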

briatte / .bash_profile
Last active August 17, 2017 00:07
.bash_profile
# ==============================================================================
# TWEAKS
# ==============================================================================
# use nano instead of vi as default editor
# (e.g. for crontab -e)
#
export EDITOR="nano"
# tell ls to be colourful
briatte / full.r
Last active October 7, 2016 16:11
100-line scraper for plenary statements by Members of the European Parliament — see briatte/euspeech for the full project
library(XML)
library(jsonlite)
library(plyr)
dir.create("records")
data = "meps.csv"
if (!file.exists(data)) {
  html = "http://www.europarl.europa.eu/meps/en/directory.html?filter=all&leg="
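The preview cuts off before the directory page is parsed. The likely next step is to extract one link per MEP with XPath; the sketch below runs on a minimal, hypothetical stand-in for the page markup rather than the live europarl.europa.eu directory:

```r
library(XML)

# hypothetical stand-in for the MEP directory page: one <a> per MEP
page <- '<html><body>
<a href="/meps/en/1234">MEP One</a>
<a href="/meps/en/5678">MEP Two</a>
</body></html>'

h <- htmlParse(page, asText = TRUE)

# collect the per-MEP links and rebuild absolute URLs
urls <- xpathSApply(h, "//a/@href")
urls <- paste0("http://www.europarl.europa.eu", urls)
```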

Improving access to panel series data for social scientists: the psData package

GitHub repository: https://github.com/rOpenGov/psData

Social scientists have access to many electronically available panel series datasets. However, downloading, cleaning, and merging them together is time-consuming and error-prone: for example, using Reinhart and Rogoff's data on the fiscal costs of the financial crisis involves downloading, cleaning, and merging 4 Excel files with over 70 individual sheets, one for each country’s data. Furthermore, because such datasets are not bundled in a format that is easy to manipulate, many of them are not updated on a regular basis.

In this talk, we introduce the psData package for the R statistical software. This package is being developed under the rOpenGov framework to solve two problems:

  1. Time wasted by social scientists downloading, cleaning, and transforming common …
briatte / pubmed_ask.r
Last active October 19, 2020 13:14
pubmed scraper
#' Get a PubMed search index
#' @param query a PubMed search string
#' @return the XML declaration of the search
#' @examples
#' # Which articles discuss the WHO FCTC?
#' pubmed_ask("FCTC OR 'Framework Convention on Tobacco Control'")
pubmed_ask <- function(query) {
  # change spaces to + and single quotes to URL-friendly %22 in the query
  query = gsub("'", "%22", gsub(" ", "+", query))
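The preview ends before the function queries PubMed. A plausible next step, given the escaping above, is to build an NCBI E-utilities `esearch` URL; the function name and `retmax` default below are assumptions for illustration, not the Gist's actual code:

```r
# hypothetical sketch: build an esearch URL from the cleaned query string,
# following the documented NCBI E-utilities pattern
pubmed_search_url <- function(query, retmax = 100) {
  # same escaping as in pubmed_ask(): spaces to +, single quotes to %22
  query <- gsub("'", "%22", gsub(" ", "+", query))
  paste0("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
         "?db=pubmed&term=", query, "&retmax=", retmax)
}

u <- pubmed_search_url("FCTC OR 'Framework Convention on Tobacco Control'")
```

The resulting URL can then be fetched and parsed as XML to retrieve the article identifiers matching the search.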
briatte / declarations.r
Created July 24, 2014 12:29
download all asset declarations from French MPs, July 2014
# parse XPath syntax from well-formed HTML
library(XML)
# complete archive will take ~ 1.4 GB on disk
dir.create("declarations", showWarnings = FALSE)
# finds 941 MPs on 2014-07-24 at website launch
h = htmlParse("http://www.hatvp.fr/consulter-les-declarations-rechercher.html")
h = paste0("http://www.hatvp.fr/", xpathSApply(h, "//div[@id='annuaire']/*/*/*/a/@href"))
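With the list of declaration URLs in hand, the likely next step is a download loop that skips files already on disk. The URLs below are hypothetical placeholders for the scraped vector `h`, and the actual network call is left commented out:

```r
# hypothetical declaration URLs standing in for the scraped vector `h`
h <- c("http://www.hatvp.fr/declaration-1.html",
       "http://www.hatvp.fr/declaration-2.html")

# local target paths inside the "declarations" folder created earlier
files <- file.path("declarations", basename(h))

# download only what is not already on disk
to_get <- h[!file.exists(files)]
# for (i in seq_along(to_get))
#   download.file(to_get[i], file.path("declarations", basename(to_get[i])))
```

Skipping existing files makes the scrape resumable, which matters for an archive of roughly 1.4 GB.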
briatte / icelandic-legal-code-network.md
Last active August 29, 2015 18:37
network projection of cross-references in the Icelandic legal code
briatte / fix.r
Created October 2, 2014 04:49
fix R locale on OS X
system("defaults write org.R-project.R force.LANG en_US.UTF-8")
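The `defaults write` call above sets a macOS preference read by R.app at startup; after restarting R, the effect can be checked from within R itself (this check is an addition for illustration, not part of the Gist):

```r
# inspect the character-handling locale currently in effect
loc <- Sys.getlocale("LC_CTYPE")

# a UTF-8 locale avoids the character-encoding warnings the fix targets
utf8_ok <- grepl("UTF-8", loc)
```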