Jonathan Stray jstray

@jstray
jstray / gist:fe34b6b7079c6bf15dc1
Last active April 26, 2016 20:05
Threat Modeling: planning digital security for your story

Journalism can be a high-risk activity, and some stories are a lot riskier than others. In part one we covered the digital security precautions that every journalist should take. If one of your colleagues uses weak passwords or clicks on a phishing link, more sophisticated efforts are wasted. But assuming that everyone you are working with is already up to speed on basic computer security practice, there's a lot more you can do to provide security for a specific, sensitive story.

This work begins with thinking through what it is you have to protect, and from whom. This is called threat modeling and is the first step in any security analysis. The goal is to construct a picture -- in some ways no more than an educated guess -- of what you're up against. There are many ways to do this, but this post is structured around four basic questions.

  • What do you want to keep private?
  • Who wants to know?
  • What can they do to fi

You got the documents. Now what?

[omg documents.png]

Congratulations! Your Freedom of Information request finally yielded a big brown envelope in the mail. You are the lucky recipient of a juicy leak. You've managed to scrape all the PDFs from that stone-age government portal. Now all you have to do is the reporting.

Would that it were so easy. Your next steps depend on what you've got and what you're trying to do. You might have one page or one million pages. You could be starting with a tall stack of paper or a CSV file or anything in between. Maybe you already know exactly what you're looking for, or maybe that anonymous tip was maddeningly non-specific. In the course of my work on the Overview document-mining software I've seen just about every problem that a journalist can have with a document-driven story. These are the tales of unreadable formats, heaps of paper, and late nights reading. This post is organized as a sort of flowchart, a series of questions you can ask

You got the documents. Now what?

[omg documents.png]

Congratulations! Your Freedom of Information request finally yielded a big brown envelope in the mail. You are the proud owner of a juicy leak. You've managed to scrape all the PDFs from that stone-age government portal. Now all you have to do is the reporting.

In the course of my work on the Overview document-mining software I've seen just about every problem that journalists can have with a document-driven story. These are the stories of unreadable formats, heaps of paper, and late nights reading.

When you're the proud owner of a brand new document dump, the next steps depend on what you've got and what you're trying to do. You might have one page or one million pages. You could be starting with a tall stack of paper or a CSV file or anything in between. Maybe you already know exactly what you're looking for, or maybe that anonymous tip was maddeningly non-specific. This post is organized as a sort of flowchart, a seri

jstray / what-to-do-with-documents.md
Last active March 10, 2019 22:38
You've got the documents, now what?

You got the documents. Now what?

[omg documents.png]

Congratulations! Your Freedom of Information request finally yielded a big brown envelope in the mail. You are the proud owner of a juicy leak. You've managed to scrape all the PDFs from that stone-age open government portal. Now all you have to do is report.

In the course of my work on the Overview document mining software I've seen just about every problem that journalists can have with a document-driven story. These are the stories of unreadable formats, heaps of paper, and late nights reading.

When you're the proud owner of a brand new document dump, the next steps depend on what you've got and what you're trying to do. You might have one page or one million pages. You could be starting with a tall stack of paper or a CSV file or anything in between. Maybe you already know exactly what you're looking for, or maybe that anonymous tip was so non-specific you don't know where to start. This post is organized as a sort o

jstray / gist:6003431
Last active December 19, 2015 19:08
Drawing conclusions from data

The job of a data journalist is to turn data into a story. If you start with a spreadsheet of cancer rates, the story might be "people living near oil refineries had three times the rate of lung cancer." Or it might not be, because you could be misinterpreting the data in some way. This recorded talk is about how not to get fooled when you go looking for stories in your data.

<iframe width="420" height="315" src="//www.youtube.com/embed/3NuyRKNkBQg" frameborder="0" allowfullscreen></iframe>

This lecture was given as part of the 15th Annual Science Immersion Workshop for Journalists at the Metcalf Institute for Marine & Environmental Reporting, Rhode Island. The slides are here, and the GitHub repo with all the R code needed to reproduce the examples in the talk is here.

### Interpreting data

A data journalism story is usually ab

jstray / gist:3741305
Created September 18, 2012 04:45
Apply multi-dimensional scaling to House of Lords voting history distance matrix, and plot
# -------------------------------- MDS plot ------------------------------
fit <- cmdscale(d, eig=TRUE, k=2) # k is the number of dimensions
x <- fit$points[,1]
y <- fit$points[,2]
# plot with colors corresponding to party
parties = factor(row.names(recentvotes))
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="House of Lords voting", pch=19, col=parties)
# legend colors and symbol match the factor codes and pch used in plot()
legend('topright', legend = levels(parties), col = seq_along(levels(parties)), cex = 0.8, pch = 19)
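
To see what `cmdscale` does with a distance matrix, here is a minimal sketch on toy data. The four points, and the names `pts`, `d_toy`, `fit_toy`, `coords` and `max_error`, are hypothetical stand-ins and not part of the original script, which builds `d` from the voting records.

```r
# Four points on a unit square stand in for members (hypothetical data).
pts <- matrix(c(0, 0,
                1, 0,
                0, 1,
                1, 1), ncol = 2, byrow = TRUE)
d_toy <- dist(pts)                       # Euclidean distances between rows

# Classical MDS: find 2-D coordinates whose distances approximate d_toy.
fit_toy <- cmdscale(d_toy, eig = TRUE, k = 2)
coords <- fit_toy$points                 # one row per point, one column per dimension

# The layout is only determined up to rotation and reflection, so compare
# pairwise distances rather than raw coordinates.
recovered <- dist(coords)
max_error <- max(abs(as.vector(recovered) - as.vector(d_toy)))
```

Because the toy points really do live in two dimensions, the recovered distances should match the inputs almost exactly. Real voting distances generally will not embed perfectly in two dimensions, so the plot is an approximation.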
jstray / gist:3741280
Created September 18, 2012 04:36
Distance function computation for UK House of Lords voting analysis
# -------------------------------- Compute distances ------------------------------
# distance function = 1 - fraction of votes where both voted, and both voted the same
votedist <- function(v1, v2) {
  overlap = v1!=0 & v2!=0
  numoverlap = sum(overlap)
  match = overlap & v1==v2
  nummatch = sum(match)
  if (!numoverlap)
    dist = 1
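
The preview above stops after the no-overlap case. A minimal, self-contained sketch of the distance its header comment describes might look like the following. The function name `votedist_sketch`, the toy vectors, and the vote coding (0 for no vote, 1 and -1 for the two sides) are assumptions for illustration, not taken from the original script.

```r
# Sketch: distance = 1 - (matching votes / votes where both members voted).
# Assumes votes are coded 0 = did not vote, non-zero = a cast vote.
votedist_sketch <- function(v1, v2) {
  overlap <- v1 != 0 & v2 != 0          # divisions where both voted
  numoverlap <- sum(overlap)
  if (numoverlap == 0) {
    return(1)                           # no shared votes: treat as maximally distant
  }
  nummatch <- sum(overlap & v1 == v2)   # shared votes cast the same way
  1 - nummatch / numoverlap
}

# Two hypothetical members: both voted on divisions 1, 2 and 4, agreeing only on 1.
a <- c(1, -1, 0, 1)
b <- c(1, 1, 0, -1)
d_ab <- votedist_sketch(a, b)           # 1 - 1/3
```

Returning 1 when two members never voted on the same division keeps them far apart in the plot without implying they disagreed.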
jstray / gist:3741258
Created September 18, 2012 04:26
Get recent votes in UK House of Lords data
# -------------------------------- Take recent votes ------------------------------
# take only N most recent votes
Nvotes=100
recentvotes = votes[,1:Nvotes]
# set each member's row name to their party name
row.names(recentvotes) = lords[,"party"]
# remove all members who didn't vote at all in these recent votes
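
The preview ends before the filtering step itself. One way it could be done is sketched below, assuming that a 0 entry means "did not vote" (the same coding the distance function relies on). The matrix `toyvotes` and the names `voted` and `active` are hypothetical.

```r
# Toy matrix: rows are members, columns are votes; 0 = did not vote (assumed coding).
toyvotes <- matrix(c( 1, -1,  0,
                      0,  0,  0,
                     -1,  0,  1),
                   nrow = 3, byrow = TRUE)
row.names(toyvotes) <- c("Con", "Lab", "LD")

# Keep only rows with at least one non-zero entry.
voted <- rowSums(toyvotes != 0) > 0
active <- toyvotes[voted, , drop = FALSE]   # drop = FALSE keeps a matrix even if one row remains
```

Here the second row is all zeros, so it is dropped and `active` keeps the other two.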
jstray / gist:3741250
Created September 18, 2012 04:21
Load in the UK House of Lords data
library(proxy) # need custom distance function capability
# -------------------------------- Load data ------------------------------
# Load in vote history
# strip out vote description, date, etc, and transpose so each row is a member
votetable = read.csv("votematrix-lords.csv", header=TRUE, sep=",")
votes = votetable[, 5:1047]
votes = t(votes)