My plan was to collect data about the usage of the Red Hook mesh network, but I failed to collect the proper data!
There is a Unix utility called collectd that looks to be a great system-operator tool for collecting and storing data over time. collectd is installed on all the routers in the mesh network, but I was never able to get the iptables plugin to report bandwidth used for any collectd instance. Sigh. I even asked people for help and tried a hack to get something, but nothing worked.
When the time came to perform some data analysis, I knew that although I didn't have quite the right data, I was still collecting something, so I started looking into that. A plugin that reported data on the mesh routing algorithm was installed, but when I investigated, it turned out I had only two whole hours of said data. It looked like the collectd settings on the remote mesh device were reset at some point and it stopped forwarding data to the central repository. Thus I basically had no data. :(
So apparently collecting this data wasn't in the cards as a final project for this class. Collecting data properly is a huge undertaking! Lessons learned.
I decided to investigate the Guidestar data and questions.
The questions:
- Can we describe what makes an organization "defunct"? Can you build a classifier to identify defunct non-profits?
- Can we predict which non-profits are going to be defunct? It's one thing to know that a group is going defunct this year (or has already gone defunct) but that's often too late. Can we tell which organizations are going to be defunct in, say, 3 years?
- Are there hidden patterns in the data? Are there clusters of organizations that are more or less alike? Does the way they cluster tend to tell us anything about whether they're going to be defunct?
Starting out, I didn't narrow in on a particular question (they all seem related, any investigation will help answer any of the questions). I began by investigating a random sample of the data.
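For concreteness, a 10% random sample like the one analyzed below can be drawn with base R's sample function. This is only a sketch: the toy frame and column names here are stand-ins, since the post doesn't show the loading step.

```r
set.seed(42)  # make the draw reproducible
# toy stand-in for the full Guidestar data frame
data <- data.frame(Org_ID = 1:1000, Defunct = rep(c(FALSE, TRUE), 500))
# keep a random 10% of the rows
sample <- data[sample(nrow(data), size = 0.1 * nrow(data)), ]
nrow(sample)  # 100
```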
A majority of the organizations appear only once in the dataset, and about 90% appear at most twice:
> table(table(sample$Org_ID))
1 2 3 4 5
1558 662 177 40 3
About two-thirds of the entries are not marked as defunct.
> table(sample$Defunct)
FALSE TRUE
2377 1211
> org_id_freq <- as.data.frame(table(sample$Org_ID))
> names(org_id_freq) <- c("Org_ID", "freq")
> sample.freq <- merge(sample, org_id_freq)
> table(paste(sample.freq$Defunct, sample.freq$freq))
FALSE 1 FALSE 2 FALSE 3 FALSE 4 FALSE 5 TRUE 1 TRUE 2 TRUE 3 TRUE 4
1020 908 330 104 15 538 416 201 56
> plot(table(sample.freq[sample.freq$Defunct == FALSE,]$freq), col="black", type="l", ylab="number of nonprofits", xlab="years reported")
> lines(table(sample.freq[sample.freq$Defunct == TRUE,]$freq), type="l", col="red")
> legend("topright", col = c("black", "red"), legend = c("nonprofits", "defunct nonprofits"), lty=1, lwd=1)
> title("Reported Nonprofits That Went Defunct")
Are there any relationships between total revenue, expenses, assets, and liabilities?
Using GGobi, I brushed through the data and found what appeared to be a correlation between total revenue and total expenses. Indeed, this was confirmed by using the cor function to compute the correlation between the columns. This makes sense: the more money a nonprofit makes, the more it will likely spend.
> revenue.total.cor <- data.frame(category=c(), correlate=c())
> for (i in 1:33) {
+     col <- names(sample)[[i]]
+     revenue.total.cor <- rbind(revenue.total.cor,
+         data.frame(category=col, correlate=cor(sample[col], sample$revenue_total)))
+ }
...
expenses_total expenses_total 0.855104943
...
There weren't really any other correlations of note between these categories. Government grants and total revenue seemed to be correlated... but probably all types of revenue are positively correlated with total revenue.
A hunch I wanted to follow was whether a human element (more money spent on people and space) would help keep a nonprofit alive. Is there a relationship between mortgage or personnel expenses and whether or not a nonprofit is defunct? It turns out nothing correlated strongly with Defunct in two dimensions (nothing above roughly 0.2):
> defunct.cor <- data.frame(category=c(), correlate=c())
> for (i in 1:33) {
+     col <- names(sample)[[i]]
+     defunct.cor <- rbind(defunct.cor,
+         data.frame(category=col, correlate=cor(sample[col], sample$Defunct)))
+ }
> plot(defunct.cor)
Okay, so by this point I've decided that I want to tackle the following question:
- Can we describe what makes an organization "defunct"? Can you build a classifier to identify defunct non-profits?
With my sampled dataset (10% of the original size), I will build a classifier from all the columns of the data, starting with a single decision tree and then a random forest, and test it against the full dataset.
The first step was to remove the organization ID from the data frame (because it was the single best-correlated column!) and turn the Defunct column into a factor.
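Those two steps look roughly like this (a sketch; the tiny frame here stands in for the merged sample frame above):

```r
# toy stand-in for the sampled data frame
sample <- data.frame(Org_ID = c(10, 11, 12),
                     revenue_total = c(100, 250, 80),
                     Defunct = c(FALSE, TRUE, FALSE))
# drop the ID column so the model can't just memorize organizations
sample.no.id <- sample[, names(sample) != "Org_ID"]
# make Defunct a factor so tree methods treat this as classification
sample.no.id$Defunct <- as.factor(sample.no.id$Defunct)
```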
> library(rpart)
> fit <- rpart(Defunct ~ ., data=sample.no.id)
> plot(fit, uniform=TRUE, main="Classification Tree for Defunct Non-profits")
> text(fit, use.n=TRUE, all=TRUE, cex=.8)
It turns out this did work, and it also gave us a clue as to which columns the algorithm thinks contribute to predicting defunct status.
Variables actually used in tree construction:
[1] assets_cash_eoy assets_savings_eoy
[3] expenses_program_services net_assets_eoy
[5] revenue_sales_inventory
Indeed, looking at a random forest built from the same data, these appear among the most important variables.
> library(randomForest)
> sample.rf <- randomForest(Defunct ~ ., data=sample.no.id, ntree=1000, importance=TRUE, proximity=TRUE)
> imp <- as.data.frame(importance(sample.rf))
> tail(imp[order(imp$MeanDecreaseAccuracy),c(3,4)])
MeanDecreaseAccuracy MeanDecreaseGini
assets_savings_eoy 32.25198 66.49494
expenses_total 33.10249 76.00178
expenses_program_services 33.72821 77.65360
assets_total_eoy 37.40280 83.23952
net_assets_eoy 38.06725 91.61521
revenue_sales_inventory 66.17169 65.37851
Let's use multidimensional scaling to look at the random forest results. This gives us a visual idea of how well the classifier separates the two classes.
> MDSplot(sample.rf, sample.no.id$Defunct)
As we can see, there do seem to be two distinct groups, but they also overlap. According to the random forest output, the classifier isn't the greatest: its out-of-bag (OOB) error rate is 27.15%.
OOB estimate of error rate: 27.15%
Confusion matrix:
FALSE TRUE class.error
FALSE 2196 181 0.0761464
TRUE 793 418 0.6548307
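The class.error column is just each row's off-diagonal count divided by that row's total, which we can check by hand:

```r
# FALSE row: 181 non-defunct orgs misclassified out of 2377
181 / (2196 + 181)   # ~0.0761
# TRUE row: 793 defunct orgs the forest missed out of 1211
793 / (793 + 418)    # ~0.6548
```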
Now let's go back and test our classifier.
> data <- read.csv("defunct.csv")
> data$predict <- predict(sample.rf, data)
> table(data[,c('predict', 'Defunct')])
Defunct
predict FALSE TRUE class.error
FALSE 22542 7085 0.2391400
TRUE 1373 5395 0.2028664
As we know from before, the OOB error was 27.15%. How does this compare to what we just did?
> dim(data)
[1] 36395 34
> (22542+5395)/36395
[1] 0.7676054
> 1-((22542+5395)/36395)
[1] 0.2323946
It's a few points less at 23.24%. Most of those errors were false negatives (19.47% of all rows) and a few were false positives (3.77%).
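Those percentages are just the off-diagonal counts from the prediction table divided by the total row count:

```r
7085 / 36395            # false negatives: predicted FALSE, actually defunct (~19.47%)
1373 / 36395            # false positives: predicted TRUE, not defunct (~3.77%)
(7085 + 1373) / 36395   # together, the 23.24% test error
```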
We have a classifier that is 76.76% accurate. This is... what it is. Not the best, not the worst. It could be useful to give back to Guidestar, especially to tell them which columns in their data appear most strongly tied to whether or not a non-profit goes defunct. (download the classifier here)
Interestingly, non-profits that have sales revenue from inventory appear to pretty much be okay. Assets appear to be the second most important indicator, and having higher expenses appears to be a good sign.
Lots more work could be done in improving the classifier. Perhaps building a classifier for each year would be a good idea to see if annual models outperform overall models.
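A per-year version could be sketched like this. Note the year column and the toy frame are hypothetical stand-ins; the post never names a year column in the real data.

```r
library(randomForest)
set.seed(1)
# toy stand-in: two reporting years, a couple of numeric columns
data <- data.frame(year = rep(c(2005, 2006), each = 30),
                   revenue_total = rnorm(60),
                   expenses_total = rnorm(60),
                   Defunct = factor(rep(c(FALSE, TRUE), 30)))
# fit one forest per reporting year
yearly.models <- lapply(split(data, data$year), function(d) {
  randomForest(Defunct ~ . - year, data = d)
})
length(yearly.models)  # 2
```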