Data Without Borders Final

Red Hook Mesh Network

My plan was to collect data about the usage of the Red Hook mesh network, but I failed to collect the proper data!

There is a Unix utility called collectd that looks to be a great system-operator tool for collecting and storing data over time. collectd is installed on all the routers in the mesh network; however, I was never able to get its iptables plugin to report bandwidth used on any collectd instance. Sigh. I even asked people for help and tried a hack to get something, but nothing worked.

When the time came to perform some data analysis, I knew that although I didn't have quite the right data, I had still been collecting data, so I started looking into that. A plugin that reported data on the mesh routing algorithm was installed, but when I investigated, it turned out I had only two whole hours of that data. It looks like some of the collectd settings on the remote mesh devices were reset at some point and stopped forwarding data to the central repository. Thus I basically had no data. :(

So apparently collecting this data wasn't in the cards as a final project for this class. Collecting data properly is huge! Lessons learned.

Moving On: Guidestar

I decided to investigate the Guidestar data and questions.

The questions:

  • Can we describe what makes an organization "defunct"? Can you build a classifier to identify defunct non-profits?
  • Can we predict which non-profits are going to be defunct? It's one thing to know that a group is going defunct this year (or has already gone defunct) but that's often too late. Can we tell which organizations are going to be defunct in, say, 3 years?
  • Are there hidden patterns in the data? Are there clusters of organizations that are more or less alike? Does the way they cluster tend to tell us anything about whether they're going to be defunct?

Starting out, I didn't narrow down to a particular question (they all seem related; any investigation would help answer the others). I began by investigating a random sample of the data, as sketched below.
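For reference, here's a minimal sketch of how such a 10% sample might be drawn, assuming the full dataset is the defunct.csv file used later in this writeup (the seed is arbitrary):

# Draw a 10% random sample of the full Guidestar extract
data <- read.csv("defunct.csv")
set.seed(1)                        # arbitrary, just for reproducibility
rows <- sample.int(nrow(data), round(0.1 * nrow(data)))
sample <- data[rows, ]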

1D analysis

Histogram of organization IDs

[Figure: OrgIDRepeaters]

A majority of the organizations appear only once in the dataset, and about 90% appear twice or fewer times:

> table(table(sample$Org_ID))
   1    2    3    4    5 
1558  662  177   40    3 
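The histogram above (OrgIDRepeaters) can be drawn straight from this frequency-of-frequencies table; one way to do it:

# Plot how many organizations appear 1, 2, ..., 5 times in the sample
plot(table(table(sample$Org_ID)),
     xlab = "times an Org_ID appears", ylab = "number of organizations",
     main = "Histogram of organization IDs")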

About two-thirds of the entries are not marked as defunct.

> table(sample$Defunct)
FALSE  TRUE 
 2377  1211 
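To get the two-thirds figure directly (2377 / 3588 ≈ 0.66):

# Proportion of sample entries marked defunct vs. not
prop.table(table(sample$Defunct))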

How many years of reporting it took for a nonprofit to go defunct:

[Figure: ReportedNonprofitsThatWentDefunct]

> org_id_freq <-  as.data.frame(table(sample$Org_ID))
> names(org_id_freq) <- c("Org_ID", "freq")
> sample.freq <- merge(sample, org_id_freq)
> table(paste(sample.freq$Defunct, sample.freq$freq))
FALSE 1 FALSE 2 FALSE 3 FALSE 4 FALSE 5  TRUE 1  TRUE 2  TRUE 3  TRUE 4 
   1020     908     330     104      15     538     416     201      56 
> plot(table(sample.freq[sample.freq$Defunct == FALSE,]$freq), col="black", type="l", ylab="number of nonprofits", xlab="years reported")
> lines(table(sample.freq[sample.freq$Defunct == TRUE,]$freq), type="l", col="red")
> legend("topright", col = c("black", "red"), legend = c("nonprofits", "defunct nonprofits"), lty=1, lwd=1)
> title("Reported Nonprofits That Went Defunct")

2D Relationships

Are there any relationships between total revenue, expenses, assets, and liabilities?

Brushing with ggobi

[Figure: ggobi brushing screenshot]

Using ggobi, I brushed through the data and found that there appeared to be a correlation between total revenue and total expenses. Indeed, this was confirmed by using the cor function to find the correlation between the columns. This makes sense, because the more money a nonprofit makes, the more it will likely spend.

> revenue.total.cor <- data.frame(category=c(), correlate=c())
> for (i in 1:33) {
+   col <- names(sample)[[i]]
+   revenue.total.cor <- rbind(revenue.total.cor,
+     data.frame(category=col, correlate=cor(sample[[col]], sample$revenue_total)))
+ }
...
expenses_total                                     expenses_total  0.855104943
...

There weren't really any other correlations of note between these categories. Government grants and total revenue seemed to be correlated... but then again, all types of revenue are probably positively correlated with total revenue.
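One quick way to scan for anything else of note is to sort the correlation table built above:

# Strongest correlations with revenue_total, largest first
revenue.total.cor[order(-revenue.total.cor$correlate), ]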

2D Correlation to Defunct

A hunch I wanted to follow was whether a human element (more money spent on people / space) would help keep a nonprofit alive. Is there a relationship between mortgage / personnel expenses and whether or not a nonprofit is defunct? It turns out no single column correlated much with defunct status (nothing above about 0.2):

[Figure: CorrelationToDefunct]

> defunct.cor <- data.frame(category=c(), correlate=c())
> for (i in 1:33) {
+   col <- names(sample)[[i]]
+   defunct.cor <- rbind(defunct.cor,
+     data.frame(category=col, correlate=cor(sample[[col]], sample$Defunct)))
+ }
> plot(defunct.cor)

Classification

Okay, so by this point I've decided that I want to tackle the following question:

  • Can we describe what makes an organization "defunct"? Can you build a classifier to identify defunct non-profits?

With my sampled dataset (10% of the original size), I will use the random forest technique on all the columns of the data to build a classifier, then test it out on the full dataset.

The first step was to remove the organization ID from the data frame (because it was the single best-correlated column!) and turn the Defunct column into a factor.
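That step isn't shown in my console history, but it amounts to something like this (a sketch; the column name Org_ID comes from the tables earlier):

# Drop the ID column and make Defunct a factor, so rpart/randomForest
# treat this as a classification problem rather than regression
sample.no.id <- sample[, names(sample) != "Org_ID"]
sample.no.id$Defunct <- as.factor(sample.no.id$Defunct)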

[Figure: ClassificationTree]

> library(rpart)
> fit <- rpart(Defunct ~ ., data=sample.no.id)
> plot(fit, uniform=TRUE, main="Classification Tree for Defunct Non-profits")
> text(fit, use.n=TRUE, all=TRUE, cex=.8)

It turns out this did work, and it also gave us a clue as to which columns the algorithm thinks contribute to predicting defunct or not.

Variables actually used in tree construction:
[1] assets_cash_eoy           assets_savings_eoy       
[3] expenses_program_services net_assets_eoy           
[5] revenue_sales_inventory  

Indeed, looking at a random forest trained on the same data, these rank among its most important variables.

> library(randomForest)
> sample.rf <- randomForest(Defunct ~ ., data=sample.no.id, ntree=1000, importance=TRUE, proximity=TRUE)
> imp <- as.data.frame(importance(sample.rf))
> tail(imp[order(imp$MeanDecreaseAccuracy),c(3,4)])
                          MeanDecreaseAccuracy MeanDecreaseGini
assets_savings_eoy                    32.25198         66.49494
expenses_total                        33.10249         76.00178
expenses_program_services             33.72821         77.65360
assets_total_eoy                      37.40280         83.23952
net_assets_eoy                        38.06725         91.61521
revenue_sales_inventory               66.17169         65.37851

Multidimensional Scaling

Let's use multidimensional scaling to visualize the random forest results. This gives a visual sense of how well the classifier separates the two classes.

[Figure: MDSplot]

> MDSplot(sample.rf, sample.no.id$Defunct)

As we can see, there do seem to be two distinct groups, but they also overlap. According to the random forest output, the classifier's out-of-bag (OOB) error rate isn't the greatest at 27.15%:

        OOB estimate of  error rate: 27.15%
Confusion matrix:
      FALSE TRUE class.error
FALSE  2196  181   0.0761464
TRUE    793  418   0.6548307
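(For the record, that summary comes from printing the fitted model:)

# Printing the fitted forest shows its OOB error and confusion matrix
print(sample.rf)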

Applying the Classifier

Now let's go back and test our classifier.

> data <- read.csv("defunct.csv")
> data$predict <- predict(sample.rf, data)
> table(data[,c('predict', 'Defunct')])
       Defunct
predict FALSE  TRUE class.error
  FALSE 22542  7085   0.2391400
  TRUE   1373  5395   0.2028664

As we know from before, the OOB error was 27.15%. How does this compare to what we just did?

> dim(data)
[1] 36395    34
> (22542+5395)/36395
[1] 0.7676054
> 1-((22542+5395)/36395)
[1] 0.2323946

It's a few points lower, at 23.24%. Most of those errors were false negatives (19.47% of all entries) and a few were false positives (3.77%).
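Those two components fall straight out of the table above:

# Components of the 23.24% test error
7085 / 36395   # false negatives (defunct orgs predicted FALSE): ~0.1947
1373 / 36395   # false positives (live orgs predicted TRUE): ~0.0377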

Conclusion

We have a classifier that is 76.76% accurate. This is... what it is. Not the best, not the worst. It could be useful to give back to Guidestar, especially to tell them which columns in their data appear most important in predicting whether or not a non-profit goes defunct. (download the classifier here)
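(One way the classifier could be packaged for download is base R serialization; this is an assumption about the linked file, not necessarily how it was made:)

# Serialize the fitted forest for sharing (hypothetical file name)
saveRDS(sample.rf, file = "defunct-classifier.rds")
# Later / elsewhere: load it back
sample.rf <- readRDS("defunct-classifier.rds")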

Interestingly, non-profits that have sales revenue from inventory appear to pretty much be okay. Assets appear to be the second most important indicator, and having higher expenses appears to be a good sign.

Lots more work could be done in improving the classifier. Perhaps building a classifier for each year would be a good idea to see if annual models outperform overall models.
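A rough sketch of what per-year models might look like, assuming the data has a reporting-year column (I'm calling it Year here; that column name is hypothetical):

# Fit one forest per reporting year (Year is a hypothetical column name)
library(randomForest)
models <- lapply(split(data, data$Year), function(d) {
  d <- d[, !(names(d) %in% c("Org_ID", "predict"))]  # drop the ID and the earlier prediction column
  d$Defunct <- as.factor(d$Defunct)
  randomForest(Defunct ~ ., data = d, ntree = 500)
})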
