
@kf0jvt
Last active January 2, 2016 01:39
just fucking around
http://www.utphysicians.com/21756/uthealth-informs-patients-incident-related-patient-information/ (20130830),http://healthitsecurity.com/2013/08/29/ut-physicians-informs-patients-of-data-breach/ (20130830)
https://oag.ca.gov/system/files/Final%20version%20of%20breach%20notification%20in%20PDF%20format%20%2800751822%29_0.PDF http://www.phiprivacy.net/burglar-snatches-laptop-with-patient-medical-records-from-san-jose-internists-office/
http://doj.nh.gov/consumer/security-breaches/documents/waste-management-20070403.pdf
# Script to pull all the VCDB incidents where a laptop was stolen, then build a
# document-term matrix from the articles used to code those incidents. Later we will
# use packages such as tm to analyze the text.
library(RMongo)
mongodb <- mongoDbConnect(dbName='kevin',host='localhost',port='27017')
collection <- 'vcdb'
# dot notation matches the variety field inside each element of the asset.assets array
stolen.querystring <- '{"action.physical.variety":"Theft","asset.assets.variety":"U - Laptop"}'
# Get the stolen laptop incidents from the database
stolen.laptops <- dbGetQuery(mongodb,collection,stolen.querystring)
# head(stolen.laptops['reference'])
# nrow(stolen.laptops)
# ncol(stolen.laptops)
# stolen.laptops[['reference']][1] or stolen.laptops$reference[1]
get.urls <- function(inUrl){
  # references are separated by semicolons and/or commas; normalize both to spaces,
  # split on spaces, then keep only the entries that look like URLs
  local.urls <- gsub(';', ' ', inUrl)
  local.urls <- gsub(',', ' ', local.urls)
  local.urls <- strsplit(local.urls, ' ')
  return(grep('^http.*[^pPdDfF].*$', local.urls[[1]], value=TRUE))
}
stolen.laptop.urls <- unlist(lapply(stolen.laptops$reference,get.urls))
# https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R
source("htmlToText.R")
stolen.laptop.text <- lapply(stolen.laptop.urls,htmlToText)
@krmaxwell

That regex on line #22 is wrong. I'd do something like:

^http.*(pdf|PDF)$

@krmaxwell

Missed that you wanted to exclude those. Not sure if there's a parameter in R's grep to return everything that doesn't match, but if you want everything that is a URL and does not end in 'PDF', use a negative lookahead

^http.*(?!.*pdf)$

and also pass ignore.case=TRUE.
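A minimal sketch of the lookahead approach in R's grep (the sample URLs are made up for illustration; note the lookahead is anchored right after `http` so it can scan the rest of the string, and `perl=TRUE` is required for lookaheads in R):

```r
urls <- c("http://example.com/breach-report.pdf",
          "http://example.com/article.html",
          "https://example.com/notice.PDF")
# negative lookahead anchored after 'http': keep only URLs that
# do not end in .pdf (case-insensitively)
non.pdf <- grep('^http(?!.*\\.pdf$)', urls,
                ignore.case = TRUE, perl = TRUE, value = TRUE)
# non.pdf: "http://example.com/article.html"
```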

@kf0jvt
Author

kf0jvt commented Jan 3, 2014

Turns out you also have to tell R that it is using Perl regular expressions, or it will bomb out with an invalid regex. The final version of line 22 is

return(grep('^http.*(?!.*pdf)$',local.urls[[1]],ignore.case=TRUE,perl=TRUE,value=TRUE))
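As an aside on the question above of whether R's grep can return the non-matches: it can, via `invert=TRUE`, which avoids the lookahead (and `perl=TRUE`) entirely. A minimal sketch with hypothetical URLs:

```r
urls <- c("http://example.com/breach-report.pdf",
          "http://example.com/article.html")
# invert=TRUE makes grep return the elements that do NOT match the pattern
non.pdf <- grep('\\.pdf$', urls, ignore.case = TRUE,
                invert = TRUE, value = TRUE)
# non.pdf: "http://example.com/article.html"
```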
