Linked to from http://lukecod.es/2012/11/18/random-problem-of-the-night/
This is a node.js crawler that crawls an entire site (using crawl) to find all of its internal links. It then tests each unique internal link for the presence of an optional string and parses the query string of each matching link into an object. All values with the same key in the query string are pushed to an array for that key.
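The grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the code in app.js: the function name `collectQueryValues` and its parameters are hypothetical, it uses Node's built-in `URL` class instead of whatever parser the crawler uses, and it assumes that each value is recorded only once per key.

```javascript
// Hypothetical helper (not taken from app.js): groups query string values by key.
function collectQueryValues(links, mustContain) {
  const result = {};
  for (const link of links) {
    // Skip links that do not contain the optional filter string.
    if (mustContain && !link.includes(mustContain)) continue;
    // URL is a global in modern Node.js; searchParams iterates [key, value] pairs.
    const params = new URL(link).searchParams;
    for (const [key, value] of params) {
      if (!result[key]) result[key] = [];
      // Assumption: a value is stored only once per key.
      if (!result[key].includes(value)) result[key].push(value);
    }
  }
  return result;
}

// Made-up links standing in for the crawler's list of unique internal links.
const links = [
  'http://site-to-crawl.com/only/return/links/containing/this/path?a=x&b=c',
  'http://site-to-crawl.com/only/return/links/containing/this/path?a=y&a=z&b=d',
  'http://site-to-crawl.com/other/page?a=ignored',
];
const filter = '/only/return/links/containing/this/path';
console.log(JSON.stringify(collectQueryValues(links, filter), null, 2));
```

With these sample links, the third link is skipped by the filter, so `a` collects `x`, `y`, `z` and `b` collects `c`, `d`. Note that values parsed this way are always strings.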
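Install the dependencies, then run the crawler with the site to crawl and an optional string that returned links must contain:
npm install
node app.js http://site-to-crawl.com /only/return/links/containing/this/path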
{
'a': [
'x',
'y',
'z'
],
'b': [
'c',
'd'
],
'z': [
1
]
}
For example:
node app.js http://www.biooncology.com /clinical-trials
{
"tumor": [
"breast cancer",
"cll",
"dlbcl",
"fnhl",
"colorectal cancer",
"gastric cancer",
"glioblastoma",
"lung cancer",
"melanoma",
"ovarian cancer",
"multiple myeloma",
"pancreatic cancer",
"other tumor types",
"renal cell carcinoma",
"colon cancer",
"liver cancer"
],
"drug": [
"pi3k inhibitor (gdc-0941)",
"pi3k/mtor inhibitor (gdc-0980)",
"obinutuzumab (ga101)",
"onartuzumab (metmab)",
"mek inhibitor (gdc-0973)",
"akt inhibitor (gdc-0068)",
"anti-egfl7",
"dulanermin"
]
}
This is now a repo: https://github.com/lukekarrys/node-qs-crawler