What do we need?
- Scrape a page
- store the result in a relational DB
- also store the result in a full text index
- expose search API
- on search hit, return the full record & any score
How do we do this quickly / easily?
- Scraping:
ReaderMode
- This is in FF already, and it works decently enough
let result;
let url = 'http://time.com/3983182/inflatable-minion-dublin';
ReaderMode.downloadAndParseDocument(url)
.then(function(x) { result = x; }, console.error);
result.url; // "http://time.com/3983182/inflatable-minion-dublin"
result.title; // "Giant Inflatable Minion Causes Chaos on Dublin Road"
result.byline; // "Sarah Begley @SCBegley"
result.excerpt; // "The balloon got loose from a fairground"
result.length; // 906
result.dir; // undefined (??? todo)
result.content; // the content is a mess, lot of html still in it,
// see https://gist.github.com/6a68/d074fcff51cc39aa7e11
- Store the result in a relational DB
- we could create our own parallel places sqlite DB (visits table, urls table, etc)
- this is a lot of work but we can create indexes without hurting anything
- MDN docs: https://developer.mozilla.org/en-US/docs/Mozilla/JavaScript_code_modules/Sqlite.jsm
- we could use IndexedDB (YDN has a full-text wrapper, too)
- Might run into space limitations: FF uses LRU to evict entire domains, we could lose all data without notice
- we could just overload the annotation field
- this is a lot less work, more of a prototypey hack, but could still actually work
- MDN docs: https://developer.mozilla.org/en-US/docs/Mozilla/Tech/Places/Using_the_Places_annotation_service
- postgres RDS???
- also store the result in a full text index
- not provided, sadly, by FF
- lunr.js is an option, no idea how it'll perform: http://lunrjs.com/
- interesting lunr + dexie + indexedDB demo: http://bl.ocks.org/nolanlawson/6f69f4a573c1da862e92
- maybe a bit more fully-featured: YDN
- base library wraps indexedDB: https://github.com/yathit/ydn-db
- same author supports full-text search on the same DB: https://github.com/yathit/ydn-db-fulltext
- this could give us both databases in the same spot, but then we're talking about replacing Places again
- AWS CloudSearch???
- expose search API
- on search hit, fetch the corresponding record & search score