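This list is a crash course, and can also serve as a reference for digging deeper into specific topics. It is essentially a summary of the computer science courses you never attended or have since forgotten, so it cannot go into much depth. The original is published as a GitHub gist, and anyone can contribute changes.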
This guide sets up a non-clustered Nutch crawler that stores its data in HBase. We will not cover how to set up Hadoop et al., just the bare minimum needed to crawl and index websites on a single machine.
- Nutch - the crawler (fetches and parses websites)
- HBase - data storage for Nutch (a database from the Hadoop ecosystem, basically)