In this lab, you'll build an inverted index from a subset of the Wikipedia dataset. As a reminder, an inverted index maps each term that appears in a collection of documents (e.g., "alligator", "Ryan") to the set of documents in which that term appears. An example inverted index might look like:
penguin -> 91, 12, 58, 10
rhino -> 184, 182, 87, 10
This inverted index indicates that the term "rhino" appears in documents 184, 182, 87, and 10. You'll be using MapReduce to make your inverted index. For this lab, you will only be running code on your local machine. Refer to the guide posted on LATTE or here for more information.
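To make the data structure concrete, here is a minimal plain-Java sketch of an in-memory inverted index. It is not the MapReduce solution and uses no Hadoop classes; the class name InvertedIndexSketch, the method build, and the whitespace-only tokenization are assumptions made for illustration. Conceptually, the inner loop plays the role of a mapper emitting (term, docID) pairs, and the grouping into sets plays the role of a reducer.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of the lab's starter code.
class InvertedIndexSketch {
    // Builds term -> sorted set of document IDs from (docID -> text) pairs.
    static Map<String, Set<Integer>> build(Map<Integer, String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // "Map" step: every whitespace-separated term yields a (term, docID) pair.
            for (String term : doc.getValue().toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue; // skip the empty token produced by leading whitespace
                }
                // "Reduce" step: collect all docIDs that share the same term.
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new TreeMap<>();
        docs.put(10, "penguin rhino");
        docs.put(12, "penguin");
        docs.put(87, "rhino");
        // Prints:
        // penguin -> [10, 12]
        // rhino -> [10, 87]
        build(docs).forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}
```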
You can find the data for this lab, a small (300 MB) subset of the Wikipedia data, here: http://cs.brandeis.edu/~rcmarcus/lab2_data.tar.xz. To extract that data, Windows users can use their preferred tool (such as 7-Zip), and Mac/Linux users can execute the following command (many Linux distributions and some Mac systems may also support simply double-clicking on the archive):
tar xJvf lab2_data.tar.xz
The resulting folder will contain three CSV files, each organized into lines that look like this:
docID, docURL, docTitle, doc content...
docID, docURL, docTitle, doc content...
docID, docURL, docTitle, doc content...
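One way to split such a line into its four fields is sketched below. It assumes that the first three fields contain no commas and that no CSV quoting is used, which may not hold for every line in the real data (a title such as "Paris, Texas" would break it). The class name DocLineSketch and its field names are hypothetical.

```java
// Hypothetical illustration class; field names are assumptions based on the line layout above.
class DocLineSketch {
    final long docId;
    final String url;
    final String title;
    final String content;

    DocLineSketch(long docId, String url, String title, String content) {
        this.docId = docId;
        this.url = url;
        this.title = title;
        this.content = content;
    }

    // Splits on the first three commas only, so commas inside the content are preserved.
    static DocLineSketch parse(String line) {
        String[] parts = line.split(",", 4);
        if (parts.length < 4) {
            throw new IllegalArgumentException("expected 4 comma-separated fields: " + line);
        }
        return new DocLineSketch(
                Long.parseLong(parts[0].trim()),
                parts[1].trim(),
                parts[2].trim(),
                parts[3].trim());
    }

    public static void main(String[] args) {
        DocLineSketch doc = parse("10, http://example.org/Rhino, Rhino, Rhinos are large, herbivorous mammals");
        System.out.println(doc.docId + " | " + doc.title + " | " + doc.content);
    }
}
```

Inside a mapper, a malformed line would more likely be skipped or counted than allowed to throw; the exception here just keeps the sketch short.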
Your task is to create an inverted index from these three data files (a single index should contain the data from all three files). Note that you can call TextInputFormat.addInputPath (a static method inherited from FileInputFormat) multiple times, once per file, to cause Hadoop MapReduce to process the data from more than one file.
As a reminder, the high-quality inverted index you will eventually need to build will have to:
- Remove "stop words", such as "the", "and", "or", etc.
- Remove junk tokens like "2" and "&"
- Normalize noisy, near-duplicate tokens, such as "Olga!" or "Olga?" (they should both be mapped to "olga")
- Stem words, such that "laughter", "laugh", and "laughing" all map to the same term (you may use an external library for this).
- Be run on the full dataset (~67 GB) on the cluster.
- Include the location of the term inside of the document
- ... possibly a few more things. :)
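Several of the requirements above (stop words, junk tokens, punctuation noise, and term locations) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, "location" is taken to mean the zero-based index of the whitespace-separated token, and stemming is left out because it needs an external library. The class name TokenNormalizerSketch and the method termPositions are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration class; not part of the lab's starter code.
class TokenNormalizerSketch {
    // Tiny sample list (an assumption); a real index would use a much fuller stop-word list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "and", "or", "a", "an", "of", "to", "in"));

    // Returns term -> positions, where a position is the index of the raw token
    // in the whitespace-split content (an assumed definition of "location").
    static Map<String, List<Integer>> termPositions(String content) {
        Map<String, List<Integer>> positions = new TreeMap<>();
        String[] raw = content.trim().split("\\s+");
        for (int i = 0; i < raw.length; i++) {
            // Lowercase, then drop every character that is not a Unicode letter:
            // "Olga!" and "Olga?" both become "olga"; "2" and "&" become empty.
            String term = raw[i].toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]", "");
            if (term.isEmpty() || STOP_WORDS.contains(term)) {
                continue; // junk token or stop word
            }
            positions.computeIfAbsent(term, t -> new ArrayList<>()).add(i);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Prints: {olga=[0, 6], rhino=[3]}
        System.out.println(termPositions("Olga! and the rhino & 2 Olga?"));
    }
}
```

Stripping all non-letters is a blunt choice: it also turns "don't" into "dont" and removes digits from tokens such as "mp3", so a final solution may want a more careful rule.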
Once you have completed your code, export it to a JAR and submit the JAR on LATTE. In the Eclipse export window, you must check the "Export Java source files and resources" box. Additionally, please make sure that your JAR contains neither data files nor your resulting inverted index.