Parsing Wikipedia dumps in Clojure
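
(ns wikiparse.core
  ;; clojure.data.xml parses lazily from a Reader, which is what keeps memory use bounded
  (:require [clojure.java.io :as io]
            [clojure.data.xml :as xml]
            [clojure.zip :refer [xml-zip]]
            [clojure.data.zip.xml :refer [xml-> xml1-> text]])
  (:import [org.apache.commons.compress.compressors.bzip2 BZip2CompressorInputStream])
  (:gen-class :main true))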

(defn bz2-reader
  "Returns a streaming Reader for the given compressed BZip2
  file. Use within (with-open)."
  [filename]
  (-> filename io/file io/input-stream BZip2CompressorInputStream. io/reader))

(defn process-music-artist-page
  "Process a wikipedia page, print the title if it's a musical artist"
  [page]
  (let [z (xml-zip page)
        title (xml1-> z :title text)
        page-text (xml1-> z :revision :text text)]
    ;; page-text is nil when a page has no revision text; re-find would throw on nil
    (when (and page-text (re-find #"\{\{Infobox musical artist" page-text))
      (println title))))

(defn wiki-music-artists
  "parse up to [max] pages from a wikipedia dump, print out those that are musical artists"
  [filename max]
  (with-open [rdr (bz2-reader filename)]
    ;; dorun forces the lazy pipeline while the reader is still open
    (dorun (->> (xml/parse rdr)
                :content ; the pages are children of the root element
                (filter #(= :page (:tag %)))
                (take max)
                (map process-music-artist-page)))))

(def wikifile "enwiki-latest-pages-articles.xml.bz2")

(defn -main
  [& args]
  (wiki-music-artists wikifile 100000000))
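
The infobox check inside process-music-artist-page can be tried at a REPL without the dump. This is only a sketch: musical-artist? is a hypothetical helper name and the sample strings are made up for illustration, not taken from the dump.

```clojure
(defn musical-artist?
  "True when the wikitext contains a musical-artist infobox.
  Hypothetical helper, not part of the original code."
  [page-text]
  (boolean (and page-text
                (re-find #"\{\{Infobox musical artist" page-text))))

;; Made-up sample wikitext, for illustration only.
(def sample-artist
  "{{Infobox musical artist\n| name = Example Band\n}}\nExample Band are a band.")
(def sample-other
  "{{Infobox settlement\n| name = Exampleville\n}}")

(println (musical-artist? sample-artist)) ; true
(println (musical-artist? sample-other))  ; false
(println (musical-artist? nil))           ; false, a page with no text is skipped
```

Note that the pattern is case-sensitive, so text starting with a lowercase "{{infobox musical artist" would not match. Whether that spelling actually occurs in the dump is not something this sketch checks.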

Note that the Wikipedia dump is available as a torrent. It's 9G bzipped, or 42G if you unzip it (which is why wiki-music-artists reads the bzipped version directly!)

The earlier version of this, which only looked at page titles, took 94 minutes to parse the 42G XML file on my MacBook Pro.

This version takes 115 minutes, presumably because of the extra work of running a regular expression over the text of every page. Peak memory use is around 800M.
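
To put the two timings side by side, here is a rough back-of-the-envelope calculation. It assumes "42G" means 42 * 1024 MB and uses only the figures quoted above. The function names are made up for this sketch.

```clojure
;; Assumption: the 42G uncompressed dump is treated as 42 * 1024 MB.
(def xml-size-mb (* 42 1024))

(defn throughput-mb-per-sec
  "Average MB of uncompressed XML processed per second for a run of `minutes`."
  [minutes]
  (/ xml-size-mb (* minutes 60.0)))

(defn slowdown-percent
  "How much longer `new-minutes` is than `old-minutes`, as a percentage."
  [old-minutes new-minutes]
  (* 100.0 (/ (- new-minutes old-minutes) old-minutes)))

(println (format "title-only: %.1f MB/s" (throughput-mb-per-sec 94)))
(println (format "with regex: %.1f MB/s" (throughput-mb-per-sec 115)))
(println (format "slowdown:   %.0f%%" (slowdown-percent 94 115)))
```

Under that assumption the title-only run averages roughly 7.6 MB/s of XML and the regex run roughly 6.2 MB/s, so the regex version takes about 22% longer.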

