Last active: January 2, 2021 18:41
Gist: kornysietsma/5939456
parsing wikipedia dumps in clojure
(ns wikiparse.core
  (:require [clojure.java.io :as io]
            [clojure.data.xml :as xml]
            [clojure.zip :refer [xml-zip]]
            [clojure.data.zip.xml :refer [xml-> xml1-> text]])
  (:import [org.apache.commons.compress.compressors.bzip2 BZip2CompressorInputStream])
  (:gen-class :main true))

(defn bz2-reader
  "Returns a streaming Reader for the given compressed BZip2
  file. Use within (with-open)."
  [filename]
  (-> filename io/file io/input-stream BZip2CompressorInputStream. io/reader))

(defn process-music-artist-page
  "Process a wikipedia page, printing the title if it's a musical artist"
  [page]
  (let [z (xml-zip page)
        title (xml1-> z :title text)
        page-text (xml1-> z :revision :text text)]
    ;; guard against pages with no revision text, which would otherwise
    ;; throw a NullPointerException inside re-find
    (when (and page-text
               (re-find #"\{\{Infobox musical artist" page-text))
      (println title))))

(defn wiki-music-artists
  "Parse up to [max] pages from a wikipedia dump, printing those that are musical artists"
  [filename max]
  (with-open [rdr (bz2-reader filename)]
    ;; xml/parse produces a lazy tree, so dorun walks the pages one at a
    ;; time for side effects without retaining them in memory
    (dorun (->> (xml/parse rdr)
                :content
                (filter #(= :page (:tag %)))
                (take max)
                (map process-music-artist-page)))))

(def wikifile "enwiki-latest-pages-articles.xml.bz2")

(defn -main
  [& args]
  (wiki-music-artists wikifile 100000000))
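The functions above can also be exercised from a REPL without invoking -main. A minimal sketch; the small test filename and the page count are assumptions for illustration, not part of the gist:

```clojure
(require '[wikiparse.core :refer [wiki-music-artists]])

;; Parse only the first 1000 pages of a (hypothetical) local dump,
;; printing the titles of any that contain the musical-artist infobox:
(wiki-music-artists "enwiki-sample-pages.xml.bz2" 1000)
```

Keeping `max` small is useful for checking the parsing logic before committing to a multi-hour run over the full dump.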
Note: you can get a torrent of the wikipedia dump at http://burnbit.com/torrent/246958/enwiki_latest_pages_articles_xml_bz2 - it's 9G bzipped, or 42G unzipped (which is why the code above works on the bzipped version!).
An earlier version of this, which only looked at page titles, took 94 minutes to parse the 42G XML file on my Macbook Pro.
This version takes 115 minutes, presumably due to the extra effort of running regular expressions over the text of every page. Peak memory use is around 800M.