- Go to https://dumps.wikimedia.org/backup-index.html
- Go to your language's wiki and download the `-pages-meta-current.xml.bz2` version
- Extract the archive
- Flatten with `pv file.xml | xml2 > file.flat`
- Extract text with `pv file.flat | grep -oP "(?<=^/mediawiki/page/revision/text=).*" > file.txt`
- Clean with `pv fawiki-20200220-pages-articles.txt | node clean.js`
- Enjoy using `out.txt`
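
To see what the text-extraction step keeps and drops, here is a minimal sketch that runs the same `grep` pattern on a tiny hand-written sample. The sample lines only imitate the `path=value` shape of flattened output and are not taken from a real dump; the `workdir` variable and file names are hypothetical, and GNU grep with `-P` support is assumed.

```shell
#!/bin/sh
set -eu

# Hypothetical sample imitating flattened "path=value" lines (not real dump data).
workdir="$(mktemp -d)"
cat > "$workdir/sample.flat" <<'EOF'
/mediawiki/page/title=Example
/mediawiki/page/revision/id=1
/mediawiki/page/revision/text/@xml:space=preserve
/mediawiki/page/revision/text='''Example''' is a [[sample]] page.
/mediawiki/page/revision/text=Second line of the same article.
EOF

# The lookbehind matches the path prefix without printing it, so -o emits
# only the wikitext that follows "text=". Titles, ids and attribute lines
# do not match and are dropped.
grep -oP '(?<=^/mediawiki/page/revision/text=).*' "$workdir/sample.flat" > "$workdir/sample.txt"

cat "$workdir/sample.txt"
```

Only the two article lines survive, still carrying wiki markup such as `'''` and `[[...]]`, which is why a separate cleaning step follows.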
- Node.js + [optional] yarn + the `wtf_wikipedia` dependency
- Unix tools: `pv` + `xml2`
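
Before starting, it may help to confirm the tools listed above are on your `PATH`. The following is a small sketch; `missing_tools` is a hypothetical helper name, not part of this project.

```shell
#!/bin/sh

# Print each requested command that cannot be found on PATH.
# Returns 0 when everything is present, 1 otherwise.
missing_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Check the tools named in the list above.
missing_tools node pv xml2 || echo "install the missing tools before running the pipeline"
```

The check only tests that each command exists; it does not verify versions or that `wtf_wikipedia` is installed in the project.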