- Go to https://dumps.wikimedia.org/backup-index.html
- Pick your language and download its `-pages-meta-current.xml.bz2` version
- Extract the archive
- Flatten with `pv file.xml | xml2 > file.flat`
- Extract text with `pv file.flat | grep -oP "(?<=^/mediawiki/page/revision/text=).*" > file.txt`
- Clean with `pv fawiki-20200220-pages-articles.txt | node clean.js` (use your own text file from the previous step as input)
- Enjoy using `out.txt`

Requirements:
- Node.js + [optional] yarn + `wtf_wikipedia` dependency
- Unix tools: `pv` + `xml2`
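
The text extraction step above keeps only the lines of the flattened dump that carry article text and drops the path prefix. As a rough illustration, here is a minimal dependency-free Node.js sketch of the same filter. The function name `extractText` and the sample lines are hypothetical, and the sample assumes that `xml2` writes one `path=value` pair per line.

```javascript
// Sketch only: mimics the grep extraction step in plain Node.js.
const PREFIX = '/mediawiki/page/revision/text=';

// Keep lines that start with the text path and have content after it,
// then strip the path prefix so only the wikitext remains.
function extractText(flat) {
  return flat
    .split('\n')
    .filter((line) => line.startsWith(PREFIX) && line.length > PREFIX.length)
    .map((line) => line.slice(PREFIX.length));
}

// Hypothetical sample of flattened output (assumed shape, not real dump data).
const sample = [
  '/mediawiki/page/title=Example',
  '/mediawiki/page/revision/id=1',
  '/mediawiki/page/revision/text/@bytes=42',
  "/mediawiki/page/revision/text='''Example''' is a [[page]].",
  '/mediawiki/page/revision/text=',
  '/mediawiki/page/revision/text=Second line.',
].join('\n');

const lines = extractText(sample);
console.log(lines.join('\n'));
```

This sketch reads its whole input into memory, so for a real dump the `grep` pipeline (or a streaming reader) is likely the better fit.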