@pi0
Last active February 27, 2020 14:16
A Dummy Way to Extract Clean Wikipedia Contents
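// clean.js: reads wikitext lines from stdin and writes plain text to out.txt
const fs = require('fs')
const parse = require('wtf_wikipedia')
const readline = require('readline')

// Read stdin line by line; terminal: false keeps readline from treating it as a TTY
const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
  terminal: false
})

// Open the output file once and close it when the process exits
const fd = fs.openSync('out.txt', 'w+')
process.on('exit', () => { fs.closeSync(fd) })

// Lines starting with one of these characters are treated as markup and skipped
const startChars = '-*{}[]#!@%^&<>|'.split('')

rl.on('line', line => {
  line = line.trim()
  if (!line.length) { return }
  if (startChars.includes(line[0])) { return }
  try {
    // Convert the wikitext line to plain text
    const parsed = parse(line)
    const text = parsed.text().trim()
    if (!text.length) { return }
    fs.writeSync(fd, text + '\n')
  } catch (e) {
    console.log(line)
    console.log('Error while parsing: ' + e)
  }
})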
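
The line filter in the script can be tried on its own, without wtf_wikipedia installed. Below is a minimal sketch that pulls the same check into a helper. The names SKIP_CHARS and shouldSkip are hypothetical and not part of the original script; the character list is copied from it.

```javascript
// Characters that mark a line as markup rather than prose (copied from clean.js)
const SKIP_CHARS = new Set('-*{}[]#!@%^&<>|')

// Hypothetical helper: true when clean.js would drop the line before parsing it
function shouldSkip (rawLine) {
  const line = rawLine.trim()
  return line.length === 0 || SKIP_CHARS.has(line[0])
}
```

Note that the check only looks at the first non-blank character, so a line such as '''Bold''' text still reaches the parser, while list items, templates and tables are dropped early.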

Dump Wikipedia

  • Go to https://dumps.wikimedia.org/backup-index.html
  • Pick the wiki for your language and download its -pages-meta-current.xml.bz2 dump
  • Extract the archive to get file.xml
  • Flatten with pv file.xml | xml2 > file.flat
  • Extract text with pv file.flat | grep -oP "(?<=^/mediawiki/page/revision/text=).*" > file.txt
  • Clean with pv file.txt | node clean.js (in my case file.txt was named fawiki-20200220-pages-articles.txt)
  • Enjoy using out.txt
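
The grep step above relies on -P (Perl-style regular expressions), which is a GNU grep option and may be missing elsewhere. As a rough sketch, the same prefix match could be done in Node.js instead. The names TEXT_PREFIX and extractText are hypothetical, and the sketch assumes the flattened lines look exactly like the pattern in the grep command.

```javascript
// Path prefix that xml2 puts in front of article text (taken from the grep pattern)
const TEXT_PREFIX = '/mediawiki/page/revision/text='

// Hypothetical helper: returns the text after the prefix, or null when the
// line is not a text line or has nothing after the prefix
function extractText (line) {
  if (!line.startsWith(TEXT_PREFIX)) { return null }
  const rest = line.slice(TEXT_PREFIX.length)
  return rest.length ? rest : null
}
```

A helper like this could be called at the top of the 'line' handler in clean.js, so that file.flat is piped straight into the script and the intermediate file.txt is not needed.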

Requirements

  • Node.js with the wtf_wikipedia dependency (install it with npm or, optionally, yarn)
  • Unix tools: pv + xml2 + a grep that supports -P (GNU grep)