@dotike
Created January 4, 2014 16:47
wikipedia dumps, fetch last-1-good dump
#!/bin/sh
##############################################################################
# This code is distributed under the following terms:
#
# Copyright (c) 2012-2013 Isaac (.ike) Levy <[email protected]>.
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
# OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
# HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
# LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
# OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE.
##############################################################################
# Simple fetch of the last known-good Wikipedia dump; see
# http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps
#
# This program can be called from cron.
# REMOTESRC and LOCALMIRROR can be set in the crontab (cron passes them to
# the job as environment variables), for example:
#
## REMOTESRC="ftpmirror.your.org/wikimedia-dumps"
## LOCALMIRROR="/path/to/somewhere"
## 15 4 * * 6 user lockf -t 300 /tmp/wikipediasync.lock /path/to/this/program
#
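# error helpers: shout prints a message to stderr, die prints it and exits
# with status 111, try runs a command and dies if the command fails
shout() { echo "$0: $*" >&2; }
die() { shout "$*"; exit 111; }
try() { "$@" || die "cannot $*"; }
# defaults, used only when the environment does not already set these
REMOTESRC="${REMOTESRC:-ftpmirror.your.org/wikimedia-dumps}"
LOCALMIRROR="${LOCALMIRROR:-/tmp/wikipedia}"
# local copy of the mirror's list of files in the last known-good dump
_lastgood="${LOCALMIRROR}/rsync-filelist-last-1-good.txt"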
logger "START wikipedia last dump sync"
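# make sure the local mirror directory exists before rsync writes into it
try mkdir -p "${LOCALMIRROR}"
# fetch the last-good files list
try rsync -avz --quiet --delete "rsync://${REMOTESRC}/rsync-filelist-last-1-good.txt" "${_lastgood}"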
# fetch the last-good files
try rsync -avz --quiet --files-from="${_lastgood}" --delete "rsync://${REMOTESRC}/" "${LOCALMIRROR}/wikipedia-1-good/"
logger "FINISH wikipedia last dump sync"
true
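
One possible hardening step, which is not part of the script above, is to check the downloaded file list before the second rsync runs, so that a missing or empty list is reported instead of quietly syncing nothing. The sketch below is only an illustration: the helper names `filelist_ok` and `filelist_count` are hypothetical, and it assumes the list holds one path per line.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): succeed only if the
# file list exists, is non-empty, and has at least one non-blank line.
filelist_ok() {
    [ -s "$1" ] && grep -q '[^[:space:]]' "$1"
}

# Hypothetical helper: print the number of non-blank lines in the list.
filelist_count() {
    grep -c '[^[:space:]]' "$1"
}
```

If these were adopted, a line such as `filelist_ok "${_lastgood}" || die "empty or missing file list: ${_lastgood}"` could sit between the two rsync calls, and `filelist_count` could feed the `logger` message.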