Dan Andreescu milimetric

  • Wikimedia Foundation
  • New York, NY
@milimetric
milimetric / .gitignore
Last active December 15, 2015 12:49
Which skin do Wikipedia Editors use? Only look at editors with 5 or more edits over the past 30 days.
*swp
def get_udp2log_ports():
    """Returns the listen ports of running udp2log processes"""
    pattern = "/usr/bin/udp2log"
    return [get_p(cmd) for cmd in [get_cmd(pid) for pid in iter_pids()] if has_p(pattern, cmd)]

def has_p(pattern, cmd):
    return pattern in cmd[0] and '-p' in cmd

def get_p(cmd):
    return int(cmd[cmd.index('-p') + 1])
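def deduplicate(list_of_objects, key_function):
    # keep the first object seen for each key
    uniques = dict()
    for o in list_of_objects:
        key = key_function(o)
        if key not in uniques:
            uniques[key] = o
    # list() so callers get a real list on Python 3 as well
    return list(uniques.values())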
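A small usage sketch for `deduplicate`, assuming the function behaves as shown above (it is repeated here so the example runs on its own). The sample records and the `user_id` key are made up for illustration.

```python
def deduplicate(list_of_objects, key_function):
    """Keep the first object seen for each key."""
    uniques = dict()
    for o in list_of_objects:
        key = key_function(o)
        if key not in uniques:
            uniques[key] = o
    return list(uniques.values())


# Hypothetical sample data: one record per edit, several edits per editor.
edits = [
    {"user_id": 1, "skin": "vector"},
    {"user_id": 2, "skin": "monobook"},
    {"user_id": 1, "skin": "cologneblue"},
]

# One record per editor; the first record for each user_id wins.
unique_editors = deduplicate(edits, lambda e: e["user_id"])
print(unique_editors)
```

Because the first object per key is kept, the order of the input decides which duplicate survives.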
@milimetric
milimetric / umapi.parallel.test
Last active December 17, 2015 10:09
quick and dirty script to test parallelism on umapi
# replace u and p below with the proper usernames and passwords
username=u
password=p
htuser=u
htpass=p
curl --data "username=$username&password=$password" https://$htuser:[email protected]/login -c ~/umapi.session
for cohort in test e2_aft5_cta4 e3_ob2b_gettingstarted_page-impression e3_ob4b_gettingstarted-addlinks_page-impression e3_ob4b_gettingstarted-clarify_page-impression e3_ob4b_gettingstarted-copyedit_page-impression
do
/srv/debugging.wmflabs.org/
/srv/dev-reportcard.wmflabs.org/
/srv/ee-dashboard.wmflabs.org/
/srv/gerrit-stats.wmflabs.org/
/srv/gp.wmflabs.org/
/srv/mobile-reportcard-dev.wmflabs.org/
/srv/mobile-reportcard.wmflabs.org/
/srv/test-reportcard.wmflabs.org/
REGISTER 'kraken-pig-0.0.2-SNAPSHOT.jar'
REGISTER 'kraken-generic-0.0.2-SNAPSHOT-jar-with-dependencies.jar'
REGISTER 'geoip-1.2.5.jar'
IMPORT 'include/load_webrequest.pig';
SET default_parallel 2;
DEFINE TO_HOUR org.wikimedia.analytics.kraken.pig.ConvertDateFormat('yyyy-MM-dd\'T\'HH:mm:ss', 'yyyy-MM-dd_HH');
DEFINE EXTRACT org.apache.pig.builtin.REGEX_EXTRACT_ALL();
DEFINE ZERO org.wikimedia.analytics.kraken.pig.Zero();
LOG_FIELDS = LOAD_WEBREQUEST('/wmf/raw/webrequest/webrequest-wikipedia-mobile/dt=2013-05-01*');
LOG_FIELDS = FILTER LOG_FIELDS BY (x_cs != '-');
self.create_test_cohort(
    editor_count=4,
    revisions_per_editor=3,
    revision_timestamps=[
        [
            datetime(2012, 12, 31, 23, 0, 0),
            datetime(2013, 1, 1, 0, 30, 0),
            datetime(2013, 1, 1, 1, 0, 0),
        ],
        [
@milimetric
milimetric / Missing Sequence Numbers in Hive
Last active January 11, 2023 16:43
Two ways to find missing sequence numbers in huge Hive tables. The first way gets the left and right boundaries of each run of missing sequence numbers. The second way gets the count of boundaries; a count greater than 2 signifies missing sequence numbers. The second way doesn't tell you which sequence numbers are missing or how many, but it runs faster.
/* Common setup, two variants follow
*/
use test;
set tablename=webrequest_esams0;
add jar /home/otto/hive-serdes-1.0-SNAPSHOT.jar;
add jar /usr/lib/hive/lib/hive-contrib-0.10.0-cdh4.3.1.jar;
create temporary function rowSequence AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
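The two approaches from the description can be illustrated outside Hive with a small Python sketch. This is not the gist's HQL. The function names `missing_runs` and `boundary_count` and the sample data are hypothetical, and the reading of "boundary" (a present sequence number whose predecessor or successor is absent) is an assumption.

```python
def missing_runs(seqs):
    """First way: (first_missing, last_missing) for each run of missing numbers."""
    ordered = sorted(set(seqs))
    runs = []
    for left, right in zip(ordered, ordered[1:]):
        if right - left > 1:
            runs.append((left + 1, right - 1))
    return runs


def boundary_count(seqs):
    """Second way: count run starts and run ends among the present numbers.

    A gap-free sequence has exactly one start and one end (count 2);
    every gap adds one more end and one more start.
    """
    present = set(seqs)
    starts = sum(1 for s in present if s - 1 not in present)
    ends = sum(1 for s in present if s + 1 not in present)
    return starts + ends


sample = [1, 2, 3, 6, 7, 10]        # hypothetical sequence numbers
print(missing_runs(sample))         # which numbers are missing
print(boundary_count(sample) > 2)   # only whether something is missing
```

The second function never sorts or pairs neighbours, which mirrors the trade-off in the description: it answers only whether gaps exist, not where they are.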
@milimetric
milimetric / import_one_hour.sh
Last active January 7, 2020 15:35
Imports one hour of Domasz's pageview data into a partitioned Hive table. It takes roughly 12 seconds to import one hour of data, btw. This means roughly 8 days to import all 7 years of data.
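The 8-day estimate follows from simple arithmetic. The sketch below assumes 365-day years and strictly sequential imports; both are assumptions, not details from the script.

```python
SECONDS_PER_IMPORTED_HOUR = 12   # "roughly 12 seconds" per hour of data
YEARS_OF_DATA = 7

hours_of_data = YEARS_OF_DATA * 365 * 24            # ignores leap days
total_seconds = hours_of_data * SECONDS_PER_IMPORTED_HOUR
total_days = total_seconds / (24 * 60 * 60)

print(hours_of_data, total_seconds, round(total_days, 1))
```

Running several imports in parallel would shorten the wall-clock time proportionally, provided the download and HDFS steps are not the bottleneck.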
#!/bin/bash
#
# This script does the following:
# 0. reads four arguments from the CLI, in order, as YEAR, MONTH, DAY, HOUR
# 1. downloads the specified hour worth of data from http://dumps.wikimedia.org/other/pagecounts-raw/
# 2. extracts the data into hdfs
# 3. creates a partition on a hive table pointing to this data
#
print_help() {
@milimetric
milimetric / aggregate_daily.hql
Created October 4, 2013 19:57
Hive script to create an internal table and insert hourly data aggregated at the daily level.
DROP TABLE IF EXISTS milimetric_pagecounts_daily;
CREATE TABLE IF NOT EXISTS milimetric_pagecounts_daily(
    project string,
    page string,
    views int,
    bytes int,
    year int,
    month int,
    day int
)