Andrew Otto ottomata

// spark2-shell --jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar
/**
* Use RefineTarget.find to find all Refine targets for an input (camus job) in the last N hours.
* Then filter for any for which the _REFINED_FAILED flag exists.
*/
import org.apache.hadoop.fs.Path
import org.joda.time.format.DateTimeFormatter
import com.github.nscala_time.time.Imports._
ottomata / spark-amm.sh
Created October 2, 2019 14:25
spark + ammonite
#!/usr/bin/env bash
export SPARK_HOME="${SPARK_HOME:-/usr/lib/spark2}"
export SPARK_CONF_DIR="${SPARK_CONF_DIR:-"${SPARK_HOME}"/conf}"
source "${SPARK_HOME}/bin/load-spark-env.sh"
export HIVE_CONF_DIR=${SPARK_CONF_DIR}
export HADOOP_CONF_DIR=/etc/hadoop/conf
AMMONITE=~/bin/amm # This is amm binary release 2.11-1.6.7
ottomata / backfill_mediawiki_revision_score.scala
Last active October 2, 2019 14:17
backfill_mediawiki_revision_score.scala
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
// We need this to convert the out of order new schema to the new hive table schema.
// This also is used to drop columns that aren't in the new hive table schema.
import org.wikimedia.analytics.refinery.spark.sql.HiveExtensions._
// Get the new desired field schemas
val mediawiki_revision_score_2 = spark.table("event.mediawiki_revision_score")
== What onboarding projects are options? What about mixing some research questions with the task of processing data (for example, finding patterns among those who open an account on Wikipedia)? @Joseph
* wikidump text analysis?
** category analysis?
Take Tiziano's code and use hadoop instead of wikidump text.
(1st, 2nd) ** historical redirect analysis, add to mediawiki_history (very useful for Analytics)
Please see: https://phabricator.wikimedia.org/T232123
ottomata / netflow_spark_structured_streaming.py
Created September 10, 2019 20:57
Spark Streaming SQL demo with netflow
# From stat1004:
# pyspark2 --jars ~otto/spark-sql-kafka-0-10_2.11-2.3.1.jar,~otto/kafka-clients-1.1.0.jar
# Need spark-sql-kafka for DataStream source and kafka-clients for Kafka serdes.
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Declare a Spark schema that matches the JSONData.
# In a future MEP world this would be automatically loaded
# from a JSONSchema.
function generateSchemaTests(title, majorVersion, schemaInfos) {
it(`All ${title} schemas should have title ${title} `, function() {
schemaInfos.forEach((info) => {
assert.equal(info.schema.title, title);
});
});
it(`All ${title} major version ${majorVersion} schemas should be ${majorVersion}.x.y`, function() {
schemaInfos.forEach((info) => {
assert.equal(semver.coerce(_.get(info.schema, '$id')).major, majorVersion);
});
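The test helper above checks two things for every schema: that its title matches, and that the major version in its `$id` matches the expected one. A rough Python sketch of the same checks follows. The names `major_version` and `check_schemas` are hypothetical, and the regex assumes the `$id` ends in `x.y.z` (as in `/mediawiki/revision/score/1.0.0`), which is stricter than what `semver.coerce` accepts.

```python
import re


def major_version(schema_id):
    """Return the major version from an $id such as /mediawiki/revision/score/1.0.0.

    Assumption: the $id ends with a full x.y.z version.
    """
    match = re.search(r"/(\d+)\.(\d+)\.(\d+)$", schema_id)
    if match is None:
        raise ValueError("no x.y.z version at the end of $id: %r" % schema_id)
    return int(match.group(1))


def check_schemas(title, expected_major, schemas):
    """Return a list of problem descriptions; an empty list means every check passed."""
    problems = []
    for schema in schemas:
        schema_id = schema.get("$id", "")
        # Check 1: every schema in the group carries the expected title.
        if schema.get("title") != title:
            problems.append("%s: title is %r, expected %r"
                            % (schema_id, schema.get("title"), title))
        # Check 2: the major version in $id matches the expected major version.
        if major_version(schema_id) != expected_major:
            problems.append("%s: major version is not %d"
                            % (schema_id, expected_major))
    return problems


if __name__ == "__main__":
    schemas = [
        {"title": "mediawiki/revision/score",
         "$id": "/mediawiki/revision/score/1.0.0"},
    ]
    print(check_schemas("mediawiki/revision/score", 1, schemas))
```

Returning a list of problems instead of asserting on the first mismatch reports every offending schema in one pass; a test runner could simply assert that the list is empty.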
title: mediawiki/revision/score
description: Represents a MW Revision Score event (from ORES).
$id: /mediawiki/revision/score/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
allOf:
### revision-score does not include all revision/common fields, so we
### don't include revision/common schema, and instead specifically list
### the ones we need.
- $ref: /mediawiki/common/1.0.0
# Stop your Jupyter Notebook server from the JupyterHub UI.
# Move your old venv out of the way (or just delete it)
mv $HOME/venv $HOME/venv-old-$().$(date +%s)
# create a new empty venv
python3 -m venv --system-site-packages $HOME/venv
# Reinstall the jupyter venv
cd /srv/jupyterhub/deploy
$HOME/venv/bin/pip install --upgrade --no-index --force-reinstall --find-links=/srv/jupyterhub/deploy/artifacts/stretch/wheels --requirement=/srv/jupyterhub/deploy/frozen-requirements.txt
You are given a very large list of unsorted integers. These
integers are supposed to be unique and, if sorted, contiguous. However, you
suspect that this is not the case, so you want to write code to check for
missing or duplicate integers. Write code to return these results:
- Are there any missing or duplicate integers?
- How many missing integers?
- How many duplicate integers?
- Which integers are missing?
- Which integers are duplicates, and how many duplicates of each?
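One possible approach to the exercise above, as a minimal Python sketch. It rests on several assumptions that the problem statement does not settle: the expected range runs from the smallest to the largest value actually observed, a "duplicate" is counted as each extra copy beyond the first, and the distinct values fit in memory. The function name `audit_integers` is hypothetical. For a list too large for one machine, the same counting idea could be expressed as a distributed group-by.

```python
from collections import Counter


def audit_integers(values):
    """Report missing and duplicate integers in an unsorted iterable.

    Assumption: the expected contiguous range is min(values)..max(values).
    """
    counts = Counter(values)  # one pass over the input, O(distinct values) memory
    if not counts:
        return {
            "has_problems": False,
            "missing_count": 0,
            "duplicate_count": 0,
            "missing": [],
            "duplicates": {},
        }

    lowest, highest = min(counts), max(counts)
    # Any integer inside the observed range that never occurred is missing.
    missing = [n for n in range(lowest, highest + 1) if n not in counts]
    # Map each repeated integer to the number of extra copies beyond the first.
    duplicates = {n: c - 1 for n, c in counts.items() if c > 1}

    return {
        "has_problems": bool(missing or duplicates),
        "missing_count": len(missing),
        "duplicate_count": sum(duplicates.values()),
        "missing": missing,
        "duplicates": duplicates,
    }


if __name__ == "__main__":
    report = audit_integers([3, 1, 4, 4, 7, 1, 1])
    print(report["missing"])     # [2, 5, 6]
    print(report["duplicates"])  # {1: 2, 4: 1}
```

A single `Counter` pass answers all five questions at once. If the true endpoints of the range are known in advance, they would replace `lowest` and `highest` so that values missing at either end are also detected.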
title: mediawiki/page/links-change
description: Represents a MW Page Links Change event.
$id: /mediawiki/page/links-change/1.1.0
$schema: 'https://json-schema.org/draft-07/schema#'
type: object
required:
- $schema
- meta
- page_id
- page_is_redirect