// Quick and hacky script that will use dyff to show the diff between
// any modified materialized schema version and its previous version.
//
// Defaults to using https://github.com/homeport/dyff, so install that first.
// This could be cleaned up and incorporated into jsonschema-tools itself, and
// then shown in CI.
//
const jsonschema_tools = require('@wikimedia/jsonschema-tools');
const _ = require('lodash');
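// Below: a hedged sketch of the dyff invocation this script automates, using
// dyff's `between` subcommand. The schema file paths are hypothetical, not
// from the original script.
const { execSync } = require('child_process');
const previousVersion = 'jsonschema/analytics/test/event/1.0.0.yaml'; // hypothetical path
const currentVersion = 'jsonschema/analytics/test/event/1.1.0.yaml';  // hypothetical path
console.log(execSync(`dyff between ${previousVersion} ${currentVersion}`).toString());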
# Pyflink Streaming Table Env + Event Utilities Event Platform integration.
#
# Download the Wikimedia Event Utilities Flink jar, e.g.
# https://archiva.wikimedia.org/#artifact-details-download-content/org.wikimedia/eventutilities-flink/1.2.0
# wget http://archiva.wikimedia.org/repository/releases/org/wikimedia/eventutilities-flink/1.2.0/eventutilities-flink-1.2.0-jar-with-dependencies.jar
#
# Also download the other dependencies we need:
# wget https://repo1.maven.org/maven2/org/apache/flink/flink-connector-kafka/1.15.2/flink-connector-kafka-1.15.2.jar
# wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.2.3/kafka-clients-3.2.3.jar
#
# If you like, use ipython for your REPL, it's nicer!
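# Below: a minimal sketch of bootstrapping the streaming table environment with
# the jars downloaded above. The local jar paths are assumptions; point them at
# wherever you ran the wgets.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.add_jars(
    "file:///path/to/eventutilities-flink-1.2.0-jar-with-dependencies.jar",  # assumed path
    "file:///path/to/flink-connector-kafka-1.15.2.jar",                      # assumed path
    "file:///path/to/kafka-clients-3.2.3.jar",                               # assumed path
)
table_env = StreamTableEnvironment.create(env)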
# How to run an airflow dag and/or task using analytics-privatedata on the CLI
# (a full example follows this list).
#
# - Make sure the airflow-dags repo is on the PYTHONPATH, so wmf_airflow_common can load.
#
# - Set AIRFLOW_HOME and HOME. This must be a directory that has airflow.cfg and
#   the airflow sqlite dbs, and those must be readable and writable by the
#   analytics-privatedata user.
#
# - Run bin/airflow out of an airflow conda env. /usr/lib/airflow is fine, but you
#   can also use your development conda env
#   (e.g. ~/.conda/envs/airflow_development/bin/airflow).
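# Putting the steps together (a hedged sketch: the paths, dag id, and task id
# below are assumptions, not the originals):
sudo -u analytics-privatedata env \
    PYTHONPATH=/srv/airflow-dags \
    AIRFLOW_HOME=/srv/airflow_home \
    HOME=/srv/airflow_home \
    /usr/lib/airflow/bin/airflow tasks test my_dag my_task 2022-01-01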
/**
 * Gets a {@link DataStreamSource} of {@link Row} for streamName that reads JSON events from Kafka.
 * If you need more control of your KafkaSource, use
 * {@link EventDataStreamFactory#kafkaSourceBuilder(String, String, String)} and then
 * call {@link StreamExecutionEnvironment#fromSource(Source, WatermarkStrategy, String)} yourself.
 *
 * Example:
 * <pre>{@code
 *     EventDataStreamFactory eventDataStreamFactory = EventDataStreamFactory.from(...);
 *     DataStreamSource<Row> eventStreamSource = eventDataStreamFactory.kafkaDataStream(
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)

// Write case classes (or Java POJOs) for every event schema out there.
// There are at least 117 of them!
case class Meta(
""" | |
Example pyspark code to integrate with Wikimedia Event Platform to automate | |
getting a Spark Structured Streaming DataFrame using event streams and event JSONSchemas. | |
See also: https://wikitech.wikimedia.org/wiki/Event_Platform | |
You'll need the following jar dependencies: | |
- Kafka Client: |
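# Below: a hedged sketch of the end result described above, a streaming
# DataFrame read from Kafka. The broker, topic, and jar provisioning are
# assumptions for illustration; the real integration derives the Spark schema
# from the stream's event JSONSchema rather than casting the value to a string.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("event-platform-streaming").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "eqiad.mediawiki.page-create")   # assumed topic
    .load()
    .selectExpr("CAST(value AS STRING) AS event_json")
)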
with DAG(
    'run_hql_and_archive',
    default_args=default_args,
    ...
) as dag:
    hql_runner_task_leaf = generate_hive_triggered_hql_runner_tasks(
        dag=dag,
        ...
# Test run the dag. This will wait for hdfs:///tmp/test_url_sensor.2022-01-01 to exist.
~/.conda/envs/airflow1/bin/airflow dags test test_url_sensor 2022-01-01

# In another terminal, create the file and watch the dag begin to run:
hdfs dfs -touchz /tmp/test_url_sensor.2022-01-01
#!/usr/bin/env python
"""
Imports and calls a python function.

call.py is a standalone python module and CLI, and should not import any
dependencies unless they are available in a standard python environment.
"""
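# Below: a hedged sketch of the import-and-call core described above, using
# only the standard library. The dotted-path convention ("pkg.module.func")
# is an assumption; the real call.py's CLI parsing and error handling are elided.
import importlib
import sys


def call(qualified_name, *args):
    """Import pkg.module from a dotted path like pkg.module.func and call func."""
    module_name, func_name = qualified_name.rsplit('.', 1)
    module = importlib.import_module(module_name)
    return getattr(module, func_name)(*args)


if __name__ == '__main__':
    # e.g. python call.py os.path.join a b
    print(call(sys.argv[1], *sys.argv[2:]))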
from wmf_airflow_common.operators import skein

# Example: submit the SparkPi example jar with spark-submit via a Skein application.
op1 = skein.SkeinSparkSubmitOperator(
    spark_submit_kwargs={
        'application': 'hdfs:///user/otto/spark-examples_2.11-2.4.4.jar',
        'spark_submit': '/usr/bin/spark2-submit',
        'master': 'yarn',
        'deploy_mode': 'client',
        'java_class': 'org.apache.spark.examples.SparkPi',
        'application_args': ['10'],