Modified repositories:
-
- Create a deployment bucket
  - For testing: s3://net-mozaws-prod-us-west-2-pipeline-analysis/amiyaguchi/databricks-poc/
- Deploy the telemetry-batch-view fat jar and the mozetl_runner.py script
  - The bucket should look like the following:

    ```
    $ aws s3 ls s3://net-mozaws-prod-us-west-2-pipeline-analysis/amiyaguchi/databricks-poc/
    2017-10-23 17:14:52         69 mozetl_runner.py
    2017-10-24 15:09:57  278415122 telemetry-batch-view-1.1.jar
    ```
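  - For reference, a minimal deployment sketch using boto3; the local artifact paths are assumptions about where the builds land:

    ```python
    import boto3

    # Upload the two artifacts to the test deployment bucket. The local
    # paths below are illustrative; adjust them to the actual build output.
    BUCKET = "net-mozaws-prod-us-west-2-pipeline-analysis"
    PREFIX = "amiyaguchi/databricks-poc/"

    s3 = boto3.client("s3")
    s3.upload_file("mozetl_runner.py", BUCKET, PREFIX + "mozetl_runner.py")
    s3.upload_file(
        "target/scala-2.11/telemetry-batch-view-1.1.jar",
        BUCKET,
        PREFIX + "telemetry-batch-view-1.1.jar",
    )
    ```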
- Checkout, build, and run the example_databricks dag (a sketch of a comparable dag follows below)
  - The DatabricksSubmitRunOperator reflects the Runs Submit API
  - The mozetl_task.json and tbv_task.json can be submitted to the /jobs/runs/submit API:

    ```
    curl -n \
      -X POST -H 'Content-Type: application/json' \
      -d @mozetl_task.json \
      https://dbc-caf9527b-e073.cloud.databricks.com/api/2.0/jobs/runs/submit
    ```
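  - A minimal sketch of what such a dag might look like; the cluster sizing, run names, and main class are assumptions, not the actual example_databricks contents:

    ```python
    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.databricks_operator import (
        DatabricksSubmitRunOperator,
    )

    S3_DEPLOY = ("s3://net-mozaws-prod-us-west-2-pipeline-analysis/"
                 "amiyaguchi/databricks-poc/")

    # Illustrative cluster spec; Runtime 3.3 ships Spark 2.2 on Scala 2.11.
    NEW_CLUSTER = {
        "spark_version": "3.3.x-scala2.11",
        "node_type_id": "c3.4xlarge",
        "num_workers": 2,
    }

    dag = DAG(
        "example_databricks",
        default_args={"owner": "airflow", "start_date": datetime(2017, 10, 1)},
        schedule_interval="@daily",
    )

    # Mirrors mozetl_task.json: run the deployed runner script.
    mozetl_task = DatabricksSubmitRunOperator(
        task_id="mozetl_task",
        dag=dag,
        json={
            "run_name": "mozetl example",
            "new_cluster": NEW_CLUSTER,
            "spark_python_task": {"python_file": S3_DEPLOY + "mozetl_runner.py"},
        },
    )

    # Mirrors tbv_task.json: attach the fat jar and run a view class
    # (the main class name here is hypothetical).
    tbv_task = DatabricksSubmitRunOperator(
        task_id="tbv_task",
        dag=dag,
        json={
            "run_name": "tbv example",
            "new_cluster": NEW_CLUSTER,
            "libraries": [{"jar": S3_DEPLOY + "telemetry-batch-view-1.1.jar"}],
            "spark_jar_task": {
                "main_class_name": "com.mozilla.telemetry.views.ExampleView"
            },
        },
    )
    ```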
- Note that this is configured with Databricks Runtime 3.3, with Spark 2.2 and Scala 2.11.
  - telemetry-batch-view requires Scala 2.11
  - spark_python_task requires a runtime image >= 3.0
- Python packages should be installed using the pypi library specification (see the fragment below)
  - Distributing the raw egg via python setup.py bdist_egg does not package required dependencies.
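  - For illustration, a libraries section in a runs/submit payload can pull packages from PyPI; the package names here are examples, not the actual dependency list:

    ```python
    # Fragment of a runs/submit payload: resolve dependencies from PyPI
    # rather than bundling them in the egg. Package names are examples.
    libraries = [
        {"pypi": {"package": "python_moztelemetry"}},
        {"pypi": {"package": "click"}},
        # A raw egg can still be attached, but it carries no dependencies:
        # {"egg": "s3://.../mozetl-0.1-py2.7.egg"},
    ]
    ```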
- The mozetl_task job will show up as failing in the Airflow UI despite the _SUCCESS indicator file being written out.
  - The error log is not very descriptive:

    ```
    An exception has occurred, use %tb to see the full traceback.
    SystemExit: 0
    ```
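  - The SystemExit: 0 suggests the entry point calls sys.exit(0) on success, which the Databricks runtime surfaces as an uncaught exception. A possible workaround in the runner script, assuming a click-style CLI (the mozetl.cli import is an assumption about the real script):

    ```python
    import sys

    # Hypothetical import: the real mozetl_runner.py presumably dispatches
    # into mozetl's command line interface.
    from mozetl.cli import main

    try:
        main()
    except SystemExit as exc:
        # click-style entry points call sys.exit(0) even on success, which
        # spark_python_task reports as "SystemExit: 0". Swallow the zero
        # exit code and re-raise real failures.
        if exc.code not in (0, None):
            sys.exit(exc.code)
    ```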