
@pauldx
Forked from acmiyaguchi/README.md
Created June 8, 2020 19:28
Databricks Airflow Workflow

Procedure

Modified repositories:

Deployment

  • Create a deployment bucket

    • For testing: s3://net-mozaws-prod-us-west-2-pipeline-analysis/amiyaguchi/databricks-poc/
  • Deploy the telemetry-batch-view fat jar and mozetl_runner.py script

    • The bucket should look like the following:
    $ aws s3 ls s3://net-mozaws-prod-us-west-2-pipeline-analysis/amiyaguchi/databricks-poc/
    2017-10-23 17:14:52         69 mozetl_runner.py
    2017-10-24 15:09:57  278415122 telemetry-batch-view-1.1.jar
    
  • Check out, build, and run the example_databricks DAG
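The example_databricks DAG itself is not reproduced in this gist. The following is a hedged sketch of what wiring the mozetl payload into DatabricksSubmitRunOperator could look like with the Airflow contrib operator of that era; the task_id and connection id are illustrative assumptions, while the payload mirrors mozetl_task.json.

```python
# Hypothetical sketch of an example_databricks DAG (not the original source).
# The run payload mirrors mozetl_task.json; task ids and conn ids are assumptions.
new_cluster = {
    "spark_version": "3.3.x-scala2.11",
    "node_type_id": "c3.2xlarge",
    "num_workers": 1,
}

mozetl_run = {
    "run_name": "mozetl task",
    "new_cluster": new_cluster,
    "libraries": [
        {"pypi": {"package": "git+https://github.com/acmiyaguchi/python_mozetl.git@databricks-poc"}}
    ],
    "spark_python_task": {
        "python_file": "s3://net-mozaws-prod-us-west-2-pipeline-analysis/amiyaguchi/databricks-poc/mozetl_runner.py",
        # ds_nodash lets Airflow template the execution date into the job args
        "parameters": ["example_python", "--date", "{{ ds_nodash }}", "--sample-id", "60"],
    },
}

try:
    # Airflow 1.x-era import path; requires apache-airflow with databricks support.
    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator

    dag = DAG("example_databricks", start_date=datetime(2017, 10, 1), schedule_interval="@daily")
    mozetl_task = DatabricksSubmitRunOperator(
        task_id="mozetl_example",
        databricks_conn_id="databricks_default",
        json=mozetl_run,
        dag=dag,
    )
except ImportError:
    pass  # Airflow not installed here; the payload above still documents the shape.
```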

Caveats

  • The DatabricksSubmitRunOperator mirrors the Runs Submit API
    • The mozetl_task.json and tbv_task.json payloads can be submitted directly to the /jobs/runs/submit endpoint:
    curl -n \
    -X POST -H 'Content-Type: application/json' \
    -d @mozetl_task.json \
    https://dbc-caf9527b-e073.cloud.databricks.com/api/2.0/jobs/runs/submit
  • Note that this is configured with Databricks Runtime 3.3, with Spark 2.2 and Scala 2.11.
    • telemetry-batch-view requires Scala 2.11
    • spark_python_task requires a runtime image >= 3.0
  • Python packages should be installed through the pypi library specification
    • Distributing a raw egg built via python setup.py bdist_egg does not package the required dependencies.
  • The mozetl_task job shows up as failed in the Airflow UI even though the _SUCCESS indicator file is written out.
    • The error log is not very descriptive
    An exception has occurred, use %tb to see the full traceback.
    
    SystemExit: 0
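Assuming the SystemExit: 0 comes from the click-based CLI exiting normally (click raises SystemExit even on success, which the Databricks runtime then reports as a failure), a more forgiving runner could swallow the successful exit and only propagate nonzero codes. This is a sketch, not the gist's actual runner:

```python
import sys

def call_entry_point(entry_point, argv):
    """Invoke a click-style CLI, treating SystemExit(0) as success."""
    sys.argv = ["mozetl_runner.py"] + list(argv)
    try:
        entry_point()
    except SystemExit as exc:
        if exc.code not in (None, 0):
            raise  # real failure: propagate the nonzero exit code
    return 0
```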
    

References

mozetl_runner.py:

    from mozetl import cli
    cli.entry_point(auto_envvar_prefix="MOZETL")
mozetl_task.json:

    {
      "run_name": "mozetl task",
      "new_cluster": {
        "spark_version": "3.3.x-scala2.11",
        "node_type_id": "c3.2xlarge",
        "aws_attributes": {
          "availability": "ON_DEMAND",
          "instance_profile_arn": "arn:aws:iam::144996185633:instance-profile/databricks-ec2"
        },
        "num_workers": 1
      },
      "libraries": [
        {
          "pypi": {
            "package": "git+https://github.com/acmiyaguchi/python_mozetl.git@databricks-poc"
          }
        }
      ],
      "timeout_seconds": 3600,
      "spark_python_task": {
        "python_file": "s3://net-mozaws-prod-us-west-2-pipeline-analysis/amiyaguchi/databricks-poc/mozetl_runner.py",
        "parameters": [
          "example_python",
          "--date", "20171003",
          "--sample-id", "60",
          "--bucket", "mozilla-databricks-telemetry-test"
        ]
      }
    }
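The same submission can be made from Python with only the standard library; this is a hedged sketch of building the request the earlier curl command sends. Bearer-token auth is an assumption here (curl -n reads credentials from ~/.netrc instead):

```python
import json

def build_submit_request(task, host, token):
    """Return (url, headers, body) for a POST to the Runs Submit API."""
    url = host.rstrip("/") + "/api/2.0/jobs/runs/submit"
    headers = {
        "Authorization": "Bearer " + token,
        "Content-Type": "application/json",
    }
    return url, headers, json.dumps(task)
```

The triple can then be handed to any HTTP client, e.g. requests.post(url, headers=headers, data=body).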
tbv_task.json:

    {
      "run_name": "tbv task",
      "new_cluster": {
        "spark_version": "3.3.x-scala2.11",
        "node_type_id": "c3.2xlarge",
        "aws_attributes": {
          "availability": "ON_DEMAND",
          "instance_profile_arn": "arn:aws:iam::144996185633:instance-profile/databricks-ec2"
        },
        "num_workers": 1
      },
      "libraries": [
        {
          "jar": "s3://net-mozaws-prod-us-west-2-pipeline-analysis/amiyaguchi/databricks-poc/telemetry-batch-view-1.1.jar"
        }
      ],
      "timeout_seconds": 3600,
      "spark_jar_task": {
        "main_class_name": "com.mozilla.telemetry.views.ExampleView",
        "parameters": [
          "--date", "20171002",
          "--sample_id", "59",
          "--bucket", "mozilla-databricks-telemetry-test"
        ]
      }
    }
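After a run is submitted, its progress can be polled through the Runs Get endpoint (/api/2.0/jobs/runs/get), which is roughly what DatabricksSubmitRunOperator does before marking the Airflow task done. A sketch of interpreting the state object, assuming the standard life_cycle_state / result_state fields:

```python
# Terminal life-cycle states reported by the Jobs API.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}

def run_finished(state):
    """True once the run has reached a terminal life-cycle state."""
    return state.get("life_cycle_state") in TERMINAL_STATES

def run_succeeded(state):
    """True only for a terminated run whose result_state is SUCCESS."""
    return run_finished(state) and state.get("result_state") == "SUCCESS"
```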