Adding an extra package to a Python Dataflow project to run on GCP

The Problem

The documentation for how to deploy a pipeline with extra, non-PyPI, pure-Python packages on GCP is missing some detail. This gist shows how to package and deploy an external, pure-Python, non-PyPI dependency to a managed Dataflow pipeline on GCP.

TL;DR: Your external package needs to be a proper Python (source or binary) distribution, packaged and shipped alongside your pipeline. It is not enough to point the pipeline at a tar file that merely contains a setup.py.

Preparing the External Package

Your external package must have a proper setup.py. What follows is an example setup.py for our ETL package. It packages version 1.1.1 of the etl library. The library requires three PyPI packages to run, specified in the install_requires field. The package also ships with custom external JSON data, declared in the package_data section. Finally, the setuptools.find_packages function searches for all available packages and returns that list:

# ETL's setup.py
from setuptools import setup, find_packages

setup(
    name='etl',
    version='1.1.1',
    # Runtime dependencies, pulled from PyPI when the package is installed.
    install_requires=[
        'nose==1.3.7',
        'datadiff==2.0.0',
        'unicodecsv==0.14.1'
    ],
    description='ETL tools for API v2',
    # Automatically discover every package under this directory.
    packages=find_packages(),
    # Ship the JSON data files that live inside the etl.lib package.
    package_data={
        'etl.lib': ['*.json']
    }
)

Otherwise, there is nothing special about this setup.py file.
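Declaring package_data only ensures the JSON files are bundled with the distribution; the library still has to load them relative to the installed package rather than from a hard-coded path. A minimal sketch of that lookup, assuming a hypothetical etl/lib/mappings.json data file:

# etl/lib/loader.py (illustrative; mappings.json is a placeholder file name)
import json
import pkg_resources

def load_mappings():
    # Read the bundled data file from wherever the etl.lib package is installed.
    raw = pkg_resources.resource_string('etl.lib', 'mappings.json')
    return json.loads(raw.decode('utf-8'))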

Building the External Package

You need to create a real source (or binary) distribution of your external package. To build a source distribution, run the following in your external package's directory:

python setup.py sdist --formats=gztar

If it runs successfully, the last few lines of output will look like this:

hard linking etl/transform/user.py -> etl-1.1.1/etl/transform
Writing etl-1.1.1/setup.cfg
Creating tar archive
removing 'etl-1.1.1' (and everything under it)

The result is a source distribution of your package, suitable for inclusion in the pipeline project. Look in your ./dist directory for the file:

14:55 $ ls dist
etl-1.1.1.tar.gz
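Before moving on, it can be worth confirming that the JSON data files declared in package_data actually made it into the archive. A quick check using only the standard library (the archive name matches the version built above):

# check_sdist.py -- list the sdist contents and flag the bundled JSON files
import tarfile

with tarfile.open('dist/etl-1.1.1.tar.gz', 'r:gz') as archive:
    names = archive.getnames()

print('\n'.join(n for n in names if n.endswith('.json')) or 'no JSON files found')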

Preparing the Pipeline Project

  • Create a dist/ (or similar extra-packages) directory in the pipeline project and place the file you built in the previous step there:
cd pipeline-project
mkdir dist/
cp ~/etl-project/dist/etl-1.1.1.tar.gz dist/
  • Let the pipeline know you intend to include this package by using the --extra_package command-line argument (a programmatic equivalent is sketched after the command):
18:38 $ python dataflow_main.py \
    --input=staging \
    --output=bigquery-staging \
    --runner=DataflowPipelineRunner \
    --project=realmassive-staging \
    --job_name dataflow-project-1 \
    --setup_file ./setup.py \
    --staging_location gs://dataflow/staging \
    --temp_location gs://dataflow/temp \
    --requirements_file requirements.txt \
    --extra_package dist/etl-1.1.1.tar.gz
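The same options can also be set in code when constructing the pipeline. Below is a minimal sketch assuming a newer Apache Beam SDK, where the runner is named DataflowRunner and PipelineOptions lives under apache_beam.options; the project, buckets, and file names are the same placeholders used in the command above:

# dataflow_main.py (sketch) -- pass the flags programmatically instead of on the CLI
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',          # newer name for the managed Dataflow runner
    '--project=realmassive-staging',
    '--job_name=dataflow-project-1',
    '--staging_location=gs://dataflow/staging',
    '--temp_location=gs://dataflow/temp',
    '--setup_file=./setup.py',
    '--requirements_file=requirements.txt',
    '--extra_package=dist/etl-1.1.1.tar.gz',  # ships the etl sdist to the workers
])

with beam.Pipeline(options=options) as pipeline:
    # Build your transforms here; imports from the etl package resolve on the
    # workers because the sdist was staged via --extra_package.
    pass

Note the division of labor: --setup_file packages the pipeline project itself, while --extra_package stages additional local distributions such as etl-1.1.1.tar.gz.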
@elitongadotti

What about when the Dataflow pipeline is a package itself, so I need to reference one package from another?
I've tried every option I could, but no success so far 😢
