The documentation on deploying a pipeline with extra, non-PyPI, pure-Python packages on GCP is missing some detail. This gist shows how to package and deploy an external pure-Python, non-PyPI dependency to a managed Dataflow pipeline on GCP.
TL;DR: Your external package needs to be a Python (source/binary) distribution, properly packaged and shipped alongside your pipeline. It is not enough to only specify a tar file with a setup.py.
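To make the difference between "a tar file with a setup.py" and a properly built source distribution concrete, here is a minimal sketch of a sanity check. It assumes the usual sdist layout, where everything sits under a single top-level directory that also contains a PKG-INFO metadata file. The function name looks_like_sdist is hypothetical, and this is only a heuristic, not the check that Dataflow itself performs.

```python
import tarfile


def looks_like_sdist(tar_path):
    """Heuristic check (an assumption, not Dataflow's own validation).

    Returns True if the archive has exactly one top-level directory
    and that directory contains a PKG-INFO file, as a built sdist does.
    """
    with tarfile.open(tar_path, "r:*") as archive:
        names = archive.getnames()

    # Normalise a leading "./" and drop empty or "." entries.
    names = [n[2:] if n.startswith("./") else n for n in names]
    names = [n for n in names if n not in ("", ".")]

    # A built sdist unpacks into a single "<name>-<version>/" directory.
    top_level = {name.split("/", 1)[0] for name in names}
    if len(top_level) != 1:
        return False

    root = next(iter(top_level))
    return f"{root}/PKG-INFO" in names
```

A hand-rolled tarball of the source tree would typically fail this check, while an archive produced by a real build step (for example python setup.py sdist, which for the package described here would be named along the lines of etl-1.1.1.tar.gz) should pass.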
Your external package must have a proper setup.py. What follows is an example setup.py for our ETL package. It is used to package version 1.1.1 of the etl library. The library requires three PyPI packages to run; these are specified in the install_requires field. The package also ships with custom external JSON data, declared in the package_data section. Last, the setuptools.find_packages
function searches for all available packages and returns that