The documentation for how to deploy a pipeline with extra, non-PyPI, pure-Python packages on GCP is missing some detail. This gist shows how to package and deploy an external pure-Python, non-PyPI dependency to a managed Dataflow pipeline on GCP.
TL;DR: Your external package needs to be a proper Python (source or binary) distribution, packaged and shipped alongside your pipeline. It is not enough to just point at a tar file that happens to contain a setup.py.
Your external package must have a proper setup.py
. What follows is an example setup.py
for our ETL
package. It packages version 1.1.1 of the etl library. The library requires three ordinary PyPI packages to run, which are specified in the install_requires
field. The package also ships with custom external JSON data, declared in the package_data
section. Finally, the setuptools.find_packages
function searches for all available packages and returns them as a list:
# ETL's setup.py
from setuptools import setup, find_packages

setup(
    name='etl',
    version='1.1.1',
    install_requires=[
        'nose==1.3.7',
        'datadiff==2.0.0',
        'unicodecsv==0.14.1'
    ],
    description='ETL tools for API v2',
    packages=find_packages(),
    package_data={
        'etl.lib': ['*.json']
    }
)
Otherwise, there is nothing special about this setup.py file.
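One note on the package_data entry: because the JSON files travel inside the distribution, code running on the Dataflow workers should read them through the installed package rather than from a source-checkout path. Here is a minimal sketch, assuming a hypothetical lookup.json file inside etl/lib:
# Sketch only: 'lookup.json' is a hypothetical file shipped via package_data.
import json
import pkg_resources

def load_lookup_table():
    # Reads the JSON bundled with the installed 'etl' distribution, which
    # works on Dataflow workers where there is no source tree on disk.
    raw = pkg_resources.resource_string('etl.lib', 'lookup.json')
    return json.loads(raw.decode('utf-8'))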
You need to create a real source (or binary) distribution of your external package. To do so, run the following in your external package's directory:
python setup.py sdist --formats=gztar
If it works, the last few lines of the output will look like this:
hard linking etl/transform/user.py -> etl-1.1.1/etl/transform
Writing etl-1.1.1/setup.cfg
Creating tar archive
removing 'etl-1.1.1' (and everything under it)
The output of this command, if it runs successfully, is a source distribution of your package, suitable for inclusion in the pipeline project. Look in your ./dist
directory for the file:
14:55 $ ls dist
etl-1.1.1.tar.gz
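If you want to confirm what actually made it into the archive (in particular, that the package_data JSON files were picked up), here is a quick sketch using only the standard library:
# Sketch: list the contents of the source distribution to confirm that
# setup.py and the package_data JSON files were included.
import tarfile

with tarfile.open('dist/etl-1.1.1.tar.gz', 'r:gz') as archive:
    for name in archive.getnames():
        print(name)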
- Create a dist
(or similar extra-packages) directory in your pipeline project, in which you will place the file you just created in the previous step:
cd pipeline-project
mkdir dist/
cp ~/etl-project/dist/etl-1.1.1.tar.gz dist/
- Let the pipeline know you intend to include this package by using the
--extra_package
command-line argument:
18:38 $ python dataflow_main.py \
    --input=staging \
    --output=bigquery-staging \
    --runner=DataflowPipelineRunner \
    --project=realmassive-staging \
    --job_name dataflow-project-1 \
    --setup_file ./setup.py \
    --staging_location gs://dataflow/staging \
    --temp_location gs://dataflow/temp \
    --requirements_file requirements.txt \
    --extra_package dist/etl-1.1.1.tar.gz
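The same options can also be set from code rather than on the command line. The sketch below assumes the Apache Beam Python SDK (not the older google.cloud.dataflow SDK that DataflowPipelineRunner comes from), so treat the option class and attribute names as assumptions to verify against your SDK version:
# Sketch: attaching the extra package programmatically, assuming the
# Apache Beam Python SDK's PipelineOptions classes.
import apache_beam as beam
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, SetupOptions, StandardOptions)

options = PipelineOptions()
options.view_as(StandardOptions).runner = 'DataflowRunner'

gcp_options = options.view_as(GoogleCloudOptions)
gcp_options.project = 'realmassive-staging'
gcp_options.job_name = 'dataflow-project-1'
gcp_options.staging_location = 'gs://dataflow/staging'
gcp_options.temp_location = 'gs://dataflow/temp'

setup_options = options.view_as(SetupOptions)
setup_options.setup_file = './setup.py'
setup_options.requirements_file = 'requirements.txt'
# The sdist built earlier is shipped to the workers as an extra package.
setup_options.extra_packages = ['dist/etl-1.1.1.tar.gz']

pipeline = beam.Pipeline(options=options)
# ... build and run the pipeline as usual ...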
I had a more specific need: I needed to be able to install mysql_client on the worker boxes as well. So, rather than packaging up my external package ahead of time as described above, I ended up running my requirements as root commands and packaging my directory accordingly. In fact, I can just specify the path (using a symbolic link) and let Dataflow process it properly.
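One common way to run such root-level commands is to hook custom setuptools commands into the build step, so they execute on each Dataflow worker while the package is installed. The following is only a sketch of that pattern; the package name and the exact apt-get commands are placeholders, not the real ones used here:
# Sketch: custom setuptools commands that run system-level installs on
# each Dataflow worker. The apt-get packages below are placeholders.
import subprocess
from distutils.command.build import build as _build
from setuptools import setup, find_packages, Command

CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'libmysqlclient-dev'],
]

class CustomCommands(Command):
    """Runs the commands above as part of the build on the worker."""
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)

class build(_build):
    """Extend the standard build step to also run CustomCommands."""
    sub_commands = _build.sub_commands + [('CustomCommands', None)]

setup(
    name='pipeline-with-mysql',  # placeholder name
    version='0.0.1',
    packages=find_packages(),
    cmdclass={
        'build': build,
        'CustomCommands': CustomCommands,
    },
)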
Here's my directory structure:
Dataflow
-- Core
-- owlet
owlet
is my pipeline codebase; it is a relative symbolic link, since all my code lives in the same directory. Here's my setup.py: