Skip to content

Latest commit

 

History

History
 
 

dataflow-python

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 

Dataflow Python builder

Introduction

Google Cloud Dataflow, based on Apache Beam, is a fully-managed service for transforming and enriching data in stream (real time) and batch (historical) modes with equal reliability and expressiveness. Developers and Data Scientists use Dataflow to process large amounts of data without managing complex cluster infrastructure.

Google Cloud Build offers a number of advantages for Cloud Dataflow developers:

  • Small workloads which run in a n1-standard-1 virtual machine can take advantage of the free tier, which provides 120 free build-minutes per day
  • Workflows start very quickly, typically within a few seconds (depending on the size of your container)
  • Pipelines get all the benefits of containerization, including a consistent environment and integration with your CI/CD flow
  • Cloud Build supports automatic triggering from Github, Bitbucket and Google Cloud Source Repositories, so you can configure your data warehouse to automatically update when the pipeline code changes
  • Pipelines can be initiated by a simple REST API.

The builder supports both Dataflow execution modes:

  • DirectRunner runs your code in-process inside Cloud Build, taking advantage of the fast start and free tier pricing
  • DataflowRunner starts workers on Compute Engine, allowing for massive scalability.

This builder supports the Cloud Dataflow Python API.

Usage

If this is your first time using Cloud Build, follow the Quickstart for Docker to get started.

Then, clone this code and build the builder:

gcloud builds submit --config=cloudbuild.yaml .

To access resources on Google Compute Platform from your pipeline - whether Cloud Storage, BigQuery datasets, or Dataflow runners - issue the following commands to permission your Cloud Build service account:

# Setup IAM bindings
export PROJECT=$(gcloud info --format='value(config.project)')
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT --format 'value(projectNumber)')
export CB_SA_EMAIL=$PROJECT_NUMBER@cloudbuild.gserviceaccount.com
gcloud projects add-iam-policy-binding $PROJECT --member=serviceAccount:$CB_SA_EMAIL --role='roles/iam.serviceAccountUser' --role='roles/iam.serviceAccountActor' --role='roles/dataflow.admin’
# Enable Dataflow API
gcloud services enable dataflow.googleapis.com
# Setup GCS bucket
gsutil mb gs://cloudbuild-dataflow-$PROJECT

Python notes

Python has several different dependency management tools, which interact in different ways with containers. In this case we use virtualenv to setup an isolated folder inside the container with the libraries we need. As a result of this, be sure that your first build step loads the virtualenv environment. See examples for details.

Additional libraries can be added by creating another container based on this one, for example:

FROM gcr.io/my-project/dataflow-python

RUN /bin/bash -c "source venv/bin/activate"

RUN pip install my-library
...

This container uses Python 2. Python 3 is currently under development: see BEAM-1251.

Examples

For examples, see the examples directory.