Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. This guide walks you through the steps needed to run an Apache Beam pipeline on the Google Cloud Dataflow runner.
The following instructions help you prepare your Google Cloud project.
- Install the Cloud SDK.

  ℹ️ This is not required in Cloud Shell, since Cloud Shell already has the Cloud SDK pre-installed.
- Create a new Google Cloud project in the Google Cloud Console and save the project ID in an environment variable.

  ```sh
  # Save your project ID in an environment variable for ease of use later on.
  export PROJECT=your-google-cloud-project-id
  ```
- Set up the Cloud SDK for your GCP project.

  ```sh
  gcloud init
  ```
- Enable the Dataflow API.
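If you prefer the command line over the Cloud Console, the API can also be enabled with `gcloud` (a sketch; it assumes the Cloud SDK is already initialized against your project and that billing is enabled):

  ```sh
  # Enable the Dataflow API for the currently configured project.
  gcloud services enable dataflow.googleapis.com
  ```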
- Authenticate to your Google Cloud project.

  ```sh
  gcloud auth application-default login
  ```

  ℹ️ For more information on authentication, see the Authentication overview page.
To learn more about the permissions needed for Dataflow, see the Dataflow security and permissions page.
For instructions on how to install Python, virtualenv, and the Cloud SDK, see the Setting up a Python development environment guide.