Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. This guide walks you through all the steps needed to run an Apache Beam pipeline on the Google Cloud Dataflow runner.
The following instructions help you prepare your Google Cloud project.
-
Install the Cloud SDK.
ℹ️ This is not required in Cloud Shell since it already has the Cloud SDK pre-installed.
-
Create a new Google Cloud project via the New Project page, or via the gcloud command-line tool.
export PROJECT=your-google-cloud-project-id
gcloud projects create $PROJECT
-
Set up the Cloud SDK for your Google Cloud project.
gcloud init
-
Enable the following APIs: Dataflow, Compute Engine, Cloud Storage, Cloud Storage JSON, Stackdriver Logging, Cloud Resource Manager, and IAM.
-
Create a service account JSON key via the Create service account key page, or via the gcloud command-line tool.
export PROJECT=$(gcloud config get-value project)
export SA_NAME=samples
export IAM_ACCOUNT=$SA_NAME@$PROJECT.iam.gserviceaccount.com

# Create the service account.
gcloud iam service-accounts create $SA_NAME --display-name $SA_NAME

# Set the role to Project Owner (*).
gcloud projects add-iam-policy-binding $PROJECT \
  --member serviceAccount:$IAM_ACCOUNT \
  --role roles/owner

# Create a JSON file with the service account credentials.
export GOOGLE_APPLICATION_CREDENTIALS=path/to/your/credentials.json
gcloud iam service-accounts keys create $GOOGLE_APPLICATION_CREDENTIALS \
  --iam-account=$IAM_ACCOUNT
ℹ️ The Role field authorizes your service account to access resources. You can view and change this field later by using the GCP Console IAM page. If you are developing a production app, specify more granular permissions than roles/owner. To learn more about roles in service accounts, see Granting roles to service accounts.
To learn more about service accounts, see Creating and managing service accounts.
-
Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your service account key file.
export GOOGLE_APPLICATION_CREDENTIALS=path/to/your/credentials.json
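The API-enabling step above lists the APIs by display name only. As a rough sketch, they can also be enabled from the command line with gcloud services enable. The service IDs below are assumptions mapped from those display names (for example, Stackdriver Logging is assumed to be logging.googleapis.com), so confirm them with gcloud services list --available before relying on them. The snippet assumes a POSIX shell such as bash, where the unquoted $APIS expands to one argument per service.

```shell
# Assumed service IDs for the APIs listed above; verify them with
# `gcloud services list --available` for your project.
APIS="dataflow.googleapis.com compute.googleapis.com storage-component.googleapis.com storage-api.googleapis.com logging.googleapis.com cloudresourcemanager.googleapis.com iam.googleapis.com"

# Enable the APIs when gcloud is installed; otherwise only print the
# command so it can be reviewed and run later.
if command -v gcloud >/dev/null 2>&1; then
  gcloud services enable $APIS
else
  echo "gcloud services enable $APIS"
fi
```

The command acts on the project currently configured in the Cloud SDK, so run it after gcloud init.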
The following instructions help you prepare your development environment.
-
Download and install the Java Development Kit. Verify that the JAVA_HOME environment variable is set and points to your JDK installation.
$JAVA_HOME/bin/java --version
-
Download and install Apache Maven by following the Maven installation guide for your specific operating system.
mvn --version
-
(Optional) Set up an IDE such as IntelliJ, VS Code, Eclipse, or NetBeans.
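The checks in the steps above can be combined into one small script that reports what is missing without installing or changing anything. This is only a sketch: has_jdk and has_cmd are hypothetical helper names introduced here for illustration, not part of the JDK, Maven, or the Cloud SDK.

```shell
# Hypothetical helper: succeeds only when JAVA_HOME is set and contains
# an executable bin/java.
has_jdk() {
  [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]
}

# Hypothetical helper: succeeds only when the named command is on PATH.
has_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if has_jdk; then
  echo "JDK found at $JAVA_HOME"
else
  echo "JAVA_HOME is unset or does not point to a JDK"
fi

if has_cmd mvn; then
  echo "mvn found"
else
  echo "mvn is not on PATH"
fi
```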
The easiest way to create a new Apache Beam pipeline is through the starter Maven archetype.
export NAME=your-pipeline-name
export PACKAGE=org.apache.beam.samples
export JAVA_VERSION=11
# This creates a new directory with the pipeline's code within it.
mvn archetype:generate \
-DarchetypeGroupId=org.apache.beam \
-DarchetypeArtifactId=beam-sdks-java-maven-archetypes-starter \
-DtargetPlatform=$JAVA_VERSION \
-DartifactId=$NAME \
-DgroupId=$PACKAGE \
-DinteractiveMode=false
# Navigate to the pipeline contents.
cd $NAME
Make sure you have the latest plugin and dependency versions, and update your pom.xml file accordingly.
# Check your plugin versions.
mvn versions:display-plugin-updates
# Check your dependency versions.
mvn versions:display-dependency-updates
Finally, add the runners and I/O transforms you need to your pom.xml file.