Cloud Dataflow is a managed service for executing a wide variety of data processing patterns.
This documentation shows you how to deploy a batch pipeline using Cloud Dataflow in Java. The goal is to create a pipeline that loads data from a CSV file into a BigQuery table.
- Eclipse 4.6+
- JDK 1.8+
- GCP account
- Cloud Storage (GCS) and BigQuery APIs enabled
- Service account
Follow this page to install the Cloud SDK:
https://cloud.google.com/sdk/docs/quickstart-windows
Install Cloud Tools for Eclipse by following this page:
https://cloud.google.com/eclipse/docs/quickstart
Create a new project through the New Project wizard.
Select the Google Cloud Dataflow Java Project wizard. Click Next to continue.
Enter the details for this project:
Click Next, then set up the account details:
Click Finish to complete the wizard.
Run Maven Install to install the dependencies. You can do this through Run Configurations or the Maven command-line interface.
Upload the CSV file to a Cloud Storage bucket.
Refer to the code in the CsvToBQPipeline class and create a new class based on it.
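The original CsvToBQPipeline class is not reproduced here, but its core step is a per-line transform: a Beam pipeline reads lines with TextIO.read(), converts each CSV line into a row inside a ParDo, and writes the rows with BigQueryIO.writeTableRows(). The sketch below shows only that line-to-row conversion in plain Java (the field names and sample data are hypothetical; in the real pipeline the map would be a com.google.api.services.bigquery.model.TableRow):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the per-line transform a CsvToBQPipeline-style class would run
// inside a Beam ParDo: each CSV line becomes one BigQuery row.
public class CsvLineToRow {

    // Convert one CSV line into an ordered column-name -> value map,
    // the same shape a BigQuery TableRow carries.
    public static Map<String, String> toRow(String line, String[] fields) {
        String[] values = line.split(",", -1); // -1 keeps trailing empty columns
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < fields.length; i++) {
            row.put(fields[i], i < values.length ? values[i].trim() : "");
        }
        return row;
    }

    public static void main(String[] args) {
        String[] fields = {"id", "name", "city"}; // assumed schema for illustration
        Map<String, String> row = toRow("1,Alice,Paris", fields);
        System.out.println(row); // {id=1, name=Alice, city=Paris}
    }
}
```

Note that a plain split on commas does not handle quoted fields containing commas; a production pipeline would use a proper CSV parser.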
Create a run configuration for the Dataflow pipeline:
On the Pipeline Arguments tab, choose DirectRunner to run the job on your local machine.
Check "Use default Dataflow options" and then run the code.
You can also run the pipeline with DataflowRunner (set on the Pipeline Arguments tab). The job is then submitted to Dataflow in GCP.
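The same options you set on the Pipeline Arguments tab can be passed as --name=value flags on the command line, which Beam's PipelineOptionsFactory.fromArgs(...) parses into pipeline options. The tiny parser below mimics that flag format so you can see the typical DataflowRunner arguments; the project ID, bucket, and region are placeholders:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of --name=value flag parsing, mimicking the format
// Beam's PipelineOptionsFactory.fromArgs(...) accepts.
public class PipelineArgs {

    public static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new LinkedHashMap<>();
        for (String arg : args) {
            if (!arg.startsWith("--") || !arg.contains("=")) {
                throw new IllegalArgumentException("Expected --name=value, got: " + arg);
            }
            int eq = arg.indexOf('=');
            opts.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return opts;
    }

    public static void main(String[] args) {
        String[] dataflowArgs = {
            "--runner=DataflowRunner",                  // use DirectRunner for local runs
            "--project=my-project-id",                  // placeholder project ID
            "--stagingLocation=gs://my-bucket/staging", // placeholder bucket
            "--tempLocation=gs://my-bucket/temp",
            "--region=us-central1"                      // placeholder region
        };
        System.out.println(parse(dataflowArgs).get("runner")); // DataflowRunner
    }
}
```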
Once the data is loaded, you can inspect it with the following query:
SELECT * FROM `project-id.dataset.table-name` LIMIT 1000
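If you later run this preview query from Java (for example through the BigQuery client library), Standard SQL expects the fully qualified `project.dataset.table` form shown above. A small helper can build that string; the identifiers here are placeholders for your own:

```java
// Builds a Standard SQL preview query against a fully qualified table.
public class TableQuery {

    public static String previewQuery(String project, String dataset, String table, int limit) {
        // Backticks quote the fully qualified `project.dataset.table` identifier.
        return String.format("SELECT * FROM `%s.%s.%s` LIMIT %d", project, dataset, table, limit);
    }

    public static void main(String[] args) {
        System.out.println(previewQuery("project-id", "dataset", "table-name", 1000));
        // SELECT * FROM `project-id.dataset.table-name` LIMIT 1000
    }
}
```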
This demonstrates using a fully managed, reliable service to transform data in batch mode. We can build a pipeline like this to process historical data generated by various applications.
Once the data is ingested into BigQuery, it can be used for further analysis, visualization, training ML models, and more.