Loading data from CSV in Cloud Storage to BigQuery using Dataflow

sanveer-osahan/GoogleCloudStorage_to_BigQuery

Google Cloud Storage to BigQuery using Dataflow

Intro

Cloud Dataflow is a managed service for executing a wide variety of data processing patterns.

This guide shows how to deploy a batch pipeline on Cloud Dataflow using Java. The goal is to build a pipeline that loads data from a CSV file in Cloud Storage into a BigQuery table.

Prerequisites

Detailed steps

Install Cloud SDK

Follow the quickstart below to install the Cloud SDK (the link points to the Windows guide):

https://cloud.google.com/sdk/docs/quickstart-windows

Install Cloud Tools for Eclipse

Install Cloud Tools for Eclipse by following this page:

https://cloud.google.com/eclipse/docs/quickstart

Create a Dataflow project

In Eclipse, create a new project through the New Project wizard.

Select the Google Cloud Dataflow Java Project wizard, then click Next to continue.

[Screenshot: NewDataflowProject]

Enter the details for this project:

[Screenshot: ProjectDetails]

Click Next, then set up the account details:

[Screenshot: AccountDetails]

Click Finish to complete the wizard.

Build the project

Run Maven Install to download and install the project dependencies. You can do this from an Eclipse run configuration or from the Maven command line.

Create a pipeline to load CSV file in GCS to BigQuery

Upload the CSV file to a Cloud Storage bucket.

Refer to the code in the class CsvToBQPipeline and create a new pipeline class based on it.
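The CsvToBQPipeline class itself is not reproduced in this README, so the following is only a minimal sketch of the per-line step such a pipeline typically needs: turning one CSV line into a row of named fields. It uses plain Java with no Beam dependency. The class name CsvRowSketch, its methods, and the column names are hypothetical; in a real pipeline this logic would usually sit inside a transform that emits BigQuery rows.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: not the repository's CsvToBQPipeline. All names here are hypothetical.
class CsvRowSketch {

    // Splits one CSV line into fields. Commas inside double-quoted fields are kept,
    // and "" inside a quoted field is read as a literal quote character.
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // escaped quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;    // closing quote
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;         // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());  // last field has no trailing comma
        return fields;
    }

    // Pairs each (trimmed) field with its column name, preserving column order.
    static Map<String, String> toRow(String[] columns, String line) {
        List<String> fields = splitCsvLine(line);
        if (fields.size() != columns.length) {
            throw new IllegalArgumentException(
                "Expected " + columns.length + " fields but found " + fields.size());
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            row.put(columns[i], fields.get(i).trim());
        }
        return row;
    }

    public static void main(String[] args) {
        String[] columns = {"id", "name", "city"}; // hypothetical schema
        System.out.println(toRow(columns, "1,\"Doe, Jane\",Pune"));
    }
}
```

Rejecting lines whose field count does not match the schema is one possible design choice; a pipeline could instead route such lines to a separate output for inspection.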

Create a run configuration for the Dataflow pipeline:

[Screenshot: RunConfiguration]

On the Pipeline Arguments tab, choose DirectRunner to run the job on your local machine.

[Screenshot: PipelineArgumentsTab]

Select the Use default Dataflow options checkbox, then run the pipeline.

View the job in Console

You can also run the pipeline with DataflowRunner (set on the Pipeline Arguments tab). The job is then submitted to Dataflow on GCP, where you can monitor it in the Console.

[Screenshot: ConsoleJob]
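For readers running the pipeline outside Eclipse, the runner choice made on the Pipeline Arguments tab corresponds to command-line arguments of the form --name=value passed to the pipeline's main method. The sketch below only assembles such argument lists in plain Java. The class PipelineArgsSketch and the project, region, and bucket values are hypothetical, and the exact set of options a job requires may differ with the SDK version, so treat it as an illustration rather than a definitive list.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: builds --name=value arguments similar to what the
// Pipeline Arguments tab supplies. All concrete values are placeholders.
class PipelineArgsSketch {

    // Arguments for a local run.
    static List<String> directRunnerArgs() {
        List<String> args = new ArrayList<>();
        args.add("--runner=DirectRunner");
        return args;
    }

    // Arguments for submitting the job to the Dataflow service.
    static List<String> dataflowRunnerArgs(String projectId, String region, String bucket) {
        List<String> args = new ArrayList<>();
        args.add("--runner=DataflowRunner");
        args.add("--project=" + projectId);
        args.add("--region=" + region);
        // Temporary files are written to Cloud Storage; the "temp" folder name is an assumption.
        args.add("--tempLocation=gs://" + bucket + "/temp");
        return args;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", directRunnerArgs()));
        System.out.println(String.join(" ",
            dataflowRunnerArgs("my-project-id", "us-central1", "my-bucket")));
    }
}
```

A pipeline's main method would typically hand such an array to its options parser before constructing the pipeline, so switching between a local run and a Dataflow run needs no code change.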

Verify the result in BigQuery

Once the data is loaded, you can run the following query to inspect it:

SELECT * FROM `project-id.dataset.table-name` LIMIT 1000

Conclusion

This walkthrough demonstrates using a fully managed, reliable service to transform data in batch mode. The same approach can be used to build pipelines that process historical data generated by various applications.

Once the data is ingested in BigQuery, it can be used for further analysis, visualization, training ML models, etc.
