[KED-1876] Add support for custom queries with GBQTableDataSet #442

ajb7 · 2020-07-21T23:52:13Z

Description

Current implementation: GBQTableDataSet loads data from Google BigQuery. It uses pandas-gbq to read from BigQuery table. In extras/datasets/pandas/gbq_dataset.py

    def _load(self) -> pd.DataFrame:
        sql = "select * from {}.{}".format(self._dataset, self._table_name)  # nosec
        return pd.read_gbq(
            sql,
            project_id=self._project_id,
            credentials=self._credentials,
            **self._load_args
        )

This works well when dataset is small to medium in size. However, when we have "Big Data" there is a need to get specific columns or specific rows or specific partition from the dataset; hence using custom queries with filters is required. The select * implementation in many ways, violates Best Practices to fetch data from GoogleBigQuery.

Context

To comply with the GBQ Best Practices in order to make the queries cost efficient as well as time efficient; GBQTableDataSet can empower developers to pass custom queries as arguments; instead of select * ; as in the current implementation.

Current implementation uses pd.read_gbq() to load data from Google Big Query. This function allows custom queries along with other arguments to be passed. The power of read_gbq() can be fully utilized by passing the allowed parameters of function as part of load_args in GBQTableDataSet.

As per documentation :

pandas.read_gbq(query: str, project_id: Union[str, NoneType] = None, 
 index_col: Union[str, NoneType] = None, col_order: Union[List[str], NoneType] = None, 
 reauth: bool = False, auth_local_webserver: bool = False, 
 dialect: Union[str, NoneType] = None, location: Union[str, NoneType] = None, 
 configuration: Union[Dict[str, Any], NoneType] = None, credentials=None, 
 use_bqstorage_api: Union[bool, NoneType] = None, private_key=None, 
 verbose=None, progress_bar_type: Union[str, NoneType] = None) → ’DataFrame’

To pass custom queries to GBQTableDataSet, catalog.yml will look like:

my_dataset:
  type: pandas.GBQTableDataSet
  dataset: gbq_dataset_name
  table_name: gbq_table_name
  project: gbq_project_name
  load_args:
    query: "Select col1, col2, col3 from project.dataset.table_name where col4 < 100"

The text was updated successfully, but these errors were encountered:

mzjp2 · 2020-07-23T17:14:02Z

Hey @ajb7, thanks for raising the issue - this is actually perfect. We've had similar feedback recently and have recently opened a ticket on our backlog to handle exactly this (we agree with all your comments) - we're super thankful to have you taking this up in the PR you've opened!

mzjp2 · 2020-07-29T13:38:45Z

Closed in #443

ajb7 added the Issue: Feature Request New feature or improvement to existing feature label Jul 21, 2020

ajb7 mentioned this issue Jul 22, 2020

[KED-1876] Allow GBQTableDataSet to optionally accept a sql query to load data #443

Merged

mzjp2 changed the title ~~Add support for Custom Queries to run on Google BigQuery using GBQTableDataSet~~ [KED-1876] Add support for Custom Queries to run on Google BigQuery using GBQTableDataSet Jul 23, 2020

mzjp2 changed the title ~~[KED-1876] Add support for Custom Queries to run on Google BigQuery using GBQTableDataSet~~ [KED-1876] Add support for custom queries with GBQTableDataSet Jul 23, 2020

mzjp2 closed this as completed Jul 29, 2020

Galileo-Galilei mentioned this issue Sep 20, 2021

Universal Kedro deployment (Part 2) - Offer a unified interface to external compute and storage backends #904

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[KED-1876] Add support for custom queries with GBQTableDataSet #442

[KED-1876] Add support for custom queries with GBQTableDataSet #442

ajb7 commented Jul 21, 2020

mzjp2 commented Jul 23, 2020

mzjp2 commented Jul 29, 2020

[KED-1876] Add support for custom queries with GBQTableDataSet #442

[KED-1876] Add support for custom queries with GBQTableDataSet #442

Comments

ajb7 commented Jul 21, 2020

Description

Context

mzjp2 commented Jul 23, 2020

mzjp2 commented Jul 29, 2020