Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KED-1876] Add support for custom queries with GBQTableDataSet #442

Closed
ajb7 opened this issue Jul 21, 2020 · 2 comments
Closed

[KED-1876] Add support for custom queries with GBQTableDataSet #442

ajb7 opened this issue Jul 21, 2020 · 2 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@ajb7
Copy link
Contributor

ajb7 commented Jul 21, 2020

Description

Current implementation: GBQTableDataSet loads data from Google BigQuery. It uses pandas-gbq to read from BigQuery table. In extras/datasets/pandas/gbq_dataset.py

    def _load(self) -> pd.DataFrame:
        sql = "select * from {}.{}".format(self._dataset, self._table_name)  # nosec
        return pd.read_gbq(
            sql,
            project_id=self._project_id,
            credentials=self._credentials,
            **self._load_args
        )

This works well when dataset is small to medium in size. However, when we have "Big Data" there is a need to get specific columns or specific rows or specific partition from the dataset; hence using custom queries with filters is required. The select * implementation in many ways, violates Best Practices to fetch data from GoogleBigQuery.

Context

To comply with the GBQ Best Practices in order to make the queries cost efficient as well as time efficient; GBQTableDataSet can empower developers to pass custom queries as arguments; instead of select * ; as in the current implementation.

Current implementation uses pd.read_gbq() to load data from Google Big Query. This function allows custom queries along with other arguments to be passed. The power of read_gbq() can be fully utilized by passing the allowed parameters of function as part of load_args in GBQTableDataSet.

As per documentation :

pandas.read_gbq(query: str, project_id: Union[str, NoneType] = None, 
 index_col: Union[str, NoneType] = None, col_order: Union[List[str], NoneType] = None, 
 reauth: bool = False, auth_local_webserver: bool = False, 
 dialect: Union[str, NoneType] = None, location: Union[str, NoneType] = None, 
 configuration: Union[Dict[str, Any], NoneType] = None, credentials=None, 
 use_bqstorage_api: Union[bool, NoneType] = None, private_key=None, 
 verbose=None, progress_bar_type: Union[str, NoneType] = None) → ’DataFrame’

To pass custom queries to GBQTableDataSet, catalog.yml will look like:

my_dataset:
  type: pandas.GBQTableDataSet
  dataset: gbq_dataset_name
  table_name: gbq_table_name
  project: gbq_project_name
  load_args:
    query: "Select col1, col2, col3 from project.dataset.table_name where col4 < 100"
@ajb7 ajb7 added the Issue: Feature Request New feature or improvement to existing feature label Jul 21, 2020
@mzjp2 mzjp2 changed the title Add support for Custom Queries to run on Google BigQuery using GBQTableDataSet [KED-1876] Add support for Custom Queries to run on Google BigQuery using GBQTableDataSet Jul 23, 2020
@mzjp2
Copy link
Contributor

mzjp2 commented Jul 23, 2020

Hey @ajb7, thanks for raising the issue - this is actually perfect. We've had similar feedback recently and have recently opened a ticket on our backlog to handle exactly this (we agree with all your comments) - we're super thankful to have you taking this up in the PR you've opened!

@mzjp2 mzjp2 changed the title [KED-1876] Add support for Custom Queries to run on Google BigQuery using GBQTableDataSet [KED-1876] Add support for custom queries with GBQTableDataSet Jul 23, 2020
@mzjp2
Copy link
Contributor

mzjp2 commented Jul 29, 2020

Closed in #443

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
None yet
Development

No branches or pull requests

2 participants