@bpblanken bpblanken commented Dec 11, 2024

Adds helper functions for querying TDR and the underlying BigQuery stores. Does not manage persisting the results into the pipeline!

@bpblanken bpblanken changed the title from "Benb/fetch predicted sex from tdr" to "Add helper function for querying Terra Data Repository" Dec 11, 2024
@bpblanken bpblanken changed the title from "Add helper function for querying Terra Data Repository" to "Add helper functions for querying Terra Data Repository" Dec 11, 2024
@bpblanken bpblanken marked this pull request as ready for review December 11, 2024 19:01
@bpblanken bpblanken requested a review from a team as a code owner December 11, 2024 19:01
status_forcelist=[500, 502, 503, 504],
)
s.mount('https://', HTTPAdapter(max_retries=retries))
s = requests_retry_session()
Contributor

👍



def gen_bq_table_names() -> Generator[str]:
    with ThreadPoolExecutor(max_workers=5) as executor:
Contributor

How did you choose 5 for max_workers?

Collaborator Author

I believe AI was responsible for this choice 🍭.

Collaborator Author

I'll make it a constant in the downstream PR.
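
For reference, a minimal sketch of what pulling the worker count into a named constant could look like. The constant name `BQ_MAX_WORKERS`, the helper `fetch_table_name`, and the returned string format are all hypothetical stand-ins, not names or values from this PR; only the value 5 and the use of `ThreadPoolExecutor` come from the diff above.

```python
from collections.abc import Iterable, Iterator
from concurrent.futures import ThreadPoolExecutor

# Hypothetical constant name; 5 is the value used in the diff above.
BQ_MAX_WORKERS = 5


def fetch_table_name(dataset_id: str) -> str:
    # Stand-in for the real per-dataset request; the format is made up.
    return f'table_for_{dataset_id}'


def gen_table_names(dataset_ids: Iterable[str]) -> Iterator[str]:
    # The pool size is read from the module-level constant, so it can be
    # tuned in one place instead of being a magic number at the call site.
    with ThreadPoolExecutor(max_workers=BQ_MAX_WORKERS) as executor:
        # executor.map yields results in the same order as the inputs.
        yield from executor.map(fetch_table_name, dataset_ids)
```

Calling `list(gen_table_names(['a', 'b']))` would run the stand-in fetches on the pool and return their results in input order.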

)


def _get_dataset_ids() -> list[str]:
Contributor

Are we calling this on every pipeline run? It fetches all of the datasets in the TDR. Is there a way to filter it first, perhaps by dataset name?

Collaborator Author

Yes! This should be called on every pipeline run, but it's only a single API request. The plan was to combine the result of this request with a persisted list of dataset ids that we've seen before to create a "new dataset ids" list that would be passed into gen_bq_sample_metrics.
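
The "new dataset ids" step described above could be sketched roughly as follows. This is an illustration under assumptions, not the code from the downstream PR: the function name `find_new_dataset_ids` is hypothetical, and how the previously seen ids are persisted and loaded is left out, since this PR explicitly does not cover persistence.

```python
def find_new_dataset_ids(
    fetched_ids: list[str],
    seen_ids: set[str],
) -> list[str]:
    """Return the fetched ids not seen before, in fetch order, deduplicated."""
    # Copy so the caller's persisted set is not mutated.
    already_seen = set(seen_ids)
    new_ids = []
    for dataset_id in fetched_ids:
        if dataset_id not in already_seen:
            already_seen.add(dataset_id)
            new_ids.append(dataset_id)
    return new_ids
```

Only the ids returned here would then need to be handed to gen_bq_sample_metrics, so datasets processed on earlier runs are not queried again.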

Base automatically changed from benb/add_service_account_credentialing to dev December 13, 2024 14:41
@bpblanken bpblanken merged commit 8b58c01 into dev Dec 13, 2024
2 checks passed
@bpblanken bpblanken deleted the benb/fetch_predicted_sex_from_tdr branch December 13, 2024 16:48
bpblanken added a commit that referenced this pull request Dec 13, 2024
* Add service account credentialing (#997)

* Add service account credentialing

* ruff

* feat: Handle parsing empty predicted sex into Unknown (#1000)

* Add helper functions for querying `Terra Data Repository` (#998)

* Add service account credentialing

* ruff

* First pass

* tests passing

* add coverage of bigquery test

* change function names

* use generators everywhere

* bq requirement

* resolver

* Update sample id name

* Build Sex Check Table from TDR Metrics (#999)
bpblanken added a commit that referenced this pull request Jan 7, 2025
* Add service account credentialing (#997)

* Add service account credentialing

* ruff

* feat: Handle parsing empty predicted sex into Unknown (#1000)

* Add helper functions for querying `Terra Data Repository` (#998)

* Add service account credentialing

* ruff

* First pass

* tests passing

* add coverage of bigquery test

* change function names

* use generators everywhere

* bq requirement

* resolver

* Update sample id name

* Build Sex Check Table from TDR Metrics (#999)

* refactor: Move feature flags to FeatureFlag enum. (#1002)

* refactor: Move feature flags out of environment to their own dataclass

* lint: ruff

* ruff

* bugfix: exclude samples from relationship checking that are not present in the expected loadable samples (#1003)

* bugfix: exclude samples from relationship checking that are not present in the expected loadable samples

* cleanup

* feat: add remap and family loading failures as validation exceptions … (#1005)

* feat: add remap and family loading failures as validation exceptions rather than runtime errors

* move on

* Update write_remapped_and_subsetted_callset_test.py

* ruff

* feat: Add ability to run tasks dataproc. (#948)

* Support gcs dirs in rsync

* ws

* Add create dataproc cluster task

* add dataproc

* ruff

* requirements

* still struggling

* Gencode refactor to remove gcs

* bump reqs

* Run dataproc job

* lib

* running

* merge requirements

* Flip'em

* Better exception handling

* Cleaner approach if less generalizable

* write a test

* Fix tests

* lint

* Add test for success

* refactor to use a base class... better for adding support for multiple jobs

* cleanup

* ruff

* Fix missing mock

* Fix flapping test

* pr comments
3 participants