Initial data portal endpoints #324

Draft
wants to merge 17 commits into
base: main

jeffbaumes
Collaborator

No description provided.

jeffbaumes and others added 17 commits October 11, 2023 10:30
* replace dead link 'nmdc-metadata' with 'issues' repo

* update name of make command to ssh into nersc mongo dbs

---------

Co-authored-by: Jing - Peters MBP <jingcao.yale@gmail.com>
* initial checkin with base class and basic tests

- Base ChangeSheet write class
- unit tests for base class

* add conftest and gold changesheet tests

- move test fixtures to conftest.py
- add get_biosample_name function and unit test to GoldBiosample generator

* update biosample name unit test

add explicit expected values

* Sketch out functions for gold changesheet generator

* function and test for missing GOLD ecosystem metadata

* add function and test for missing gold_biosample_identifiers

* add get_normalized_gold_biosample_identifier

* update logic with omics processing step

* skeleton find_omics_processing_set function, and updated (correct this time) test data files

* Add Omics to Biosample map

- add omics_to_biosample map input
- added nmdc / gold BioSample comparison logic
- unit tests
- stub API dependent methods

* Add changesheets.py package for common functions and classes

- Changesheet and ChangesheetLineItem classes
- API @op functions

* refactor to split omics processing data file read to stand-alone function

* more refactoring and code cleanup

* add test generation job

* add resource definitions and config

* refactor and code cleanup

Simplify to just ChangeSheet and ChangeSheetLineItem classes

* Cleanup this branch to focus on getting assets working

* fix defs and fetch statement

* get basic GOLD asset generation working

* Add Api resources as ConfigurableResources

* Add asset scaffolding

* update normalizer functions to all take and return strings

* update resources add empty click script

* fix gold ID normalization and add unit tests

* implement compare biosamples and write_changesheet

* add omics record comparison

* Add validate_changesheet method

* cleanup unused data files

* fix validate_changesheet method and add logging

* delete dagster asset based code and tests - move to a demo branch

* add changesheet_output to .gitignore

* add changesheet_output to .gitignore

* remove Dagster-related code and settings

* style: format with black

* Use TypeAlias for JSON_OBJECT

* Removed hard-coded URL from Changesheet.validate()

* remove .tsv file - should be ignored

* clarify function name and blacken formatting

* fix click options help text and blacken

* yet more blackening

* uncomment wait-for-it

* Delete get_data.ipynb

* Revert "Delete get_data.ipynb"

This reverts commit fe3e68a.

* add docstring for generate_changesheet

* automatic reformatting

* bring get_data notebook back to original state

* add some logging

* update to use gold_sequencing_identifiers over alternative_identifiers

* Delete neon_cache.sqlite

* strip and de-tab the value in tsv output

* set default line_items in changesheet class correctly

* update output_dir type hint

* remove apply_changes option

* Dry up unfindable logging

* Clean up gold normalization and documentation

* fix: style

---------

Co-authored-by: Donny Winston <donny@polyneme.xyz>
* fix: run `bump-pydantic nmdc_runtime` and apply

closes #339

addresses #343

* fix: @model_validator refactor

closes #343
model-field ranges with `Query`-annotated types aren't covered by the automated bump-pydantic tool.
re-submission of "same" changes is a valid use case

closes #340
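
For readers unfamiliar with the "strip and de-tab the value in tsv output" commit listed above, the following is a minimal sketch of what such a cleanup could look like. The function names `clean_tsv_value` and `tsv_row` and the sample values are hypothetical and are not taken from this PR; the actual changesheet code may differ.

```python
from typing import Iterable


def clean_tsv_value(value: object) -> str:
    """Render a value as a single TSV-safe cell (hypothetical helper)."""
    # A literal tab inside a value would be read as a column separator,
    # so replace it with a space, then trim surrounding whitespace.
    return str(value).replace("\t", " ").strip()


def tsv_row(values: Iterable[object]) -> str:
    """Join cleaned values into one tab-separated line (hypothetical helper)."""
    return "\t".join(clean_tsv_value(v) for v in values)


if __name__ == "__main__":
    # Illustrative values only; not the real changesheet columns.
    print(tsv_row(["example:id-1", "update", "name", " Soil\tsample "]))
```

Cleaning each value before joining keeps the number of columns per row constant, which is what a downstream TSV reader relies on.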
@dwinston
Collaborator

dwinston commented Nov 3, 2023

coordinated with microbiomedata/nmdc-server#1037

@dwinston
Collaborator

dwinston commented Nov 3, 2023

note: denormalization of mongo collections for data portal, via a series of mongo aggregation pipelines (python nmdc_runtime/api/endpoints/portal_denormalize.py), takes approx 50s on my laptop, against a local mongorestore of the production db.
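
As a rough illustration of the approach described in the comment above, here is a sketch of one denormalizing aggregation pipeline, built as plain Python data. It assumes a `$lookup` join followed by an `$out` stage. The collection names, field names, and the `build_denormalize_pipeline` helper are assumptions made for illustration, and the pipelines in `portal_denormalize.py` may be structured differently.

```python
from typing import Any


def build_denormalize_pipeline(
    join_collection: str,
    local_field: str,
    foreign_field: str,
    as_field: str,
    out_collection: str,
) -> list[dict[str, Any]]:
    """Build a pipeline that embeds joined documents and writes the result out.

    Hypothetical helper: all names passed in are supplied by the caller.
    """
    return [
        # Attach matching documents from join_collection as an array field.
        {
            "$lookup": {
                "from": join_collection,
                "localField": local_field,
                "foreignField": foreign_field,
                "as": as_field,
            }
        },
        # Write the joined documents to a separate collection.
        {"$out": out_collection},
    ]


# Assumed names, for illustration only.
pipeline = build_denormalize_pipeline(
    join_collection="omics_processing_set",
    local_field="id",
    foreign_field="has_input",
    as_field="omics_processing",
    out_collection="portal_biosample_denormalized",
)

# With pymongo and a reachable database, this could be run as:
#   db["biosample_set"].aggregate(pipeline)
```

Writing the joined result to its own collection means the join cost is paid once per refresh rather than on every read, which fits the "takes approx 50s" batch step described above.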

5 participants