This repository contains data files and related scripts/notebooks used for testing and experimenting with BERtron-related services.
Important
The map web page (in the bertron repository) depends upon the following files in this repository remaining in their current locations:
- emsl/map/all_emsl_samples.json
- emsl/map/latlon_project_ids.json
- ess-dive/ess_dive_packages.csv
- jgi/jgi_gold_biosample_geo.csv
- jgi/jgi_gold_organism_geo.csv
- nmdc/nmdc_biosample_geo_coordinates.csv
Please ensure the map web page has been updated accordingly before you move or delete any of those files.
To ensure consistency, efficient validation, and compatibility with the latest release schema, all new data ingest processes must follow these conventions:
- All ingested data must reside within the top-level ingest/ directory.
- Each data provider must have its own subfolder within ingest/, e.g.:
  - ingest/emsl/
  - ingest/jgi/
  - ingest/ess-dive/
  - ingest/nmdc/
- All files must be formatted as JSON lists (i.e., data enclosed in square brackets).
- Each file contains only complete records (entities). No record may be split across files.
- Each record must be independently valid against the current release schema.
- (Future consideration: JSON Lines format may be adopted if more appropriate for downstream usage.)
- Each data file should not exceed approximately 25 MB.
- If the dataset is larger, split into multiple files. Do not split individual entity records across files.
- Document the splitting strategy if custom logic is required.
- Files must be named as <data provider>_<padded 5 digit sequence>.json
  - Example: emsl_00001.json, jgi_00005.json
- Numbering must start at 00001 for each provider and increment as needed.
ingest/
  emsl/
    emsl_00001.json
  jgi/
    jgi_00001.json
    jgi_00002.json
  ess-dive/
    ess-dive_00001.json
  nmdc/
    nmdc_00001.json
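The splitting and naming rules can be sketched in Python as follows. This is a minimal illustration, not the repository's actual tooling: the function names chunk_records and write_provider_files are hypothetical, "approximately 25 MB" is assumed to mean 25 MiB, and sizes are measured on the default json.dumps serialization.

```python
import json
from pathlib import Path

MAX_BYTES = 25 * 1024 * 1024  # assumption: "approximately 25 MB" read as 25 MiB


def chunk_records(records, max_bytes=MAX_BYTES):
    """Yield lists of whole records whose serialized JSON list fits in max_bytes."""
    chunk, size = [], 2  # 2 bytes for the enclosing "[" and "]"
    for record in records:
        encoded = len(json.dumps(record).encode("utf-8"))
        # json.dumps separates list items with ", " (2 bytes)
        extra = encoded + (2 if chunk else 0)
        if chunk and size + extra > max_bytes:
            yield chunk
            chunk, size = [], 2
            extra = encoded
        chunk.append(record)  # records are never split across chunks
        size += extra
    if chunk:
        yield chunk


def write_provider_files(provider, records, ingest_root="ingest", max_bytes=MAX_BYTES):
    """Write records to <ingest_root>/<provider>/<provider>_00001.json, _00002.json, ..."""
    out_dir = Path(ingest_root) / provider
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for seq, chunk in enumerate(chunk_records(records, max_bytes), start=1):
        path = out_dir / f"{provider}_{seq:05d}.json"
        path.write_text(json.dumps(chunk), encoding="utf-8")
        paths.append(path)
    return paths
```

A single record larger than the limit is written alone to its own file rather than split, since records must stay whole. Validating each record against the release schema is a separate step that this sketch does not cover.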
- All ingests must support the latest release schema.
- The JSON list format is specified above; any future migration to JSON Lines will be documented separately.
- No records are to be split between files; each file is independently valid.
- For more information or updates to these conventions, see issue #9.
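A provider folder can be checked against the structural conventions with a short script along these lines. This is a hedged sketch: check_provider_dir is a hypothetical helper, the 25 MiB limit is an assumed reading of "approximately 25 MB", and the requirement that sequence numbers have no gaps is an assumption about "increment as needed". It does not validate records against the release schema.

```python
import json
import re
from pathlib import Path

MAX_BYTES = 25 * 1024 * 1024  # assumption: "approximately 25 MB" read as 25 MiB


def check_provider_dir(provider_dir, max_bytes=MAX_BYTES):
    """Return a list of problems found in one ingest/<provider>/ folder."""
    provider_dir = Path(provider_dir)
    provider = provider_dir.name
    pattern = re.compile(rf"^{re.escape(provider)}_(\d{{5}})\.json$")
    problems, sequences = [], []
    for path in sorted(provider_dir.iterdir()):
        if not path.is_file():
            continue
        match = pattern.match(path.name)
        if match is None:
            problems.append(f"{path.name}: name does not match {provider}_NNNNN.json")
            continue
        sequences.append(int(match.group(1)))
        if path.stat().st_size > max_bytes:
            problems.append(f"{path.name}: larger than {max_bytes} bytes")
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{path.name}: not valid JSON ({exc.msg})")
            continue
        if not isinstance(data, list):
            problems.append(f"{path.name}: top-level value is not a JSON list")
    # Assumed rule: numbering runs 00001, 00002, ... with no gaps.
    if sequences != list(range(1, len(sequences) + 1)):
        problems.append(f"{provider}: sequence numbers are not 00001..{len(sequences):05d}")
    return problems
```

An empty return value means the folder passed these structural checks; for example, check_provider_dir("ingest/jgi") could be run before opening a pull request.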
Tools, scripts, notebooks, etc. for populating the data (for ingest into MongoDB) from each resource should live in the contrib directory.
Each data provider has its own subfolder within contrib/, e.g.:
- contrib/emsl/
- contrib/jgi/
- contrib/ess-dive/
- contrib/nmdc/