Collect data resources together as a "data package" #41
Merged

Changes from all commits (8 commits, all authored by d33bs):
- 90f0a3a  initial work towards step 5
- 0da358d  linting
- 5d19a79  add better context to readme
- 5428e10  spacing
- 46f0021  initial movie frame image extraction work
- 9494d68  migrate from dagger to docker
- 08fae69  write tiff data to parquet and write to lancedb
- ce73ad3  updates based on comments from review
New file: `.pre-commit-config.yaml` (+43 lines):

```yaml
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
default_language_version:
  python: python3.11
files: >
  (?x)^(
  5.data_packaging/.*|
  pyproject.toml|
  poetry.lock|
  README.md
  )$
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-added-large-files
      - id: check-toml
  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.6
    hooks:
      - id: codespell
        exclude: >
          (?x)^(
          .*\.lock|.*\.csv
          )$
  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 5.13.2
    hooks:
      - id: isort
  - repo: https://github.com/python-poetry/poetry
    rev: "1.8.0"
    hooks:
      - id: poetry-check
  - repo: https://github.com/hadolint/hadolint
    rev: v2.12.0
    hooks:
      - id: hadolint-docker
```
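As a usage note (not part of the diff): once pre-commit is installed, for example via `pip install pre-commit`, these hooks can be run across the filtered files with `pre-commit run --all-files`, or registered to run automatically on each commit with `pre-commit install`.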
New file: `1.idr_streams/stream_files/idr0013-screenA-plates-w-colnames.tsv` (511 additions, 0 deletions; large diffs are not rendered by default).
New file (Dockerfile for bfconvert, +33 lines):

```dockerfile
# This Dockerfile is for running bfconvert for this project.
# See here for more:
# https://bio-formats.readthedocs.io/en/v7.3.0/users/comlinetools/conversion.html

# base image java
FROM openjdk:22-slim

# provide a version argument
ARG version=7.2.0

# set the workdir to /app
WORKDIR /app

# create a directory for the application files
RUN mkdir -p /opt/bftools

# install required packages
# hadolint ignore=DL3008
RUN apt-get update \
    && apt-get install --no-install-recommends -y \
    wget \
    unzip \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# download and unzip bftools
RUN wget --progress=dot:giga \
    https://downloads.openmicroscopy.org/bio-formats/$version/artifacts/bftools.zip \
    -O /opt/bftools/bftools.zip \
    && unzip /opt/bftools/bftools.zip -d /opt

# Set the entrypoint for bfconvert
ENTRYPOINT ["/opt/bftools/bfconvert"]
```
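For orientation, a minimal sketch of driving this image from Python with the standard library. The `bfconvert` image tag, the mounted paths, and the `.ch5`-to-TIFF filenames are hypothetical placeholders for illustration, not the project's actual invocation:

```python
import os
import subprocess

# build the bfconvert image from the Dockerfile above
# (the "bfconvert" tag is a hypothetical name for illustration)
subprocess.run(
    ["docker", "build", "--platform", "linux/amd64", "-t", "bfconvert", "."],
    check=True,
)

# run bfconvert on files in the current directory via a bind mount
# (the input and output filenames below are hypothetical)
subprocess.run(
    [
        "docker", "run", "--rm",
        "--platform", "linux/amd64",
        "-v", f"{os.getcwd()}:/data",
        "bfconvert",
        "/data/example_movie.ch5",
        "/data/example_movie.tiff",
    ],
    check=True,
)
```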
New file: `5.data_packaging/README.md` (+73 lines):

````markdown
# 5. Data Packaging

In this module, we collect and package data created by this project.
Packaging makes the data easier both to store and for others to use.
This portion of the project strives to be additive-only, attempting "first, do no harm" through a "yes, and" focus.

A story to help describe the goals:

_"As a research data participant I need a way to analyze (understand, contextualize, and explore) and implement (engineer solutions which efficiently scale for time and computing resources) the data found here in order to effectively reproduce findings, make new discoveries, and avoid challenging (or perhaps incorrect) translations individually."_

We tried to think about the "research data participant" here with empathy; we can reduce barriers for other people by readying the data for use outside of this project.
If the barriers are high, a person may use the data incorrectly or opt not to use it at all.
We can't know all the reasons why or how someone might use the findings here, but we can empathize with them by considering the time cost they may face to use the data.
A side effect of thinking this way is that we can also benefit one another (we all face similar challenges).

Proposed solutions:

- Use named-column data tables at a bare minimum to increase data understandability for the audience.
- Use data-typed in-memory and file-based formats to ensure consistent handling of data once read from (potentially untyped) data sources.
- Store data in high-performance file formats for distribution and scalable implementation by other people.
- Share data schemas upfront to indicate data translation outside of in-process observation.
- Containerize OS-level dependencies to ensure reproducibility.
- Rewrite IDR data extraction to use FTP as a simple and currently documented procedure.
- Create avenues for reuse where possible to help increase the chances of multiplied benefit beyond just this PR.

## Development

This module leverages system-available Python, [Poetry](https://github.com/python-poetry/poetry), and [Poe the Poet](https://poethepoet.natn.io/index.html) (among other dependencies found in the `pyproject.toml` file) to complete tasks.
It also uses Docker to reproducibly provide additional tooling outside of the Python dependencies.
We recommend installing Docker (suggested through [Docker Desktop](https://www.docker.com/products/docker-desktop/)), Python (suggested through [pyenv](https://github.com/pyenv/pyenv)), and Poetry (suggested through `pip install poetry`), then using the following to run the processes related to this step.

```sh
# note: run these from the project root (one directory up).
# after installing poetry, create the environment
poetry install

# run the poe the poet task related to this step
# (triggers multiple Python modules)
poetry run poe package_data
```

## Data Assets

The following data assets are included as part of the data package.

- mitocheck_metadata/features.samples-w-colnames.txt
- mitocheck_metadata/idr0013-screenA-annotation.csv.gz
- 0.locate_data/locations/negative_control_locations.tsv
- 0.locate_data/locations/positive_control_locations.tsv
- 0.locate_data/locations/training_locations.tsv
- 1.idr_streams/stream_files/idr0013-screenA-plates-w-colnames.tsv
- 2.format_training_data/results/training_data__ic.csv.gz
- 2.format_training_data/results/training_data__no_ic.csv.gz
- 3.normalize_data/normalized_data/training_data__ic.csv.gz
- 3.normalize_data/normalized_data/training_data__no_ic.csv.gz
- 4.analyze_data/results/compiled_2D_umap_embeddings.csv
- 4.analyze_data/results/single_cell_class_counts.csv

## Schema

The schemas for the data assets mentioned above are stored as references to help understand how the data will translate into various databases or collections of data.
This information may be found within the `5.data_packaging/schema` directory as text versions of PyArrow Table schemas.

## Data Asset Column Labeling

Some data assets are duplicated to help label their columns where the originals may not include them.
These column names help provide context on what the row values contain and make working with the data more observable within Arrow, Lance, and other technologies used here.
These are presented below as pairs (with `-w-colnames` indicating data that uses updated labeling for column names).

- mitocheck_metadata/features.samples.txt
- mitocheck_metadata/features.samples-w-colnames.txt
- 1.idr_streams/stream_files/idr0013-screenA-plates.tsv
- 1.idr_streams/stream_files/idr0013-screenA-plates-w-colnames.tsv
````
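The schema text files themselves live in the repository; as a hypothetical sketch of how such a text reference could be produced with PyArrow (the output file name and the use of `pyarrow.csv` here are illustrative assumptions, not necessarily the project's actual mechanism):

```python
import pathlib

import pyarrow.csv as pa_csv

# read one packaged asset into a typed Arrow table
table = pa_csv.read_csv("4.analyze_data/results/single_cell_class_counts.csv")

# write the table's PyArrow schema out as a plain-text reference
schema_dir = pathlib.Path("5.data_packaging/schema")
schema_dir.mkdir(parents=True, exist_ok=True)
(schema_dir / "single_cell_class_counts.schema.txt").write_text(str(table.schema))
```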
New file (`constants.py`, +45 lines):

```python
"""
Create various constants for use during data packaging.
"""

# list of files to be processed
DATA_FILES = [
    "mitocheck_metadata/features.samples.txt",
    "mitocheck_metadata/features.samples-w-colnames.txt",
    "mitocheck_metadata/idr0013-screenA-annotation.csv.gz",
    "0.locate_data/locations/negative_control_locations.tsv",
    "0.locate_data/locations/positive_control_locations.tsv",
    "0.locate_data/locations/training_locations.tsv",
    "1.idr_streams/stream_files/idr0013-screenA-plates.tsv",
    "1.idr_streams/stream_files/idr0013-screenA-plates-w-colnames.tsv",
    "2.format_training_data/results/training_data__ic.csv.gz",
    "2.format_training_data/results/training_data__no_ic.csv.gz",
    "3.normalize_data/normalized_data/training_data__ic.csv.gz",
    "3.normalize_data/normalized_data/training_data__no_ic.csv.gz",
    "4.analyze_data/results/compiled_2D_umap_embeddings.csv",
    "4.analyze_data/results/single_cell_class_counts.csv",
]

# create a copy of the data files, removing any which don't include column names
DATA_FILES_W_COLNAMES = [
    file
    for file in DATA_FILES
    if file
    not in [
        "mitocheck_metadata/features.samples.txt",
        "1.idr_streams/stream_files/idr0013-screenA-plates.tsv",
    ]
]

PACKAGING_FILES = ["5.data_packaging/location_and_ch5_frame_image_data.parquet"]

# FTP resources for accessing IDR
# See here for more:
# https://idr.openmicroscopy.org/about/download.html
FTP_IDR_URL = "ftp.ebi.ac.uk"
FTP_IDR_USER = "anonymous"
FTP_IDR_MITOCHECK_CH5_DIR = (
    "/pub/databases/IDR/idr0013-neumann-mitocheck/20150916-mitocheck-analysis/mitocheck"
)

DOCKER_PLATFORM = "linux/amd64"
```
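A minimal sketch (not part of the diff) of exercising these FTP constants with Python's standard library `ftplib`, assuming IDR's anonymous FTP access works as documented at the URL above:

```python
from ftplib import FTP

from constants import FTP_IDR_MITOCHECK_CH5_DIR, FTP_IDR_URL, FTP_IDR_USER

# connect anonymously to the IDR FTP host and list the MitoCheck ch5 directory
with FTP(FTP_IDR_URL) as ftp:
    ftp.login(user=FTP_IDR_USER)
    ftp.cwd(FTP_IDR_MITOCHECK_CH5_DIR)
    print(ftp.nlst()[:10])  # show a sample of the available entries
```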
New file (LanceDB packaging module, +69 lines):

```python
"""
Python module for packaging data into a LanceDB database.
"""

import pathlib

import duckdb
import lancedb
import pandas as pd
import pyarrow as pa
from constants import DATA_FILES_W_COLNAMES, PACKAGING_FILES
from pyarrow import parquet

# specify a dir where the lancedb database may go and create a lancedb client
lancedb_dir = pathlib.Path("5.data_packaging/packaged/lancedb/mitocheck_data")
ldb = lancedb.connect(lancedb_dir)


def get_arrow_tbl_from_csv(filename_read: str) -> pa.Table:
    """
    Get an Arrow table from a CSV file through DuckDB.

    Args:
        filename_read (str):
            The path to the CSV file to be read.

    Returns:
        pa.Table:
            An Arrow table obtained from the CSV file.
    """

    # try to read a typed arrow table,
    # falling back to a high-memory (string-focused) pandas
    # dataframe read converted to arrow
    try:
        with duckdb.connect() as ddb:
            return ddb.execute(
                f"""
                SELECT *
                FROM read_csv('{filename_read}');
                """
            ).arrow()
    except duckdb.duckdb.ConversionException:
        return pa.Table.from_pandas(
            df=pd.read_csv(filepath_or_buffer=filename_read, low_memory=False),
        )


# send csv file data as arrow tables to a lancedb database
for filename in DATA_FILES_W_COLNAMES:
    table = get_arrow_tbl_from_csv(filename_read=filename)
    ldb.create_table(
        name=filename.replace("/", "."),
        data=table,
        schema=table.schema,
        mode="overwrite",
    )

# send packaged parquet data to the same lancedb database
for filename in PACKAGING_FILES:
    table = parquet.read_table(source=filename)
    ldb.create_table(
        name=filename.replace("/", "."),
        data=table,
        schema=table.schema,
        mode="overwrite",
    )
```
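To show what consuming the package might look like, a minimal sketch of reading a table back out of the LanceDB database. The table name follows the `/`-to-`.` renaming applied above; this snippet is illustrative rather than part of the PR:

```python
import lancedb

# connect to the packaged database created above
ldb = lancedb.connect("5.data_packaging/packaged/lancedb/mitocheck_data")

# open one of the created tables by its dotted name and inspect it
tbl = ldb.open_table("4.analyze_data.results.single_cell_class_counts.csv")
print(tbl.to_pandas().head())
```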
Review comment:

> This README perfectly clarifies the goal, wonderful job!
> I fully understand that the point is not related to IDR_stream, but to make the output of IDR_stream more accessible.