Addressable data store (aka CID store) #5715

Draft
wants to merge 45 commits into
base: master
Member

@pditommaso pditommaso commented Jan 27, 2025

Tentative implementation for addressable data store (very basic POC so far).

Update on 1 Mar 2025 from #5787 by @jorgee

M1 Implementation of CID store for provenance

Changes:

  • The CID store location is specified by workflow.data.store.location
  • The workflow hash is created from the workflow and its parameters description
  • Workflow, task and output metadata are stored in <cid.store.location>/.meta
  • References to other CID metadata take the form cid://<workflow_hash|task_hash>/<output_target_path>
  • A CID NIO filesystem gives access to data through CID URLs
  • The nextflow cid command can log, show and get lineage from the CID store metadata
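
As a rough illustration of the reference format listed above, the sketch below splits a cid:// URL into its hash and relative output path, then maps it onto a metadata path. It is not taken from the PR: the function names are hypothetical, and the on-disk layout `<store>/.meta/<hash>/<path>` is an assumption made only for this example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def parse_cid_url(url: str) -> tuple[str, str]:
    """Split cid://<hash>/<relative path> into (hash, relative_path)."""
    parsed = urlparse(url)
    if parsed.scheme != "cid" or not parsed.netloc:
        raise ValueError(f"not a CID URL: {url!r}")
    # The hash sits in the authority part; the rest is the output target path.
    return parsed.netloc, parsed.path.lstrip("/")


def meta_path(store_location: str, url: str) -> PurePosixPath:
    """Map a CID URL to a metadata path (assumed layout, not confirmed by the PR)."""
    key, rel = parse_cid_url(url)
    base = PurePosixPath(store_location) / ".meta" / key
    return base / rel if rel else base


if __name__ == "__main__":
    print(parse_cid_url("cid://7227b4647ce8331a47ee26f9944e832c/multiqc_report.html"))
    print(meta_path("/data/store", "cid://7227b4647ce8331a47ee26f9944e832c"))
```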

Known Limitations:

  • Outputs published to absolute paths or URLs that are not subfolders of the outputDir are not currently tracked in the CID store, because we cannot infer their relative output target path. We could create a hash for the parent directory of the URL or absolute path and use it as the relative folder.

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso pditommaso marked this pull request as draft January 27, 2025 13:15

netlify bot commented Jan 27, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit db79c43
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/67e13814f3ce8a0008abbe97

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 5a93547 to 27345a6 Compare February 10, 2025 21:46
@pditommaso
Member Author

@jorgee apologies, can the latest changes be made as a PR against this branch? That way it will be much simpler for me to understand what's new.

@jorgee
Contributor

jorgee commented Feb 13, 2025

@jorgee apologies, can the latest changes be made as a PR against this branch? That way it will be much simpler for me to understand what's new.

I have reverted the changes in this branch and created a new one in PR #5787

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
jorgee and others added 15 commits February 17, 2025 18:16
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@pditommaso
Member Author

Considering the checksum can be computed in different ways, the algorithm used should be tracked along with the checksum value. I've added a "task" checkbox about this in the comment above.

@pditommaso
Member Author

Minor, it may be better to rename RUN CID to RESULT CID for consistency (or add both, maybe?)

» nextflow -c data.config cid log
TIMESTAMP          	RUN NAME    	SESSION ID                          	RUN CID                               
2025-03-04 16:37:35	elated_booth	5b442940-9c88-4cf8-b02e-72281460bd57	cid://e9aa28d4be9330a0ee857494ee3b6976
nextflow (cid-store)» nextflow -c data.config cid show cid://e9aa28d4be9330a0ee857494ee3b6976
{
    "type": "WorkflowResults",
    "run": "cid://7227b4647ce8331a47ee26f9944e832c",
    "outputs": [
        "cid://7227b4647ce8331a47ee26f9944e832c/fastqc_ggal_gut_logs",
        "cid://7227b4647ce8331a47ee26f9944e832c/multiqc_report.html"
    ]
}

Member

@jorgee you can use the onWorkflowPublish event to capture the output metadata (i.e. annotations) from the workflow outputs. This event comes from PublishOp if you want to see how it works. Alternatively, we can modify PublishOp to send the entire metadata for an output when it's done.

I might sketch a PR for this later if I have some time

Contributor

@bentsherman do you know a pipeline, example or test where it is used?

Member

@jorgee
Contributor

jorgee commented Mar 10, 2025

Considering the checksum can be computed in different ways, the algorithm used should be tracked along with the checksum value. I've added a "task" checkbox about this in the comment above.

I am considering adding the algorithm inside the checksum field like the container image digest.

checksum="<algorithm>:<hash>"

However, the hashing algorithm used by Nextflow depends on the mode and the type of data. I would have to replicate the same code to extract the algorithm and, if I am not wrong, it could also be a combination of algorithms. For instance, for files or directories it uses a SHA-256 hash passed to the default murmur3_128 hasher. So for now I will put something like checksum="nf-<mode>:<hash>", to indicate that we are using the Nextflow hashing mechanism with the indicated mode.
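
To make the prefixed form concrete, here is a minimal sketch of building and splitting such a checksum string. The helper names are hypothetical and the digest is treated as an opaque string; the sketch does not reproduce Nextflow's actual hashing.

```python
def format_checksum(mode: str, digest: str) -> str:
    """Build the proposed "nf-<mode>:<hash>" string from a mode and a digest."""
    return f"nf-{mode}:{digest}"


def parse_checksum(checksum: str) -> tuple[str, str]:
    """Split "<algorithm>:<hash>" at the first colon into (algorithm, hash)."""
    algorithm, sep, digest = checksum.partition(":")
    if not sep or not algorithm or not digest:
        raise ValueError(f"malformed checksum: {checksum!r}")
    return algorithm, digest


if __name__ == "__main__":
    # The digest value here is a placeholder, not a real Nextflow hash.
    text = format_checksum("standard", "e9aa28d4be9330a0ee857494ee3b6976")
    print(text)
    print(parse_checksum(text))
```

A reader of the field still has to parse the prefix to recover the mode, which is the main trade-off of packing everything into one string.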

@pditommaso
Member Author

I don't understand the rationale for recording the mode rather than the actual algorithm.

@bentsherman
Member

If he's using the HashBuilder to hash files, then the checksum will differ depending on whether the process cache directive is set to standard, lenient, deep, etc.

@jorgee
Contributor

jorgee commented Mar 10, 2025

Yes, I am currently using the HashBuilder.

@pditommaso
Member Author

I'd consider using something like the following, to avoid collapsing everything into the prefix (and to make it more extensible if needed):

checksum: { value:<>, alg:<>, mode:<> }
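
A structured record along those lines might be sketched as follows. Only the field names value, alg and mode come from the suggestion above; the class name, the JSON serialization and the sample values are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Checksum:
    value: str  # the digest itself
    alg: str    # hashing algorithm label (placeholder values below)
    mode: str   # hashing mode, e.g. standard, lenient or deep


def to_json(checksum: Checksum) -> str:
    """Serialize the record so each part stays a separate, queryable field."""
    return json.dumps(asdict(checksum), sort_keys=True)


if __name__ == "__main__":
    # Sample values are hypothetical.
    print(to_json(Checksum(value="e9aa28d4", alg="nextflow", mode="standard")))
```

Keeping the parts as separate fields means a new attribute can be added later without changing how existing values are parsed.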

jorgee and others added 4 commits March 11, 2025 17:45
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Co-authored-by: jorgee <jorge.ejarque@seqera.io>
@pditommaso
Member Author

pditommaso commented Mar 15, 2025

Minor, this warning message should not contain the object address

WARN: Can't read CID history file: nextflow.data.cid.CidHistoryFile@43b172e3

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Co-authored-by: jorgee <jorge.ejarque@seqera.io>
@bentsherman bentsherman added this to the 25.04.0 milestone Mar 20, 2025
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Co-authored-by: jorgee <jorge.ejarque@seqera.io>
3 participants