Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include remote path in FileHolder #5911

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

rcannood
Copy link
Contributor

@rcannood rcannood commented Mar 21, 2025

Alternative solution to #5907 to implement #5905.

I didn't change any other calls of FileHolder except the one @bentsherman suggested in #5907.

Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>
Copy link

netlify bot commented Mar 21, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 7ac07c9
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/67dd63130209110008ca76ca

Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>
@rcannood rcannood force-pushed the add-origPath-to-FileHolder branch from ab13b83 to 0a2ae72 Compare March 21, 2025 09:42
@rcannood
Copy link
Contributor Author

Looking at the constructor of FileHolder, it's not straight forward to just add a Path origPath = null to the arguments of the constructor.

    FileHolder( Path inputFile ) {
        assert inputFile
        this.sourceObj = inputFile
        this.storePath = real(inputFile)
        this.stageName = norm(inputFile.getFileName())
    }

    FileHolder( def origin, Path path ) {
        assert origin != null
        assert path != null

        this.sourceObj = origin
        this.storePath = path
        this.stageName = norm(path.getFileName())
    }

    protected FileHolder( def source, Path store, def stageName ) {
        this.sourceObj = source
        this.storePath = store
        this.stageName = norm(stageName)
    }

I just added a class RemoteFileHolder which has a different set of constructors. However there are many more alternative solutions -- e.g. by adding a setter to FileHolder so it doesn't need to get passed to the constructor.

@bentsherman do you have any suggestions on which solution you think would be best?

This reverts commit 0a2ae72.

Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>
Signed-off-by: Robrecht Cannoodt <rcannood@gmail.com>
@rcannood rcannood force-pushed the add-origPath-to-FileHolder branch from 4cf7ad5 to 1793efc Compare March 21, 2025 10:41
@bentsherman
Copy link
Member

Looks good. Thanks for your help 😄

Do you understand how to use this in your setup? I will also try it with nf-prov. Let me know if you need help with how to test on your end

@bentsherman bentsherman changed the title Add origPath as an argument to FileHolder Include remote path in FileHolder Mar 21, 2025
@rcannood
Copy link
Contributor Author

I wonder whether something needs to be updated in the constructors:

    FileHolder( Path inputFile ) {
        assert inputFile
        this.sourceObj = inputFile
        this.storePath = real(inputFile)
        this.stageName = norm(inputFile.getFileName())
    }
    FileHolder( def origin, Path path ) {
        assert origin != null
        assert path != null
        this.sourceObj = origin
        this.storePath = path // ← does this need to become `real(path)`?
        this.stageName = norm(path.getFileName())
    }
    protected FileHolder( def source, Path store, def stageName ) {
        this.sourceObj = source
        this.storePath = store // ← does this need to become `real(store)`?
        this.stageName = norm(stageName)
    }

Cfr the use case: I'm working on a nf-lamin plugin which allows users to run pre-existing Nextflow workflows and create records in the LaminDB regarding:

  • Which workflow was run
  • Information about the run itself
  • List of input files (e.g. s3://path/to/input/*.fastq.gz)
  • List of output files (e.g. s3://path/to/output/*.bam)

This package will use the observer to detect various events and then pass that information on to LaminDB.

@pditommaso
Copy link
Member

I think you should have a look at this on-going effort #5715

@bentsherman
Copy link
Member

I think this PR will benefit all of us -- the provenance store, nf-prov, and nf-lamin. Fundamentally it's about being able to trace staged files to their remote source, which we currently cannot do

The only snag is that this PR currently will change the task hash. When a task has a foreign input file, the locally staged path is used for the hash, not the remote path. This might actually be better, as it would decouple the task hash from the foreign file staging. But I wonder if the locally staged path was used for a reason

@bentsherman
Copy link
Member

bentsherman commented Mar 21, 2025

I have it working with nf-prov! nextflow-io/nf-prov#45

You can see the before/after in the linked PR. Turns out it doesn't affect the task hash. I think this is because the FilePorter uses the original URL to construct the hash for the stage directory, and since the file is hashed by either base name + size + timestamp (standard) or contents (deep), the remote file and staged file produce the same hash

@rcannood to your point about doing real(path), that function is mainly used to handle inputs that came from an upstream task, so it should not be needed here.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants