Defining metadata for output files for IRIDA Next #14

Open
apetkau opened this issue Jan 12, 2024 · 0 comments
1. Problem Statement

There is a need to be able to indicate file types for different output files to store in IRIDA Next (e.g., reads, assemblies, mlst allele profiles).

1.1. Adjust metadata associated with files

One set of solutions focuses on adjusting metadata/attributes (such as type and format) in IRIDA Next. There may be a need to extend this metadata to include additional keys (such as scheme for a cg/wgMLST scheme). This needs to be conveyed/defined by a pipeline (or at least, when IRIDA Next stores analysis results files, it needs to set the correct keys to the correct values).

1.2. Adjust file suffixes

Another set of solutions focuses on adjusting output file suffixes (e.g., *.fastq.gz or *.fasta.gz).

2. Solution 0: IRIDA Next defines attributes of files

Here, IRIDA Next defines a set of attributes for attached files based on different criteria (such as the file name). This is currently implemented within IRIDA Next. An example of the defined attributes can be found at:

https://github.com/phac-nml/irida-next/blob/72e0a6f32f932fa608ebccee0496b7c2d041e2db/test/fixtures/attachments.yml#L34-L43

attachmentPEREV1:
  metadata:
    {
      "type": "pe",
      "format": "fastq",
      "direction": "reverse",
      "compression": "none",
      "associated_attachment_id": <%= ActiveRecord::FixtureSet.identify(:attachmentPEFWD1) %>,
    }
  attachable: sampleB (Sample)

3. Solution 1: Define additional attributes in IRIDA Next

This solution involves defining additional attributes in IRIDA Next for the key file formats and types we need to work with. This makes it easier to select these file types within pipelines or from the IRIDA Next API. The values of these attributes will be determined by IRIDA Next.

3.1. Additional formats and types

  • format="json", type="mlst": For (cg/wg)MLST results stored as a JSON file.
  • format="fasta", type="assembly": For assembled genomes in FASTA format.
  • format="genbank", type="assembly": For assembled/annotated genomes in GenBank format.

3.2. Additional attributes

Beyond the new format/type values, it might make sense to define further attributes which can be set. In particular:

  • mlst_scheme: This attribute defines the particular MLST scheme the associated file represents. This could be read from information within the MLST alleles JSON file.
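
To make this solution more concrete, here is a minimal Python sketch of how attributes could be derived from a file's name and contents. It is illustrative only: the suffix table, the "gzip" compression value, the infer_attributes function, and the top-level "scheme" key in the MLST JSON are all assumptions and not part of IRIDA Next.

```python
import gzip
import json
from pathlib import Path

# Hypothetical suffix rules; the real criteria used by IRIDA Next may differ.
SUFFIX_RULES = [
    (".json", {"format": "json", "type": "mlst"}),
    (".fasta", {"format": "fasta", "type": "assembly"}),
    (".fa", {"format": "fasta", "type": "assembly"}),
    (".gbk", {"format": "genbank", "type": "assembly"}),
]


def infer_attributes(path):
    """Return a metadata dict for a file, based on its name and contents."""
    name = Path(path).name
    compression = "none"
    if name.endswith(".gz"):
        compression = "gzip"  # assumed value; only "none" appears in the fixture
        name = name[: -len(".gz")]

    attributes = {"compression": compression}
    for suffix, rule in SUFFIX_RULES:
        if name.endswith(suffix):
            attributes.update(rule)
            break

    # Assumption: the MLST JSON stores its scheme under a top-level "scheme" key.
    if attributes.get("type") == "mlst":
        opener = gzip.open if compression == "gzip" else open
        with opener(path, "rt") as handle:
            content = json.load(handle)
        if isinstance(content, dict) and "scheme" in content:
            attributes["mlst_scheme"] = content["scheme"]
    return attributes
```

Note that every *.json file is classified as MLST in this sketch, which shows one limit of rules based only on file names.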

3.3. Advantages

  • Minimal modifications needed for IRIDA Next

3.4. Disadvantages

  • Less flexible for extending file attributes in the future
  • Assignment of attributes is determined entirely by IRIDA Next rather than by pipeline developers

4. Solution 2: Define additional attributes in pipeline

Similar to Solution 1, IRIDA Next will include additional attributes associated with a file (such as mlst_scheme). However, the values of these attributes can be set by a pipeline in the iridanext.output.json output file.

Currently, the iridanext.output.json file defines the files associated with samples (or with the analysis pipeline as a whole) as a list of JSON objects, each of which includes the key path. This solution would extend this JSON structure by adding further keys to each file entry.

4.1. Example

For example, the following could be an iridanext.output.json output file:

iridanext.output.json

{
    "files": {
        "global": [ ],
        "samples": {
            "SampleA": [
                {"path": "assembly/assembly.fa.gz", "type": "assembly"},
                {"path": "mlst/alleles.json.gz", "type": "mlst", "mlst_scheme": "listeria-2024-01-01"}
            ]
        }
    }
}

That is, each file entry can carry an associated type, mlst_scheme, or other defined keys alongside its path.
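
As a rough idea of how a consumer could read these extra keys, here is a minimal Python sketch. It assumes that every key other than path is a file attribute; the file_attributes function is hypothetical and does not reflect how IRIDA Next actually parses this file.

```python
import json

# The example structure from this section, embedded as a string.
OUTPUT_JSON = """
{
    "files": {
        "global": [],
        "samples": {
            "SampleA": [
                {"path": "assembly/assembly.fa.gz", "type": "assembly"},
                {"path": "mlst/alleles.json.gz", "type": "mlst", "mlst_scheme": "listeria-2024-01-01"}
            ]
        }
    }
}
"""


def file_attributes(output):
    """Yield (sample, path, extra_attributes) for every sample file entry."""
    for sample, entries in output["files"]["samples"].items():
        for entry in entries:
            # Assumption: every key other than "path" is a file attribute.
            extra = {key: value for key, value in entry.items() if key != "path"}
            yield sample, entry["path"], extra


for sample, path, extra in file_attributes(json.loads(OUTPUT_JSON)):
    print(sample, path, extra)
```

Treating all non-path keys as attributes keeps the format open-ended, so a pipeline could introduce a new key without a change to the parser.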

4.2. Providing keys to iridanext.output.json

If the nf-iridanext plugin is used to write the iridanext.output.json file, then a Nextflow configuration along the following lines could be used to create the additional keys.

nextflow.config

iridanext {
    output {
        files {
            samples = [
                ["path": "**/assembly/*.assembly.fa.gz", "type": "assembly"],
                ["path": "**/mlst/alleles.json.gz", "type": "mlst", "mlst_scheme": "${params.scheme}"] 
            ]
        }
    }
}

Here, I assume that in the Nextflow pipeline --scheme is used to define the MLST scheme, which is passed as metadata to generate the final iridanext.output.json file.

4.3. Advantages

  • Pipeline developers can define the file attributes, rather than relying on code located in IRIDA Next.
    • This allows each pipeline to customize which attributes and values it uses.

4.4. Disadvantages

  • Requires more complicated code changes than Solution 1

4.5. Questions/Caveats

  • How to handle situations where both IRIDA Next and a pipeline attempt to write to the same attribute?

5. Solution 3: Name output files with specific suffixes

In this solution, output files to be saved by IRIDA Next are given specific suffixes. These suffixes define the file type and can be used to constrain file selection in a pipeline.

Specifically:

  • *.fastq.gz (or *.fq.gz): Defines reads (fastq format).
  • *.fasta.gz: An assembled genome (could also be *.assembly.fasta.gz).
  • *.mlst.json.gz: MLST allele profiles in JSON format.

The iridanext.output.json file would list the files with the appropriate names. That is:

iridanext.output.json

{
    "files": {
        "global": [ ],
        "samples": {
            "SampleA": [
                {"path": "reads/SampleA.fastq.gz"},
                {"path": "assembly/SampleA.assembly.fasta.gz"},
                {"path": "mlst/SampleA.mlst.json.gz"}
            ]
        }
    }
}
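
To illustrate how these suffixes could drive file selection, here is a minimal Python sketch. The suffixes come from the list in this section, but the type names ("reads", "assembly", "mlst") and the type_from_suffix and select_paths functions are assumptions made for illustration.

```python
# Suffixes from this section, paired with assumed type names.
# More specific suffixes are listed first so they take precedence.
SUFFIX_TYPES = [
    (".assembly.fasta.gz", "assembly"),
    (".mlst.json.gz", "mlst"),
    (".fastq.gz", "reads"),
    (".fq.gz", "reads"),
    (".fasta.gz", "assembly"),
]

EXAMPLE_OUTPUT = {
    "files": {
        "global": [],
        "samples": {
            "SampleA": [
                {"path": "reads/SampleA.fastq.gz"},
                {"path": "assembly/SampleA.assembly.fasta.gz"},
                {"path": "mlst/SampleA.mlst.json.gz"},
            ]
        },
    }
}


def type_from_suffix(path):
    """Return the assumed file type for a path, or None if no suffix matches."""
    for suffix, file_type in SUFFIX_TYPES:
        if path.endswith(suffix):
            return file_type
    return None


def select_paths(output, sample, file_type):
    """Return the paths for one sample whose suffix maps to file_type."""
    entries = output["files"]["samples"].get(sample, [])
    return [e["path"] for e in entries if type_from_suffix(e["path"]) == file_type]


print(select_paths(EXAMPLE_OUTPUT, "SampleA", "assembly"))
```

With this approach the file name is the only carrier of type information, so values such as an MLST scheme would have to be encoded in the name or read from the file itself.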