Defining metadata for output files for IRIDA Next #14

apetkau · 2024-01-12T02:56:04Z

1. Problem Statement

IRIDA Next needs a way to indicate the file type of each output file it stores (e.g., reads, assemblies, MLST allele profiles).

1.1. Adjust metadata associated with files

One set of solutions focuses on adjusting metadata/attributes (such as type and format) in IRIDA Next. There may be a need to extend this metadata to include additional keys (such as scheme for a cg/wgMLST scheme). This needs to be conveyed/defined by a pipeline (or at least, when IRIDA Next stores analysis results files, it needs to set the correct keys to the correct values).

1.2. Adjust file suffixes

Another set of solutions focuses on adjusting output file suffixes (e.g., *.fastq.gz or *.fasta.gz).

2. Solution 0: IRIDA Next defines attributes of files

Here, IRIDA Next defines a set of attributes for attached files based on different criteria (such as the file name). This is currently implemented within IRIDA Next. An example of defined attributes can be found at:

https://github.com/phac-nml/irida-next/blob/72e0a6f32f932fa608ebccee0496b7c2d041e2db/test/fixtures/attachments.yml#L34-L43

attachmentPEREV1:
  metadata:
    {
      "type": "pe",
      "format": "fastq",
      "direction": "reverse",
      "compression": "none",
      "associated_attachment_id": <%= ActiveRecord::FixtureSet.identify(:attachmentPEFWD1) %>,
    }
  attachable: sampleB (Sample)
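
To make the idea of deriving attributes from a file name concrete, here is a minimal sketch in Python. It is not the IRIDA Next implementation (which lives in the Rails application); the function name, the `_1`/`_R1` paired-end naming convention, and the "gzip" compression value are all assumptions made for illustration. Only the keys and the values "pe", "fastq", "reverse", and "none" come from the fixture above.

```python
import re


def infer_attributes(filename: str) -> dict:
    """Guess attachment metadata from a file name (hypothetical rules)."""
    attrs = {}
    name = filename.lower()

    # Compression: "none" matches the fixture; "gzip" is an assumed value.
    if name.endswith(".gz"):
        attrs["compression"] = "gzip"
        name = name[: -len(".gz")]
    else:
        attrs["compression"] = "none"

    if name.endswith((".fastq", ".fq")):
        attrs["format"] = "fastq"
        stem = name.rsplit(".", 1)[0]
        # Assumed convention: a trailing _1/_R1 is forward, _2/_R2 is reverse.
        match = re.search(r"_r?([12])$", stem)
        if match:
            attrs["type"] = "pe"
            attrs["direction"] = "forward" if match.group(1) == "1" else "reverse"
    elif name.endswith((".fasta", ".fa", ".fna")):
        attrs["format"] = "fasta"

    return attrs


print(infer_attributes("sampleB_R2.fastq"))
```

Under these assumed rules, an uncompressed reverse read file yields the same type, format, direction, and compression values as the fixture; the associated_attachment_id link would still have to be resolved separately.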

3. Solution 1: Define additional attributes in IRIDA Next

This solution adds attributes in IRIDA Next for the key file formats and types we need to work with. This makes it easier to select these file types within pipelines or from the IRIDA Next API. The values of these attributes will be determined by IRIDA Next.

3.1. Additional formats and types

format="json", type="mlst": For (cg/wg)MLST results stored as a JSON file.
format="fasta", type="assembly": For assembled genomes in FASTA format.
format="genbank", type="assembly": For assembled/annotated genomes in GenBank format.

3.2. Additional attributes

Beyond the new format/type values, it might make sense to define further attributes which can be set. In particular:

mlst_scheme: This attribute defines the particular MLST scheme the associated file represents. This could be read from information within the MLST alleles JSON file.

3.3. Advantages

Minimal modifications needed for IRIDA Next

3.4. Disadvantages

Less flexible for extending file attributes in the future
Assignment of attributes determined entirely by IRIDA Next (pipeline developers cannot set them)

4. Solution 2: Define additional attributes in pipeline

Similar to Solution 1, IRIDA Next will include additional attributes associated with a file (such as mlst_scheme). However, the values of these attributes can be set by a pipeline in the iridanext.output.json output file.

Currently, the iridanext.output.json defines files associated with samples (or with the analysis pipeline as a whole) as a list of JSON objects which includes the key path. This solution would extend this JSON structure to add additional keys associated with files.

4.1. Example

For example, the following could be an iridanext.output.json output file:

iridanext.output.json

{
    "files": {
        "global": [ ],
        "samples": {
            "SampleA": [
                {"path": "assembly/assembly.fa.gz", "type": "assembly"},
                {"path": "mlst/alleles.json.gz", "type": "mlst", "mlst_scheme": "listeria-2024-01-01"}
            ]
        }
    }
}

That is, each file entry carries a path plus any associated attributes, such as type or mlst_scheme (or other defined keywords).
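
As a rough illustration of how a consumer could read this extended structure, the sketch below parses the example and treats every key other than path as a file attribute. The function name and the "everything except path is metadata" rule are assumptions for illustration, not part of IRIDA Next or nf-iridanext.

```python
import json

# The example document from above, embedded so the sketch is self-contained.
EXAMPLE = """
{
    "files": {
        "global": [],
        "samples": {
            "SampleA": [
                {"path": "assembly/assembly.fa.gz", "type": "assembly"},
                {"path": "mlst/alleles.json.gz", "type": "mlst", "mlst_scheme": "listeria-2024-01-01"}
            ]
        }
    }
}
"""


def file_attributes(document: dict) -> dict:
    """Map (sample, path) to the extra keys attached to that file entry."""
    result = {}
    for sample, entries in document["files"]["samples"].items():
        for entry in entries:
            # Assumed rule: every key except "path" is a file attribute.
            extra = {key: value for key, value in entry.items() if key != "path"}
            result[(sample, entry["path"])] = extra
    return result


attributes = file_attributes(json.loads(EXAMPLE))
for (sample, path), extra in attributes.items():
    print(sample, path, extra)
```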

4.2. Providing keys to `iridanext.output.json`

If the nf-iridanext plugin is used to write the iridanext.output.json file, then a Nextflow configuration like the following could be used to create the additional keys (this is proposed syntax, not something the plugin supports today).

nextflow.config

iridanext {
    output {
        files {
            samples = [
                ["path": "**/assembly/*.assembly.fa.gz", "type": "assembly"],
                ["path": "**/mlst/alleles.json.gz", "type": "mlst", "mlst_scheme": "${params.scheme}"] 
            ]
        }
    }
}

Here, I assume that in the Nextflow pipeline --scheme is used to define the MLST scheme, which is passed as metadata to generate the final iridanext.output.json file.

4.3. Advantages

Pipeline developers can define the file attributes rather than code located in IRIDA Next.
- This allows each pipeline to customize the type of attributes and values to use.

4.4. Disadvantages

More complicated code changes

4.5. Questions/Caveats

How to handle situations where both IRIDA Next and a pipeline attempt to write to the same attribute?
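
This question is still open. To make the options easier to discuss, here is a minimal sketch of three possible conflict policies. The policy names ("irida_wins", "pipeline_wins", "reject") and the function itself are hypothetical and not a proposal for the actual implementation.

```python
def merge_attributes(irida_attrs: dict, pipeline_attrs: dict, policy: str = "irida_wins"):
    """Merge two attribute sets; return (merged, sorted list of conflicting keys)."""
    # A conflict is a key set by both sides with different values.
    conflicts = sorted(
        key
        for key in irida_attrs.keys() & pipeline_attrs.keys()
        if irida_attrs[key] != pipeline_attrs[key]
    )

    if policy == "reject":
        if conflicts:
            raise ValueError(f"conflicting attributes: {conflicts}")
        merged = {**irida_attrs, **pipeline_attrs}
    elif policy == "irida_wins":
        # Later dict wins, so IRIDA Next values overwrite pipeline values.
        merged = {**pipeline_attrs, **irida_attrs}
    elif policy == "pipeline_wins":
        merged = {**irida_attrs, **pipeline_attrs}
    else:
        raise ValueError(f"unknown policy: {policy}")

    return merged, conflicts


merged, conflicts = merge_attributes(
    {"format": "fasta", "type": "unknown"},
    {"type": "assembly"},
    policy="pipeline_wins",
)
print(merged, conflicts)
```

Returning the list of conflicting keys alongside the merged result would let IRIDA Next log or surface conflicts regardless of which policy is chosen.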

5. Solution 3: Name output files with specific suffixes

In this solution, output files to be saved by IRIDA Next carry specific suffixes, which define the file type and constrain which files can be selected in a pipeline.

Specifically:

*.fastq.gz (or *.fq.gz): Defines reads (fastq format).
*.fasta.gz: An assembled genome (could also be *.assembly.fasta.gz).
*.mlst.json.gz: MLST allele profiles in JSON format.

The iridanext.output.json file would list the files with the appropriate names. That is:

iridanext.output.json

{
    "files": {
        "global": [ ],
        "samples": {
            "SampleA": [
                {"path": "reads/SampleA.fastq.gz"},
                {"path": "assembly/SampleA.assembly.fasta.gz"},
                {"path": "mlst/SampleA.mlst.json.gz"}
            ]
        }
    }
}
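
A minimal sketch of how a file type could be derived from these suffixes is shown below. The lookup table and function name are hypothetical, and the type name "reads" is an assumption ("assembly" and "mlst" are taken from Solution 1). More specific suffixes are listed first so that *.assembly.fasta.gz is matched before the generic *.fasta.gz.

```python
# Ordered from most specific to least specific suffix (assumed table).
SUFFIX_TYPES = [
    (".assembly.fasta.gz", "assembly"),
    (".mlst.json.gz", "mlst"),
    (".fastq.gz", "reads"),
    (".fq.gz", "reads"),
    (".fasta.gz", "assembly"),
]


def type_from_suffix(path: str):
    """Return the file type implied by the suffix, or None if unrecognized."""
    for suffix, file_type in SUFFIX_TYPES:
        if path.endswith(suffix):
            return file_type
    return None


for path in (
    "reads/SampleA.fastq.gz",
    "assembly/SampleA.assembly.fasta.gz",
    "mlst/SampleA.mlst.json.gz",
):
    print(path, type_from_suffix(path))
```

Note that a plain alleles.json.gz would not be recognized under this scheme, which is why the MLST file is renamed to *.mlst.json.gz in the example above.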

apetkau added irida-next-integration output labels Jan 15, 2024

apetkau mentioned this issue Jan 17, 2024

Constraints on input data for IRIDA Next #13

Closed
