@gmod/gff

Read and write GFF3 data performantly. This module aims to be a complete implementation of the GFF3 specification.

NOTE: this module uses the NPM stream package, which requires Node.js polyfills for use on the web. We also created the https://github.com/cmdcolin/gff-nostream package, a non-streaming version that does not require polyfills.

  • streaming parsing and streaming formatting
  • proper escaping and unescaping of attribute and column values
  • supports features with multiple locations and features with multiple parents
  • reconstructs feature hierarchies of both Parent and Derives_from relationships
  • parses FASTA sections
  • does no validation except for referential integrity of Parent and Derives_from relationships (can disable Derives_from reference checking with disableDerivesFromReferences)
  • only compatible with GFF3

Install

$ npm install --save @gmod/gff

Usage

const gff = require('@gmod/gff').default
// or in ES6 (recommended)
import gff from '@gmod/gff'

const fs = require('fs')

// parse a file from a file name
// parses only features and sequences by default,
// set options to parse directives and/or comments
fs.createReadStream('path/to/my/file.gff3')
  .pipe(gff.parseStream({ parseAll: true }))
  .on('data', (data) => {
    if (data.directive) {
      console.log('got a directive', data)
    } else if (data.comment) {
      console.log('got a comment', data)
    } else if (data.sequence) {
      console.log('got a sequence from a FASTA section')
    } else {
      console.log('got a feature', data)
    }
  })

// parse a string of gff3 synchronously
const stringOfGFF3 = fs.readFileSync('my_annotations.gff3').toString()
const arrayOfThings = gff.parseStringSync(stringOfGFF3)

// format an array of items to a string
const newStringOfGFF3 = gff.formatSync(arrayOfThings)

// format a stream of things to a stream of text.
// inserts sync marks automatically.
myStreamOfGFF3Objects
  .pipe(gff.formatStream())
  .pipe(fs.createWriteStream('my_new.gff3'))

// format a stream of things and write it to
// a gff3 file. inserts sync marks and a
// '##gff-version 3' header if one is not
// already present
gff.formatFile(
  myStreamOfGFF3Objects,
  fs.createWriteStream('my_new_2.gff3', { encoding: 'utf8' }),
)

Object format

features

In GFF3, features can have more than one location. We parse each feature as an array of all the lines that share that feature's ID. Values that are . in the GFF3 are null in the output.

A simple feature that's located in just one place:

[
  {
    "seq_id": "ctg123",
    "source": null,
    "type": "gene",
    "start": 1000,
    "end": 9000,
    "score": null,
    "strand": "+",
    "phase": null,
    "attributes": {
      "ID": ["gene00001"],
      "Name": ["EDEN"]
    },
    "child_features": [],
    "derived_features": []
  }
]

A CDS called cds00001 located in two places:

[
  {
    "seq_id": "ctg123",
    "source": null,
    "type": "CDS",
    "start": 1201,
    "end": 1500,
    "score": null,
    "strand": "+",
    "phase": "0",
    "attributes": {
      "ID": ["cds00001"],
      "Parent": ["mRNA00001"]
    },
    "child_features": [],
    "derived_features": []
  },
  {
    "seq_id": "ctg123",
    "source": null,
    "type": "CDS",
    "start": 3000,
    "end": 3902,
    "score": null,
    "strand": "+",
    "phase": "0",
    "attributes": {
      "ID": ["cds00001"],
      "Parent": ["mRNA00001"]
    },
    "child_features": [],
    "derived_features": []
  }
]
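
Because a feature is an array of lines, code that consumes features usually has to loop over every location. The following is a minimal sketch that works on the object shape shown above; `featureSpan` is a hypothetical helper for illustration and is not part of this module.

```javascript
// Hypothetical helper (not part of @gmod/gff): summarize a feature,
// which is an array of one or more located lines.
function featureSpan(feature) {
  // Lines whose start or end is null (a "." in the file) are skipped.
  const located = feature.filter((l) => l.start !== null && l.end !== null)
  return {
    start: Math.min(...located.map((l) => l.start)),
    end: Math.max(...located.map((l) => l.end)),
    // GFF3 coordinates are 1-based and inclusive, so a line covers
    // end - start + 1 bases.
    totalLength: located.reduce((sum, l) => sum + (l.end - l.start + 1), 0),
  }
}

// The two-location CDS from the example above, trimmed to the fields used.
const cds = [
  { seq_id: 'ctg123', type: 'CDS', start: 1201, end: 1500 },
  { seq_id: 'ctg123', type: 'CDS', start: 3000, end: 3902 },
]
console.log(featureSpan(cds))
```

The sketch assumes at least one line has coordinates; a feature with none would need separate handling.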

directives

parseDirective("##gff-version 3\n")
// returns
{
  "directive": "gff-version",
  "value": "3"
}
parseDirective('##sequence-region ctg123 1 1497228\n')
// returns
{
  "directive": "sequence-region",
  "value": "ctg123 1 1497228",
  "seq_id": "ctg123",
  "start": "1",
  "end": "1497228"
}
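
Note that start and end in a sequence-region directive are strings, not numbers, so they need converting before any arithmetic. A small sketch under that assumption; `regionLength` is a hypothetical helper, not something this module exports.

```javascript
// Hypothetical helper (not part of @gmod/gff): length of the region
// described by a parsed sequence-region directive.
function regionLength(directive) {
  if (directive.directive !== 'sequence-region') {
    throw new Error(`not a sequence-region directive: ${directive.directive}`)
  }
  // start and end arrive as strings, so convert them explicitly.
  const start = Number.parseInt(directive.start, 10)
  const end = Number.parseInt(directive.end, 10)
  // Coordinates are 1-based and inclusive.
  return end - start + 1
}

const region = {
  directive: 'sequence-region',
  value: 'ctg123 1 1497228',
  seq_id: 'ctg123',
  start: '1',
  end: '1497228',
}
console.log(regionLength(region))
```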

comments

parseComment('# hi this is a comment\n')
// returns
{
  "comment": "hi this is a comment"
}

sequences

These come from any embedded ##FASTA section in the GFF3 file.

parseSequences(`##FASTA
>ctgA test contig
ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA`)
// returns
[
  {
    id: 'ctgA',
    description: 'test contig',
    sequence: 'ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA',
  }
]
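
Parsed sequences can be paired with feature lines through the shared seq_id. The sketch below is one possible way to do that with plain objects; `indexSequences` and `basesFor` are hypothetical helpers, and strand handling (reverse complement) is deliberately left out.

```javascript
// Hypothetical helper: index parsed sequence items by their id.
// Non-sequence items (features, directives, comments) are ignored.
function indexSequences(items) {
  const byId = new Map()
  for (const item of items) {
    if (item.sequence !== undefined) byId.set(item.id, item)
  }
  return byId
}

// Hypothetical helper: bases covered by one feature line, read on the
// forward strand. Returns undefined if the sequence is not embedded.
function basesFor(line, byId) {
  const seq = byId.get(line.seq_id)
  if (!seq) return undefined
  // GFF3 is 1-based inclusive; slice() is 0-based and end-exclusive.
  return seq.sequence.slice(line.start - 1, line.end)
}

const sequences = indexSequences([
  {
    id: 'ctgA',
    description: 'test contig',
    sequence: 'ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA',
  },
])
console.log(basesFor({ seq_id: 'ctgA', start: 5, end: 8 }, sequences))
```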

API

ParseOptions

Parser options

disableDerivesFromReferences

Whether to disable resolving references to features named in Derives_from attributes

Type: boolean

encoding

Text encoding of the input GFF3. default 'utf8'

Type: BufferEncoding

parseFeatures

Whether to parse features, default true

Type: boolean

parseDirectives

Whether to parse directives, default false

Type: boolean

parseComments

Whether to parse comments, default false

Type: boolean

parseSequences

Whether to parse sequences, default true

Type: boolean

parseAll

Parse all features, directives, comments, and sequences. Overrides other parsing options. Default false.

Type: boolean

bufferSize

Maximum number of GFF3 lines to buffer, default 1000

Type: number
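
To make the interaction between the flags concrete, here is a sketch that restates the documented defaults and the parseAll override as plain JavaScript. It mirrors the descriptions above only; `effectiveParseFlags` is a hypothetical name and this is not the module's own option handling.

```javascript
// Defaults copied from the option descriptions above.
const documentedDefaults = {
  parseFeatures: true,
  parseDirectives: false,
  parseComments: false,
  parseSequences: true,
  parseAll: false,
}

// Hypothetical helper: which item kinds would be parsed for given options.
function effectiveParseFlags(options = {}) {
  const merged = { ...documentedDefaults, ...options }
  if (merged.parseAll) {
    // parseAll overrides the individual flags.
    return {
      parseFeatures: true,
      parseDirectives: true,
      parseComments: true,
      parseSequences: true,
    }
  }
  const { parseAll, ...flags } = merged
  return flags
}

console.log(effectiveParseFlags({ parseComments: true }))
```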

parseStream

Parse a stream of text data into a stream of feature, directive, comment, and sequence objects.

Parameters

  • options ParseOptions Parsing options (optional, default {})

Returns GFFTransform stream (in objectMode) of parsed items
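
Items arriving on the stream can be told apart by shape: features are arrays, while directives, comments, and sequences are objects with a distinguishing key. A minimal sketch of such a dispatcher, suitable for use inside a 'data' handler; `itemKind` and `countKinds` are hypothetical helpers.

```javascript
// Hypothetical helper: classify one parsed item by its shape.
function itemKind(item) {
  if (Array.isArray(item)) return 'feature' // features are arrays of lines
  if (item.directive !== undefined) return 'directive'
  if (item.comment !== undefined) return 'comment'
  if (item.sequence !== undefined) return 'sequence'
  return 'unknown'
}

// Hypothetical helper: tally item kinds, e.g. over collected stream output.
function countKinds(items) {
  const counts = {}
  for (const item of items) {
    const kind = itemKind(item)
    counts[kind] = (counts[kind] || 0) + 1
  }
  return counts
}

console.log(
  countKinds([
    { directive: 'gff-version', value: '3' },
    [{ seq_id: 'ctg123', type: 'gene', start: 1000, end: 9000 }],
    { comment: 'hi this is a comment' },
  ]),
)
```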

parseStringSync

Synchronously parse a string containing GFF3 and return an array of the parsed items.

Parameters

  • str string GFF3 string
  • inputOptions ({disableDerivesFromReferences: boolean?, encoding: BufferEncoding?, bufferSize: number?} | undefined)? Parsing options

Returns Array<(GFF3Feature | GFF3Sequence)> array of parsed features, directives, comments and/or sequences

formatSync

Format an array of GFF3 items (features, directives, comments) into a string of GFF3. Does not insert synchronization (###) marks.

Parameters

  • items Array<GFF3Item> Array of features, directives, comments and/or sequences

Returns string the formatted GFF3

formatStream

Format a stream of features, directives, comments and/or sequences into a stream of GFF3 text.

Inserts synchronization (###) marks automatically.

Parameters

  • options FormatOptions formatting options (optional, default {})

Returns FormattingTransform

formatFile

Format a stream of features, directives, comments and/or sequences into a GFF3 file and write it to the filesystem.

Inserts synchronization (###) marks and a ##gff-version directive automatically (if one is not already present).

Parameters

  • stream Readable the stream to write to the file
  • writeStream Writable
  • options FormatOptions formatting options (optional, default {})
  • filename the file path to write to

Returns Promise<null> promise for null that resolves when the stream has been written

About util

There is also a util module that contains super-low-level functions for dealing with lines and parts of lines.

// non-ES6
const util = require('@gmod/gff').default.util
// or, with ES6
import gff from '@gmod/gff'
const util = gff.util

const gff3Lines = util.formatItem({
  seq_id: 'ctgA',
  ...
})

util

unescape

Unescape a string value used in a GFF3 attribute.

Parameters

  • stringVal string Escaped GFF3 string value

Returns string An unescaped string value

escape

Escape a value for use in a GFF3 attribute value.

Parameters

Returns string An escaped string value
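
GFF3 escaping is percent-encoding of reserved characters. The sketch below illustrates the idea for attribute values; the exact character set is an assumption based on the GFF3 specification (control characters, %, and the attribute delimiters ; = & ,) and may differ from what this module escapes. `escapeSketch` and `unescapeSketch` are hypothetical names.

```javascript
// Sketch only: percent-encode characters reserved in GFF3 attribute values.
// The character class is an assumption taken from the GFF3 specification.
function escapeSketch(value) {
  return String(value).replace(
    /[\x00-\x1f\x7f%;=&,]/g,
    (ch) => '%' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0'),
  )
}

// Sketch only: turn every %XX hex escape back into its character.
function unescapeSketch(text) {
  return text.replace(/%([0-9A-Fa-f]{2})/g, (_match, hex) =>
    String.fromCharCode(Number.parseInt(hex, 16)),
  )
}

const raw = 'kinase; binds ATP, 50% active'
const escaped = escapeSketch(raw)
console.log(escaped)
console.log(unescapeSketch(escaped) === raw)
```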

escapeColumn

Escape a value for use in a GFF3 column value.

Parameters

Returns string An escaped column value

parseAttributes

Parse the 9th column (attributes) of a GFF3 feature line.

Parameters

  • attrString string String of GFF3 9th column

Returns GFF3Attributes Parsed attributes
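
For orientation, the ninth column is a semicolon-separated list of tag=value pairs, where a value may hold several comma-separated entries. The following sketch shows how such a string maps onto the attributes shape used in this document. It is an illustration, not this module's implementation; `parseAttributesSketch` is a hypothetical name, and returning an empty object for "." is an assumption.

```javascript
// Sketch only: decode %XX escapes.
const decode = (s) =>
  s.replace(/%([0-9A-Fa-f]{2})/g, (_m, hex) =>
    String.fromCharCode(Number.parseInt(hex, 16)),
  )

// Sketch only: parse a GFF3 attributes column into { tag: [values...] }.
function parseAttributesSketch(attrString) {
  const attrs = {}
  if (!attrString || attrString === '.') return attrs
  for (const pair of attrString.replace(/\r?\n$/, '').split(';')) {
    const eq = pair.indexOf('=')
    if (eq === -1) continue // skip empty or malformed fragments
    const tag = decode(pair.slice(0, eq).trim())
    // Split on commas before decoding, so escaped commas stay in one value.
    const values = pair
      .slice(eq + 1)
      .split(',')
      .map((v) => decode(v.trim()))
    attrs[tag] = (attrs[tag] || []).concat(values)
  }
  return attrs
}

console.log(parseAttributesSketch('ID=cds00001;Parent=mRNA00001,mRNA00002'))
```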

parseFeature

Parse a GFF3 feature line.

Parameters

  • line string GFF3 feature line

Returns GFF3FeatureLine The parsed feature

parseDirective

Parse a GFF3 directive line.

Parameters

  • line string GFF3 directive line

Returns (GFF3Directive | GFF3SequenceRegionDirective | GFF3GenomeBuildDirective | null) The parsed directive

formatAttributes

Format an attributes object into a string suitable for the 9th column of GFF3.

Parameters

Returns string GFF3 9th column string

formatFeature

Format a feature object or array of feature objects into one or more lines of GFF3.

Parameters

Returns string A string of one or more GFF3 lines

formatDirective

Format a directive into a line of GFF3.

Parameters

Returns string A directive line string

formatComment

Format a comment into a GFF3 comment. Yes I know this is just adding a # and a newline.

Parameters

Returns string A comment line string

formatSequence

Format a sequence object as FASTA.

Parameters

Returns string Formatted single FASTA sequence string

formatItem

Format a directive, comment, sequence, or feature, or array of such items, into one or more lines of GFF3.

Parameters

Returns (string | Array<string>) A formatted string or array of strings

GFF3Attributes

A record of GFF3 attribute identifiers and the values of those identifiers

Type: Record<string, (Array<string> | undefined)>

GFF3FeatureLine

A representation of a single line of a GFF3 file

seq_id

The ID of the landmark used to establish the coordinate system for the current feature

Type: (string | null)

source

A free text qualifier intended to describe the algorithm or operating procedure that generated this feature

Type: (string | null)

type

The type of the feature

Type: (string | null)

start

The start coordinates of the feature

Type: (number | null)

end

The end coordinates of the feature

Type: (number | null)

score

The score of the feature

Type: (number | null)

strand

The strand of the feature

Type: (string | null)

phase

For features of type "CDS", the phase indicates where the next codon begins relative to the 5' end of the current CDS feature

Type: (string | null)

attributes

Feature attributes

Type: (GFF3Attributes | null)
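
The nine fields above correspond, in order, to the nine tab-separated columns of a GFF3 line, with null written as a dot. The sketch below shows that mapping for a single line. It omits escaping entirely and is not the module's formatter; `formatLineSketch` is a hypothetical name.

```javascript
// Sketch only: render one feature line as nine tab-separated columns.
// No escaping is applied, so values must not contain reserved characters.
function formatLineSketch(line) {
  const attrs = line.attributes
    ? Object.entries(line.attributes)
        .filter(([, values]) => values !== undefined)
        .map(([tag, values]) => `${tag}=${values.join(',')}`)
        .join(';')
    : ''
  const columns = [
    line.seq_id,
    line.source,
    line.type,
    line.start,
    line.end,
    line.score,
    line.strand,
    line.phase,
    attrs === '' ? null : attrs,
  ]
  // null (or missing) values become "." in the output.
  return columns.map((c) => (c === null || c === undefined ? '.' : String(c))).join('\t')
}

console.log(
  formatLineSketch({
    seq_id: 'ctg123',
    source: null,
    type: 'gene',
    start: 1000,
    end: 9000,
    score: null,
    strand: '+',
    phase: null,
    attributes: { ID: ['gene00001'], Name: ['EDEN'] },
  }),
)
```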

GFF3FeatureLineWithRefs

Extends GFF3FeatureLine

A GFF3 Feature line that includes references to other features defined in their "Parent" or "Derives_from" attributes

child_features

An array of child features

Type: Array<GFF3Feature>

derived_features

An array of features derived from this feature

Type: Array<GFF3Feature>

GFF3Feature

A GFF3 feature, which may include multiple individual feature lines

Type: Array<GFF3FeatureLineWithRefs>
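
Since every line carries its own child_features array, and each child is itself a GFF3Feature, a hierarchy can be walked recursively. A minimal sketch that collects the IDs in a feature tree; `collectIds` is a hypothetical helper, and the sample data is made up to match the shapes documented here.

```javascript
// Hypothetical helper: gather every distinct ID in a feature and its
// descendants, in depth-first order.
function collectIds(feature, ids = []) {
  for (const line of feature) {
    const id = line.attributes && line.attributes.ID ? line.attributes.ID[0] : undefined
    // A multi-location feature repeats its ID on each line, and a feature
    // with several parents can be reached more than once, so de-duplicate.
    if (id !== undefined && !ids.includes(id)) ids.push(id)
    for (const child of line.child_features || []) collectIds(child, ids)
  }
  return ids
}

const cdsFeature = [
  { attributes: { ID: ['cds00001'] }, child_features: [], derived_features: [] },
  { attributes: { ID: ['cds00001'] }, child_features: [], derived_features: [] },
]
const mrnaFeature = [
  { attributes: { ID: ['mRNA00001'] }, child_features: [cdsFeature], derived_features: [] },
]
const geneFeature = [
  { attributes: { ID: ['gene00001'] }, child_features: [mrnaFeature], derived_features: [] },
]
console.log(collectIds(geneFeature))
```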

GFF3Directive

A GFF3 directive

directive

The name of the directive

Type: string

value

The string value of the directive

Type: string

GFF3SequenceRegionDirective

Extends GFF3Directive

A GFF3 sequence-region directive

value

The string value of the directive

Type: string

seq_id

The sequence ID parsed from the directive

Type: string

start

The sequence start parsed from the directive

Type: string

end

The sequence end parsed from the directive

Type: string

GFF3GenomeBuildDirective

Extends GFF3Directive

A GFF3 genome-build directive

value

The string value of the directive

Type: string

source

The genome build source parsed from the directive

Type: string

buildName

The genome build name parsed from the directive

Type: string

GFF3Comment

A GFF3 comment

comment

The text of the comment

Type: string

GFF3Sequence

A GFF3 FASTA single sequence

id

The ID of the sequence

Type: string

description

The description of the sequence

Type: string

sequence

The sequence

Type: string
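
A sequence object holds everything needed for one FASTA record: the header is the id followed by the optional description, and the body is the sequence. The sketch below shows one way to render it. Wrapping the body at a fixed width is a choice made here for illustration and is not a claim about how this module's formatSequence lays out its output; `toFasta` is a hypothetical name.

```javascript
// Sketch only: render a sequence object as a FASTA record.
// The line width is an illustrative choice, not the module's behavior.
function toFasta(seq, width = 60) {
  const header = seq.description ? `>${seq.id} ${seq.description}` : `>${seq.id}`
  const lines = []
  for (let i = 0; i < seq.sequence.length; i += width) {
    lines.push(seq.sequence.slice(i, i + width))
  }
  return [header, ...lines].join('\n') + '\n'
}

console.log(
  toFasta({
    id: 'ctgA',
    description: 'test contig',
    sequence: 'ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA',
  }),
)
```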

License

MIT © Robert Buels