First version of library and linter (#1)
dweindl authored Nov 28, 2018
1 parent c68c167 commit d732209
Showing 11 changed files with 1,161 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -102,3 +102,6 @@ venv.bak/

# mypy
.mypy_cache/

# Pycharm
.idea
10 changes: 10 additions & 0 deletions README.md
@@ -0,0 +1,10 @@
# PEtab: a data format for specifying parameter estimation problems in systems biology

This repository describes *PEtab*, a data format for specifying parameter estimation problems in systems biology, and provides a Python library for easy access to and validation of *PEtab* files. See [doc/documentation_data_format.md](doc/documentation_data_format.md) for more info.

Where PEtab is used:

- Systems biology optimization [benchmark problem collection](https://github.com/LoosC/Benchmark-Models/issues)
- [pyPESTO](https://github.com/ICB-DCM/pyPESTO/)


62 changes: 62 additions & 0 deletions bin/petablint.py
@@ -0,0 +1,62 @@
#!/usr/bin/env python3

"""Command line tool to check for correct format"""

import argparse
import sys
import os
sys.path.append(os.path.split(os.path.dirname(os.path.realpath(__file__)))[0])
import petab


def parse_cli_args():
    """Parse command line arguments"""

    parser = argparse.ArgumentParser(
        description='Check if a set of files adheres to the PEtab format.')

    parser.add_argument('-s', '--sbml', dest='sbml_file_name',
                        help='SBML model filename', required=True)
    parser.add_argument('-m', '--measurements', dest='measurement_file_name',
                        help='Measurement table', required=True)
    parser.add_argument('-c', '--conditions', dest='condition_file_name',
                        help='Conditions table', required=True)
    parser.add_argument('-p', dest='parameter_file_name',
                        help='Parameter table', required=True)

    args = parser.parse_args()

    return args


def main():
    args = parse_cli_args()

    problem = petab.OptimizationProblem(args.sbml_file_name,
                                        args.measurement_file_name,
                                        args.condition_file_name,
                                        args.parameter_file_name)

    mandatory_measurement_fields = {'observableId',
                                    # 'preequilibrationConditionId',
                                    'simulationConditionId',
                                    'measurement',
                                    'time',
                                    # 'observableParameters',
                                    # 'noiseParameters',
                                    # 'observableTransformation',
                                    }
    missing_fields = mandatory_measurement_fields - set(problem.measurement_df.columns.values)
    if missing_fields:
        raise AssertionError(f'Missing measurements table fields {missing_fields}')

    # TODO: extend

    petab.lint.assert_measured_observables_present_in_model(problem.measurement_df, problem.sbml_model)
    petab.lint.assert_overrides_match_parameter_count(problem.measurement_df,
                                                      petab.get_observables(problem.sbml_model, remove=False),
                                                      petab.get_sigmas(problem.sbml_model, remove=False))


if __name__ == '__main__':
    main()
240 changes: 240 additions & 0 deletions doc/documentation_data_format.md
@@ -0,0 +1,240 @@
# Optimization problem data format specification

This document explains the data format used for the benchmark collection.


## Purpose

To provide a standardized way of specifying parameter estimation problems in systems biology, especially for Ordinary Differential Equation (ODE) models.


## Overview

This data format specifies a parameter estimation problem using a number of text-based files ([Systems Biology Markup Language (SBML)](http://sbml.org) and [Tab-Separated Values (TSV)](https://www.iana.org/assignments/media-types/text/tab-separated-values)), i.e.

- An SBML model [SBML]

- A measurement file to fit the model to [TSV]

- A "condition" file specifying model inputs and condition-specific parameters [TSV]

The following sections will describe the minimum requirements of those components in the core standard, which should provide all information for defining the parameter estimation problem.

Extensions of this format (e.g. additional columns in the measurement table) are possible and intended. However, those columns should provide extra information for example for plotting, or for more efficient parameter estimation, but they should not change the optimum of the optimization problem. Some optional extensions are described in the last section, "Extensions", of this document.

## SBML model definition

The model must be specified as valid SBML. Since parameter estimation is beyond the scope of SBML, there exists no standard way to specify observables (model outputs) and respective noise models. Therefore, we use the following convention.

### Observables

In the SBML model, observables are specified as `AssignmentRules` assigning to parameters with `id`s starting with `observable_` followed by the `observableId` as in the corresponding column of the *measurement sheet* (see below).

E.g.
```
observable_pErk = observableParameter1 + observableParameter2*pErk
```
where `observableParameter1` would be an offset, and `observableParameter2` a scaling parameter.
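
As a rough illustration of this naming convention, the sketch below recovers `observableId`s from a list of SBML parameter IDs by stripping the `observable_` prefix. The helper name `observable_ids` and the example IDs are assumptions made for illustration only; this is not the API of the `petab` library.

```python
OBSERVABLE_PREFIX = "observable_"


def observable_ids(parameter_ids):
    """Return the observableId for every ID that follows the convention."""
    return [
        pid[len(OBSERVABLE_PREFIX):]
        for pid in parameter_ids
        if pid.startswith(OBSERVABLE_PREFIX)
    ]


# Hypothetical parameter IDs, as they might appear in an SBML model
example_ids = ["observable_pErk", "k1", "sigma_pErk", "observable_tAkt"]
print(observable_ids(example_ids))
```

IDs without the prefix, such as ordinary model parameters or `sigma_` parameters, are simply skipped.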

### Noise model

Measurement noise can be specified as a numerical value in the `noiseParameters` column of the *measurement sheet* (see below), which will default to a Gaussian noise model with standard deviation as provided in `noiseParameters`.

Alternatively, more complex noise models can be specified for each observable, using additional `AssignmentRules`. Those noise model rules assign to `sigma_${observableId}` parameters.
A noise model which accounts for relative and absolute contributions could, e.g., be defined as
```
sigma_pErk = noiseParameter1 + noiseParameter2*pErk
```
with `noiseParameter1` denoting the absolute and `noiseParameter2` the relative contribution.

Any parameters named `noiseParameter${1..n}` *must* be overwritten in the `noiseParameters` column of the measurement file (see below).
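
The following sketch evaluates the example noise model above for one data point and plugs the resulting standard deviation into a Gaussian negative log-likelihood. The function names and numeric values are hypothetical, and the likelihood is the textbook Gaussian formula rather than anything this specification prescribes.

```python
import math


def sigma_pErk(pErk, noise_parameter1, noise_parameter2):
    """Standard deviation with an absolute and a relative contribution."""
    return noise_parameter1 + noise_parameter2 * pErk


def gaussian_neg_log_likelihood(measurement, simulation, sigma):
    """Negative log-likelihood of one measurement under Gaussian noise."""
    residual = (measurement - simulation) / sigma
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + 0.5 * residual ** 2


# Hypothetical values: simulated pErk, and noiseParameter1/2 as they
# might be given in the noiseParameters column
simulated = 2.0
sigma = sigma_pErk(simulated, noise_parameter1=0.1, noise_parameter2=0.05)
print(sigma, gaussian_neg_log_likelihood(2.3, simulated, sigma))
```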


## Condition table

The condition table specifies parameters for specific simulation conditions (generally corresponding to different experimental conditions).

This is specified as a tab-separated values file with condition-specific parameters in the following way:

| conditionId | conditionName | parameterId1 | ... | parameterId${n} |
|---|---|---|---|---|
| conditionId1 | conditionName1 | NUMERIC\|parameterId | ... | ... |
| conditionId2 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |

Row names are condition names as referenced in the measurement table below. Column names are parameter names as given in the SBML model. These parameters will override any parameter values specified in the model.

Row- and column-ordering **are(?) arbitrary**. The `conditionName` column is optional. Additional columns **are|are not** allowed.
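
A minimal sketch of reading such a table with the Python standard library is shown below. The table contents, the parameter IDs `drug_dose` and `k1`, and the helper `read_conditions` are made up for illustration; the handling of open questions (such as additional columns) is an assumption, not part of the specification.

```python
import csv
import io

# Hypothetical condition table; every column other than conditionId and
# conditionName is assumed to be an SBML parameter ID.
CONDITION_TSV = (
    "conditionId\tconditionName\tdrug_dose\tk1\n"
    "c0\tcontrol\t0\t0.5\n"
    "c1\ttreated\t10\tk1_treated\n"
)


def read_conditions(text):
    """Map each conditionId to its parameter overrides."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    conditions = {}
    for row in reader:
        condition_id = row.pop("conditionId")
        row.pop("conditionName", None)  # optional column
        overrides = {}
        for parameter_id, value in row.items():
            try:
                overrides[parameter_id] = float(value)  # NUMERIC entry
            except ValueError:
                overrides[parameter_id] = value  # parameterId entry
        conditions[condition_id] = overrides
    return conditions


conditions = read_conditions(CONDITION_TSV)
print(conditions["c1"])
```

Because rows are looked up by `conditionId` and columns by name, this reading does not depend on row or column order.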


**TODO** This will no longer be the input table, what else is allowed here?

**TODO** Do we allow specifying initial conditions anywhere?

**TODO** Are additional columns allowed here? With special prefix?

**TODO** Related; Should Column-order matter?

**TODO** Do we allow non-sbml-model parameters here, as introduced in noiseParameters or observable parameters?

**TODO**: do we allow state names here, which will overwrite their dynamics?


## Measurements table

A tab-separated values file containing all measurements to be used for
model training or validation.

Expected to have the following named columns in any (but preferably this) order:

| observableId | preequilibrationConditionId | simulationConditionId | measurement | ... |
|---|---|---|---|---|
| observableId | [conditionId] | conditionId | NUMERIC | ... |
|...|...|...|...|...|

*(wrapped for readability)*

| ... | time | observableParameters | noiseParameters | observableTransformation |
|---|---|---|---|---|
| ... | NUMERIC\|'inf' | [parameterId\|NUMERIC[;parameterId\|NUMERIC][...]] | [parameterId\|NUMERIC[;parameterId\|NUMERIC][...]] | ['lin'(default)\|'log'\|'log10'] |
|...|...|...|...|...|

Fields in "[]" in the second row are optional and may be left empty.

Additional (non-standard) columns may be added.
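
The column requirement can be checked with a few lines of standard-library Python. The sketch below mirrors the set of mandatory fields used by `bin/petablint.py` in this commit; the helper name `missing_measurement_fields` and the example header are assumptions for illustration.

```python
import csv
import io

# Columns that bin/petablint.py in this commit treats as mandatory
MANDATORY_MEASUREMENT_FIELDS = {
    "observableId",
    "simulationConditionId",
    "measurement",
    "time",
}


def missing_measurement_fields(text):
    """Return the mandatory column names absent from the TSV header."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    return MANDATORY_MEASUREMENT_FIELDS - set(header)


# Hypothetical measurement table lacking the time column
example = "observableId\tsimulationConditionId\tmeasurement\npErk\tc0\t1.2\n"
print(missing_measurement_fields(example))
```

Using a set difference makes the check independent of column order and tolerant of additional non-standard columns.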

**TODO** All entries case-sensitive? column names, string values, sbml ids ?

### Detailed field description

- `observableId` [STRING, NOT NULL, REFERENCES(sbml.observableID)]

Observable name matching the ID of the respective parameter in the SBML
model (`observable_...`)

**TODO** What is the observable ID: observable_bla or only the latter part?

- `preequilibrationConditionId` [STRING OR NULL, REFERENCES(conditionsTable.conditionID)]

The `conditionId` to be used for preequilibration. E.g. for drug
treatments the model would be preequilibrated with the no-drug condition.
Empty for no preequilibration.

**TODO** Can preequilibrationConditionId column be omitted if all empty?

- `simulationConditionId` [STRING, NOT NULL, REFERENCES(conditionsTable.conditionID)]

`conditionId` as provided in the condition table, specifying the condition-specific parameters used for simulation.

- `measurement` [NUMERIC, NOT NULL]

The measured value in the same units/scale as the model output.

- `time` [NUMERIC OR STRING, NOT NULL]

Time point of the measurement in the time unit specified in the SBML model, numeric value or `inf` (lower-case) for steady-state measurements.

- `observableParameters` [STRING OR NULL]

This field allows overriding or introducing condition-specific versions of the parameters
defined in the model. The model can define observables (see above) containing place-holder parameters
which can be replaced by condition-specific dynamic or constant parameters.

Placeholder parameters must be named ... **TODO** (with `n` ranging from 0?1? to n, without gaps). If the observable specified under `observableId` contains no placeholder, this field must be empty. If it contains `n > 0` placeholders, this field must hold `n` semicolon-separated numeric values or parameter names. No trailing semicolon should be added.

Different lines for the same `observableId` may specify different parameters. This may be used to account for condition-specific or batch-specific parameters. This will translate into an extended optimization parameter vector.

**TODO** Optional if there are no observables with placeholders?

- `noiseParameters` [STRING]

The measurement standard deviation or `NaN` if the corresponding sigma is a model
parameter.

**TODO** first finalize observableParameters

- `observableTransformation` [STRING]

Transformation of the observable. `lin`, `log` or `log10`.

**TODO** Can observableTransformation column be omitted if all lin?

**TODO** What does this imply?


**TODO** Do we allow for custom cost functions or likelihood for non-normal case?
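
To make the field descriptions above more concrete, here is a hedged sketch of parsing two of the fields: `time` (numeric or lower-case `inf`) and the semicolon-separated `observableParameters` entries. The helper names are hypothetical, and the number of placeholders is passed in by the caller because the placeholder naming scheme is still marked as TODO above.

```python
import math


def parse_time(value):
    """Return the time point as a float; 'inf' marks a steady state."""
    if value == "inf":
        return math.inf
    time = float(value)
    if not math.isfinite(time):
        # Rejects 'nan' and spellings such as 'Inf' or '-inf'
        raise ValueError(f"Invalid time value: {value!r}")
    return time


def parse_overrides(value, n_placeholders):
    """Split a semicolon-separated override field and check its length."""
    entries = value.split(";") if value else []
    if "" in entries:
        raise ValueError("Empty entry or trailing semicolon in override field")
    if len(entries) != n_placeholders:
        raise ValueError(
            f"Expected {n_placeholders} entries, found {len(entries)}")
    parsed = []
    for entry in entries:
        try:
            parsed.append(float(entry))  # NUMERIC entry
        except ValueError:
            parsed.append(entry)  # parameterId entry
    return parsed


print(parse_time("inf"), parse_overrides("1.5;scaling_batch1", 2))
```

An empty field is only accepted when the observable has no placeholders, which follows the rule stated for `observableParameters`.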


## Parameter table

A tab-separated values text file containing information on the parameters. One row per parameter with arbitrary order of rows and columns:

| parameterId | parameterName | parameterScale | lowerBound | upperBound | nominalValue | estimate | priorType | priorParameters |
|---|---|---|---|---|---|---|---|---|
| STRING | STRING | log10\|lin\|**TODO** | NUMERIC | NUMERIC | NUMERIC | 0\|1 | **TODO** | **TODO** |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |

Additional columns may be added.

- `parameterId` [STRING, NOT NULL, REFERENCES(sbml.parameterId)]

The `parameterId` of the parameter described in this row. This has to be identical to the parameter IDs specified in the SBML model or in the `observableParameters` or `noiseParameters` column of the measurement table (see above).

There must exist one line for each parameterId specified in the SBML model (except for placeholder parameters, see above) or the `observableParameters` or `noiseParameters` column of the measurement table.

- `parameterName` [STRING, OPTIONAL]

Parameter name to be used e.g. for plotting etc. Can be chosen freely. May or may not coincide with the SBML parameter name.

- `parameterScale` [lin|log|log10]

Scale of the parameter. The parameters and boundaries and the nominal parameter value in the following fields are expected to be given in this scale.

**TODO** optional, NULL==lin?

- `lowerBound` [NUMERIC]

Lower bound of the parameter used for optimization.

**TODO** allow inf or empty?

- `upperBound` [NUMERIC]

Upper bound of the parameter used for optimization.

- `nominalValue`

Best values found by optimization or the value used for simulation for artificial data or fixed parameters (see `estimate`).

**TODO** "Best values found by optimization": so they will be updated every time we find something better?

- `estimate` [BOOL 0|1]

1 or 0, depending on whether the parameter is estimated (1) or set to a fixed value (0) (see `nominalValue`).

- `priorType`

Type of prior, e.g. normal or Laplace. Leave empty if there is no prior.

**TODO** optional?

**TODO** what will be allowed?

- `priorParameters`

Parameters for prior.

**TODO** Numeric or also parameter names?
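
Since bounds and nominal values are given on the scale named in `parameterScale`, a tool consuming the table typically needs to map them back to linear scale. The sketch below shows one way to do that; the function name `to_linear` and the example row are hypothetical, and reading `log` as the natural logarithm is an assumption that this document does not state explicitly.

```python
import math


def to_linear(value, scale):
    """Convert a value given on `scale` back to linear scale."""
    if scale == "lin":
        return value
    if scale == "log":  # assumed to mean the natural logarithm
        return math.exp(value)
    if scale == "log10":
        return 10.0 ** value
    raise ValueError(f"Unknown parameterScale: {scale!r}")


# Hypothetical parameter table row
row = {"parameterId": "k1", "parameterScale": "log10",
       "lowerBound": -3.0, "upperBound": 3.0, "nominalValue": 0.5}

for field in ("lowerBound", "upperBound", "nominalValue"):
    print(field, to_linear(row[field], row["parameterScale"]))
```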


## Extensions

### Parameter table

TODO:

Extra columns
- `hierarchicalOptimization` (optional)

hierarchicalOptimization: 1 if the parameter is optimized using a hierarchical optimization approach, 0 otherwise.
2 changes: 2 additions & 0 deletions petab/__init__.py
@@ -0,0 +1,2 @@
from .core import *
from .lint import *