Database of coordinates #4

@lilyminium

We need a database of coordinates. It would hold the starting structures of boxes (i.e. "pre-equilibrated" structures) wherever these are available.

Aims

  • A new class to represent a structure. It should hold coordinates and mapped_smiles, and ideally metadata such as the thermodynamic state the structure was computed at.
  • We should be able to store multiple coordinates of the same box.
  • A human-readable overview and an interface that make it easy to inspect and modify what's in the database. Ideally it would be searchable by molecule (SMILES), thermodynamic state, force field, etc.
  • Logic to evaluate the energy of a particular box with a given force field, so we can select the lowest-energy one.
  • Logic to select based on metadata, e.g. taking a box generated by the previous iteration of an FF fit as our starting point.
  • Easy conversion, saving and retrieval of coordinates via API
  • PDB compatibility
  • Ideally, distribution via GitHub or Zenodo, or as a package that people can download and contribute to.
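
To make the first few aims concrete, here is a minimal sketch of what a structure record and the two selection steps (by metadata, then by lowest energy) might look like. Every name in it (`BoxStructure`, `select`, `lowest_energy`, the field names) is hypothetical and not an existing API; the coordinates and SMILES are toy placeholders, and `toy_energy` merely stands in for a real force field evaluation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class BoxStructure:
    """Hypothetical record for one stored box; all field names are assumptions."""

    mapped_smiles: Tuple[str, ...]       # one mapped SMILES per component
    counts: Tuple[int, ...]              # number of molecules of each component
    coordinates: Tuple[Point, ...]       # one (x, y, z) per atom
    temperature_k: Optional[float] = None
    pressure_atm: Optional[float] = None
    force_field: Optional[str] = None    # label of the FF that generated the box
    extra: dict = field(default_factory=dict)

    @property
    def n_molecules(self) -> int:
        return sum(self.counts)


def select(structures: Sequence[BoxStructure], *, force_field=None, n_molecules=None):
    """Keep only the structures matching every metadata filter that was given."""
    return [
        s for s in structures
        if (force_field is None or s.force_field == force_field)
        and (n_molecules is None or s.n_molecules == n_molecules)
    ]


def lowest_energy(structures: Sequence[BoxStructure],
                  energy_fn: Callable[[BoxStructure], float]) -> BoxStructure:
    """Return the structure for which the supplied energy function is smallest."""
    return min(structures, key=energy_fn)


def toy_energy(box: BoxStructure) -> float:
    """Placeholder only: a real implementation would evaluate a force field."""
    return sum(x * x + y * y + z * z for x, y, z in box.coordinates)


# Toy data: placeholder SMILES and single-point "coordinates".
boxes = [
    BoxStructure(("[CH3:1][CH2:2][OH:3]",), (1000,), ((1.0, 0.0, 0.0),), force_field="Sage 2.1"),
    BoxStructure(("[CH3:1][CH2:2][OH:3]",), (1000,), ((0.5, 0.0, 0.0),), force_field="Sage 2.1"),
    BoxStructure(("[CH3:1][CH2:2][OH:3]",), (500,), ((0.1, 0.0, 0.0),), force_field="other-ff"),
]

candidates = select(boxes, force_field="Sage 2.1", n_molecules=1000)
best = lowest_energy(candidates, toy_energy)
```

Passing the energy function in as an argument keeps the record class independent of any particular simulation engine, which may make it easier to swap force fields between fitting iterations.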

User stories

  • As someone preparing a fit with limited disk space, I want to select a subset of an existing database for my starting structures. In my database containing coordinates generated by multiple force fields, I only want the ones that were generated with Sage 2.1 and contain 1000 molecules.
  • As someone preparing a test fit with limited disk space, I want to select a subset of an existing database for my starting structures. I only want to select the boxes that correspond to my small test dataset.
  • As a forgetful human being, I can't remember if this database contains a box of ethanol, and I want to easily figure that out.
  • As an automated fitting routine, I need to easily query whether a box is present in the database, or whether I need to pre-equilibrate the structure first.
  • As someone experimenting with a new dataset, I want to easily work out how much of this dataset already has structures in the database, and which new boxes need to be simulated.
  • As someone running an FF fit with 15 iterations, I want to start each simulation from the coordinates generated by the previous iteration's FF.
  • As someone starting a new fit, I want to start each simulation from the coordinates that have the lowest energy under the FF version I'm using at the time.
  • As someone who has simulated many systems in GROMACS, I want to contribute some structures back to the database. I've saved them as PDB files, and I have the valid SMILES needed to build a topology.
  • As someone accustomed to working with PDBs, I want an easy way to look at coordinates as a PDB.
  • As someone who runs lots of fits, I want to easily combine two databases but deduplicate identical boxes.
  • As someone who's finished a fit, I want to easily upload my database of stored structures somewhere.
  • As a good HPC citizen, I want to continuously add to the database, without causing issues with limits on the number of files.
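
Several of these stories (checking whether a box exists, finding which boxes of a new dataset still need simulating, merging databases without duplicates, and staying within file-count limits) could be served by keeping everything in a single SQLite file. The sketch below is one possible shape under that assumption, not a committed design; the table layout and the function names (`open_db`, `add_box`, `has_box`, `missing_boxes`, `merge`) are all hypothetical, and boxes are identified here simply by a `{smiles: count}` mapping.

```python
import hashlib
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS boxes (
    content_hash TEXT PRIMARY KEY,  -- identical boxes collapse to one row
    composition  TEXT NOT NULL,     -- canonical JSON of {smiles: count}
    force_field  TEXT,
    coordinates  TEXT NOT NULL      -- JSON list of [x, y, z]
)
"""


def composition_key(components: dict) -> str:
    """Canonical, order-independent text key for a box composition."""
    return json.dumps(sorted(components.items()))


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the single-file database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_box(conn, components, coordinates, force_field=None) -> bool:
    """Insert a box; return False if an identical box was already stored."""
    key = composition_key(components)
    coords = json.dumps(coordinates)
    digest = hashlib.sha256(f"{key}|{force_field}|{coords}".encode()).hexdigest()
    with conn:  # commits on success
        cur = conn.execute(
            "INSERT OR IGNORE INTO boxes VALUES (?, ?, ?, ?)",
            (digest, key, force_field, coords),
        )
    return cur.rowcount == 1


def has_box(conn, components) -> bool:
    """True if at least one structure with this composition is stored."""
    row = conn.execute(
        "SELECT 1 FROM boxes WHERE composition = ? LIMIT 1",
        (composition_key(components),),
    ).fetchone()
    return row is not None


def missing_boxes(conn, dataset):
    """Compositions from a dataset that still need to be pre-equilibrated."""
    return [c for c in dataset if not has_box(conn, c)]


def count_boxes(conn) -> int:
    return conn.execute("SELECT COUNT(*) FROM boxes").fetchone()[0]


def merge(target, source) -> int:
    """Copy all boxes from source into target, skipping exact duplicates."""
    rows = source.execute(
        "SELECT content_hash, composition, force_field, coordinates FROM boxes"
    ).fetchall()
    before = count_boxes(target)
    with target:
        target.executemany("INSERT OR IGNORE INTO boxes VALUES (?, ?, ?, ?)", rows)
    return count_boxes(target) - before


# Toy usage with placeholder coordinates.
db = open_db()
add_box(db, {"CCO": 1000}, [[0.0, 0.0, 0.0]], force_field="Sage 2.1")
todo = missing_boxes(db, [{"CCO": 1000}, {"O": 1000}])
```

Because the primary key is a hash of the box contents, re-adding or merging an identical box is a no-op, while several distinct coordinate sets for the same composition can still coexist.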
