Description
Parsers for several file formats are needed, especially for processing CSV files that contain several trajectories.
| File format | Description | Priority |
|---|---|---|
| .xyz | file with several atoms and time points | 1 |
| .csv | CSV file with several trajectories | 2 |
| LAMMPS | molecular dynamics | 2 |
| .pdb | Protein Data Bank | 3 |
.xyz file
Nparticles [integer]
comment [character]
X Y Z [repeat Nparticles]
[repeat Nframes]
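The layout above can be read frame by frame. The following is a minimal sketch, not a definitive implementation: the function name `read_xyz` is hypothetical, and it assumes whitespace-separated fields. It takes the last three fields of each particle line as coordinates, so a leading element symbol (common in XYZ files, but not part of the definition above) would also be tolerated.

```python
import io


def read_xyz(stream):
    """Return a list of frames; each frame is a list of (x, y, z) tuples."""
    frames = []
    while True:
        header = stream.readline()
        if not header.strip():
            break  # end of file, or a trailing blank line
        n_particles = int(header)
        stream.readline()  # comment line, ignored in this sketch
        frame = []
        for _ in range(n_particles):
            fields = stream.readline().split()
            # Assumption: the coordinates are the last three fields.
            x, y, z = (float(value) for value in fields[-3:])
            frame.append((x, y, z))
        frames.append(frame)
    return frames


# Two frames with two particles each, built in memory for illustration.
example = io.StringIO(
    "2\nframe 0\n0.0 0.0 0.0\n1.0 0.0 0.0\n"
    "2\nframe 1\n0.1 0.0 0.0\n1.0 0.2 0.0\n"
)
frames = read_xyz(example)
```

Any open text file can be passed in place of the `io.StringIO` object.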
CSV with several trajectories - format definition
The CSV should contain 5 columns: time t, 3 spatial components (x, y, z), and the trajectory identifier id.
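A possible way to split such a file into individual trajectories is sketched below. It assumes a comma-delimited file whose header row names the columns `t`, `x`, `y`, `z` and `id`; neither the delimiter nor the header is fixed by the definition above, and the function name `read_trajectories` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_trajectories(stream):
    """Group the rows of a (t, x, y, z, id) CSV into one trajectory per id."""
    trajectories = defaultdict(list)
    for row in csv.DictReader(stream):
        point = tuple(float(row[key]) for key in ("t", "x", "y", "z"))
        trajectories[row["id"]].append(point)
    # Rows of different trajectories may be interleaved, so sort each by time.
    for points in trajectories.values():
        points.sort(key=lambda point: point[0])
    return dict(trajectories)


# Two interleaved trajectories, "a" and "b", built in memory for illustration.
example = io.StringIO(
    "t,x,y,z,id\n"
    "0.0,0.0,0.0,0.0,a\n"
    "0.0,1.0,1.0,1.0,b\n"
    "1.0,0.5,0.0,0.0,a\n"
)
trajectories = read_trajectories(example)
```

Each value in the returned dictionary is a time-ordered list of `(t, x, y, z)` tuples for one identifier.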
LAMMPS data file format
LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is a molecular dynamics program from Sandia National Laboratories.
More details about the file format: https://docs.lammps.org/read_data.html
A LAMMPS dump file written in YAML has the following structure:
---
creator: LAMMPS
timestep: 0
units: lj
time: 0
natoms: 3
boundary: [ p, p, p, p, p, p, ]
thermo:
- keywords: [ Step, Temp, E_pair, E_mol, TotEng, Press, ]
- data: [ 0, 0, -27093.472213010766, 0, 0, 0, ]
box:
- [ 0, 16.795961913825074 ]
- [ 0, 16.795961913825074 ]
- [ 0, 16.795961913825074 ]
- [ 0, 0, 0 ]
keywords: [ id, type, x, y, z, vx, vy, vz, ix, iy, iz, ]
data:
- [ 1 , 1 , 0.000000e+00 , 0.000000e+00 , 0.000000e+00 , -1.841579e-01 , -9.710036e-01 , -2.934617e+00 , 0 , 0 , 0, ]
- [ 2 , 1 , 8.397981e-01 , 8.397981e-01 , 0.000000e+00 , -1.799591e+00 , 2.127197e+00 , 2.298572e+00 , 0 , 0 , 0, ]
- [ 3 , 1 , 8.397981e-01 , 0.000000e+00 , 8.397981e-01 , -1.807682e+00 , -9.585130e-01 , 1.605884e+00 , 0 , 0 , 0, ]
---
timestep: 100
...
---
A parser for this file format is straightforward with the yaml.load_all() function.
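As a rough sketch of that idea, the code below turns the per-timestep YAML documents into per-atom trajectories. The function names are hypothetical. It assumes every document carries top-level `keywords` and `data` entries as in the example above, and it falls back to `timestep` when `time` is absent. It uses PyYAML's `yaml.safe_load_all()`, the safe-loader variant of `yaml.load_all()`.

```python
def frames_to_trajectories(documents):
    """Map atom id -> list of (t, x, y, z), one entry per YAML document."""
    trajectories = {}
    for doc in documents:
        if not doc:
            continue  # a trailing '---' yields an empty document
        columns = doc["keywords"]
        i_id, i_x, i_y, i_z = (columns.index(k) for k in ("id", "x", "y", "z"))
        # Assumption: use 'time' when present, otherwise the timestep number.
        t = doc.get("time", doc.get("timestep"))
        for row in doc["data"]:
            point = (t, row[i_x], row[i_y], row[i_z])
            trajectories.setdefault(row[i_id], []).append(point)
    return trajectories


def read_lammps_yaml(path):
    """Read a YAML dump file from disk (requires the PyYAML package)."""
    import yaml  # imported here so the helper above works without PyYAML

    with open(path) as handle:
        return frames_to_trajectories(yaml.safe_load_all(handle))


# Already-parsed documents, standing in for the output of yaml.safe_load_all().
documents = [
    {"timestep": 0, "time": 0, "keywords": ["id", "type", "x", "y", "z"],
     "data": [[1, 1, 0.0, 0.0, 0.0], [2, 1, 0.5, 0.5, 0.0]]},
    {"timestep": 100, "time": 100, "keywords": ["id", "type", "x", "y", "z"],
     "data": [[1, 1, 0.1, 0.0, 0.0], [2, 1, 0.5, 0.6, 0.0]]},
]
trajectories = frames_to_trajectories(documents)
```

Looking up the column positions from `keywords` rather than hard-coding them keeps the sketch independent of which per-atom quantities were dumped.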
Protein Data Bank (PDB) format
Standard file format for protein structures. Each PDB file contains several atoms and can hold either a single snapshot of the system at one time step or several trajectories, so we need to process several PDB files at once to extract trajectories.
A possible workflow would be:
- Read each pdb file and extract the trajectories per atom
- Write a CSV file using the format (t, x, y, z, id), where id is the atom identifier.
- Use the CSV file to compute the features using trajpy
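The first two steps of this workflow might look like the sketch below. It assumes that each PDB file holds one snapshot, that the position of a file in the input sequence serves as the time t, and that the atom serial number serves as id. The coordinates are read from the fixed columns of ATOM/HETATM records (serial in columns 7-11, x, y, z in columns 31-54). All function names are hypothetical, and `make_atom_line` exists only to build example input.

```python
import csv
import io


def pdb_atoms(lines):
    """Yield (serial, x, y, z) for every ATOM/HETATM record."""
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            yield (int(line[6:11]), float(line[30:38]),
                   float(line[38:46]), float(line[46:54]))


def snapshots_to_csv(snapshots, out):
    """Write (t, x, y, z, id) rows; t is the index of the snapshot."""
    writer = csv.writer(out)
    writer.writerow(["t", "x", "y", "z", "id"])
    for t, lines in enumerate(snapshots):
        for serial, x, y, z in pdb_atoms(lines):
            writer.writerow([t, x, y, z, serial])


def make_atom_line(serial, x, y, z):
    """Build a fixed-column ATOM record (for the example only)."""
    return "ATOM  {:5d} {:<4s} {:3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, "CA", "ALA", "A", 1, x, y, z)


# Two snapshots of a two-atom system, built in memory for illustration.
snapshots = [
    [make_atom_line(1, 0.0, 0.0, 0.0), make_atom_line(2, 1.5, 0.0, 0.0)],
    [make_atom_line(1, 0.25, 0.0, 0.0), make_atom_line(2, 1.5, 0.5, 0.0)],
]
out = io.StringIO()
snapshots_to_csv(snapshots, out)
rows = list(csv.reader(io.StringIO(out.getvalue())))
```

An open PDB file can be passed as a snapshot, since it is also an iterable of lines.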
More information about pdb file format: https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format)