pandas-charm is a small Python package for getting character matrices (alignments) into and out of pandas. Use this library to make pandas interoperable with BioPython and DendroPy.
Convert between the following objects:
- BioPython MultipleSeqAlignment <-> pandas DataFrame
- DendroPy CharacterMatrix <-> pandas DataFrame
- "Sequence dictionary" <-> pandas DataFrame
The code has been tested with Python 2.7, 3.5 and 3.6.
Source repository: https://github.com/jmenglund/pandas-charm
For most users, the easiest way is probably to install the latest version hosted on PyPI:
$ pip install pandas-charm
The project is hosted at https://github.com/jmenglund/pandas-charm and can also be installed using git:
$ git clone https://github.com/jmenglund/pandas-charm.git
$ cd pandas-charm
$ python setup.py install
You may consider installing pandas-charm and its required Python packages within a virtual environment in order to avoid cluttering your system's Python path. See for example the environment management system conda or the package virtualenv.
Testing is carried out with pytest:
$ pytest -v test_pandascharm.py
Test coverage can be calculated with Coverage.py using the following commands:
$ coverage run -m pytest
$ coverage report -m pandascharm.py
The code follows the style conventions in PEP8, which can be checked with pycodestyle:
$ pycodestyle pandascharm.py test_pandascharm.py setup.py
The following examples show how to use pandas-charm. The examples are written in Python 3, but pandas-charm should also work with Python 2.7. You need to install BioPython and/or DendroPy manually before you start:
$ pip install biopython
$ pip install dendropy
>>> import pandas as pd
>>> import pandascharm as pc
>>> import dendropy
>>> dna_string = '3 5\nt1 TCCAA\nt2 TGCAA\nt3 TG-AA\n'
>>> print(dna_string)
3 5
t1 TCCAA
t2 TGCAA
t3 TG-AA
>>> matrix = dendropy.DnaCharacterMatrix.get(
... data=dna_string, schema='phylip')
>>> df = pc.from_charmatrix(matrix)
>>> df
t1 t2 t3
0 T T T
1 C G G
2 C C -
3 A A A
4 A A A
By default, characters are stored as rows and sequences as columns in the DataFrame. If you want rows to hold sequences, just transpose the matrix in pandas:
>>> df.transpose()
0 1 2 3 4
t1 T C C A A
t2 T G C A A
t3 T G - A A
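Because characters are rows and sequences are columns, ordinary pandas operations can be applied site by site. The following is a minimal sketch using only pandas (it does not call pandas-charm); the variable names are illustrative, and treating the gap character as a state is an assumption of this example:

```python
import pandas as pd

# Same toy alignment as above: characters as rows, sequences as columns.
df = pd.DataFrame({
    't1': ['T', 'C', 'C', 'A', 'A'],
    't2': ['T', 'G', 'C', 'A', 'A'],
    't3': ['T', 'G', '-', 'A', 'A']})

# A site is variable if its row holds more than one distinct character.
# Note: the gap character '-' is counted as a state in this sketch.
variable_sites = df[df.nunique(axis=1) > 1]

# Sequences as rows, restricted to the variable sites.
by_sequence = variable_sites.transpose()
print(by_sequence)
```

Filtering first and transposing afterwards keeps the original site positions as labels, so the selected sites can still be traced back to the full alignment.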
>>> import pandas as pd
>>> import pandascharm as pc
>>> import dendropy
>>> df = pd.DataFrame({
... 't1': ['T', 'C', 'C', 'A', 'A'],
... 't2': ['T', 'G', 'C', 'A', 'A'],
... 't3': ['T', 'G', '-', 'A', 'A']})
>>> df
t1 t2 t3
0 T T T
1 C G G
2 C C -
3 A A A
4 A A A
>>> matrix = pc.to_charmatrix(df, data_type='dna')
>>> print(matrix.as_string('phylip'))
3 5
t1 TCCAA
t2 TGCAA
t3 TG-AA
>>> from io import StringIO
>>> import pandas as pd
>>> import pandascharm as pc
>>> from Bio import AlignIO
>>> dna_string = '3 5\nt1 TCCAA\nt2 TGCAA\nt3 TG-AA\n'
>>> f = StringIO(dna_string) # make the string a file-like object
>>> alignment = AlignIO.read(f, 'phylip-relaxed')
>>> print(alignment)
SingleLetterAlphabet() alignment with 3 rows and 5 columns
TCCAA t1
TGCAA t2
TG-AA t3
>>> df = pc.from_bioalignment(alignment)
>>> df
t1 t2 t3
0 T T T
1 C G G
2 C C -
3 A A A
4 A A A
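Once an alignment is in a DataFrame, gaps can be inspected with plain pandas before converting the data back. This is a small sketch that does not use pandas-charm itself; it assumes that gaps are encoded as '-' as in the example data, and the variable names are illustrative:

```python
import pandas as pd

# Toy alignment in the default layout (sites as rows, sequences as columns).
df = pd.DataFrame({
    't1': ['T', 'C', 'C', 'A', 'A'],
    't2': ['T', 'G', 'C', 'A', 'A'],
    't3': ['T', 'G', '-', 'A', 'A']})

# Boolean mask marking gap characters (assumed to be '-').
is_gap = df == '-'

# Number of gaps in each sequence (one value per column).
gaps_per_sequence = is_gap.sum()

# Keep only the sites where no sequence has a gap.
ungapped = df[~is_gap.any(axis=1)]
print(gaps_per_sequence)
print(ungapped)
```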
>>> import pandas as pd
>>> import pandascharm as pc
>>> import Bio
>>> df = pd.DataFrame({
... 't1': ['T', 'C', 'C', 'A', 'A'],
... 't2': ['T', 'G', 'C', 'A', 'A'],
... 't3': ['T', 'G', '-', 'A', 'A']})
>>> df
t1 t2 t3
0 T T T
1 C G G
2 C C -
3 A A A
4 A A A
>>> alignment = pc.to_bioalignment(df, alphabet='generic_dna')
>>> print(alignment)
SingleLetterAlphabet() alignment with 3 rows and 5 columns
TCCAA t1
TGCAA t2
TG-AA t3
>>> import pandas as pd
>>> import pandascharm as pc
>>> d = {
... 't1': 'TCCAA',
... 't2': 'TGCAA',
... 't3': 'TG-AA'
... }
>>> df = pc.from_sequence_dict(d)
>>> df
t1 t2 t3
0 T T T
1 C G G
2 C C -
3 A A A
4 A A A
>>> import pandas as pd
>>> import pandascharm as pc
>>> df = pd.DataFrame({
... 't1': ['T', 'C', 'C', 'A', 'A'],
... 't2': ['T', 'G', 'C', 'A', 'A'],
... 't3': ['T', 'G', '-', 'A', 'A']})
>>> pc.to_sequence_dict(df)
{'t1': 'TCCAA', 't2': 'TGCAA', 't3': 'TG-AA'}
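To make the relationship between a sequence dictionary and the DataFrame layout concrete, here is a minimal pure-pandas sketch. The helpers dict_to_frame and frame_to_dict are hypothetical names introduced for illustration; they are not part of pandas-charm and are not meant to show how the package implements from_sequence_dict or to_sequence_dict:

```python
import pandas as pd


def dict_to_frame(seqs):
    """Hypothetical helper: one column per sequence, one row per character."""
    return pd.DataFrame({name: list(seq) for name, seq in seqs.items()})


def frame_to_dict(frame):
    """Hypothetical helper: join each column back into a single string."""
    return {name: ''.join(frame[name]) for name in frame.columns}


d = {'t1': 'TCCAA', 't2': 'TGCAA', 't3': 'TG-AA'}
df = dict_to_frame(d)
roundtrip = frame_to_dict(df)
print(roundtrip == d)
```

The sketch assumes that all sequences have the same length, as in an alignment; pandas raises an error when building the DataFrame otherwise.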
pandas-charm got its name from the pandas library plus an acronym for CHARacter Matrix.
pandas-charm is distributed under the MIT license.
If you use results produced with this package in a scientific publication, please mention the package name in the text and cite the Zenodo DOI of this project.
Choose your preferred citation style in the "Cite as" section on the Zenodo page.
Markus Englund, orcid.org/0000-0003-1688-7112