Exposed Fast C counter to python #2

Merged
blkperl merged 12 commits into master from counterc
Jun 16, 2014

ptseng commented Jun 11, 2014

The counter was written in C.
It was compiled to a shared library 'counterc.so'
The shared library was exposed to python using ctypes.
The C function returns a C struct which is eventually made into a Python dictionary.
main.py has been refactored to use the new counter.
bench.py is provided to test the counter's speed.
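
To illustrate the struct-to-dictionary step described above, here is a minimal sketch. The struct layout (`CodonCounts` with a 64-slot array), the codon ordering, and the `count_codons` function name are assumptions for illustration; the PR text does not show the actual C interface. A stand-in value replaces the C call so the sketch runs without `counterc.so`.

```python
import ctypes
from itertools import product

# All 64 codons in a fixed order. The index of a codon in this list is
# assumed to match its slot in the C array (hypothetical layout).
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


class CodonCounts(ctypes.Structure):
    """Hypothetical mirror of the C struct returned by the counter."""
    _fields_ = [("counts", ctypes.c_uint64 * 64)]


def counts_to_dict(result):
    """Copy the fixed-size C array into a plain Python dictionary."""
    return {codon: int(result.counts[i]) for i, codon in enumerate(CODONS)}


# With the real library, the call might look roughly like this
# (function name and signature are assumptions, not taken from the PR):
#   lib = ctypes.CDLL("./counterc.so")
#   lib.count_codons.argtypes = [ctypes.c_char_p]
#   lib.count_codons.restype = CodonCounts
#   result = lib.count_codons(b"ATGAAA")

# Stand-in for the C call so the sketch runs without the shared library.
result = CodonCounts()
result.counts[CODONS.index("ATG")] = 114
print(counts_to_dict(result)["ATG"])
```

Returning the struct by value keeps memory ownership simple on the Python side, since ctypes copies the result and nothing has to be freed.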

The Makefile has not been tested on Linux; I would appreciate some testing there.
The C-algorithms library is required and is expected under --prefix="$HOME".

Here is some sample output.

gzip -cd complete.1.genomic.gbff.gz | python main.py -f genbank
[
  {
    "annotations": "{'comment': '\\nContig AARG01000015 was replaced by the complete pFN3 plasmid in\\nCP000710.', 'sequence_version': 1, 'source': 'Fusobacterium nucleatum subsp. polymorphum ATCC 10953', 'taxonomy': ['Bacteria', 'Fusobacteria', 'Fusobacteriales', 'Fusobacteriaceae', 'Fusobacterium'], 'keywords': ['WGS', 'RefSeq'], 'references': [Reference(title='Direct Submission', ...)], 'accessions': ['NZ_AARG01000001', 'NZ_AARG00000000'], 'data_file_division': 'BCT', 'date': '17-FEB-2009', 'organism': 'Fusobacterium nucleatum subsp. polymorphum ATCC 10953', 'gi': '110592360'}", 
    "codoncount": {
      "AAA": 361, 
      "AAC": 52, 
      "AAG": 183, 
      "AAT": 258, 
      "ACA": 58, 
      "ACC": 13, 
      "ACG": 6, 
      "ACT": 67, 
      "AGA": 99, 
      "AGC": 35, 
      "AGG": 44, 
      "AGT": 84, 
      "ATA": 241, 
      "ATC": 39, 
      "ATG": 114, 
      "ATT": 204, 
      "CAA": 91, 
      "CAC": 15, 
      "CAG": 50, 
      "CAT": 43, 
      "CCA": 18, 
      "CCC": 5, 
      "CCG": 0, 
      "CCT": 20, 
      "CGA": 7, 
      "CGC": 1, 
      "CGG": 0, 
      "CGT": 3, 
...
cat cow.seq | python bench.py 
{'CTT': 578526, 'TAG': 385361, 'ACA': 570762, 'ACG': 59782, 'ATC': 382642, 'AAC': 422809, 'ATA': 646304, 'AGG': 458186, 'CCT': 455130, 'ACT': 463769, 'AGC': 365433, 'AAG': 577313, 'AGA': 627866, 'CAT': 531415, 'AAT': 770510, 'ATT': 773553, 'CTG': 521121, 'CTA': 383136, 'CTC': 432919, 'CAC': 387600, 'AAA': 1160703, 'CCG': 58156, 'AGT': 463957, 'CCA': 478747, 'CAA': 546849, 'CCC': 310692, 'TAT': 647698, 'GGT': 306196, 'TGT': 574118, 'CGA': 52125, 'CAG': 520075, 'TCT': 628490, 'GAT': 382211, 'CGG': 57943, 'TTT': 1169127, 'TGC': 381261, 'GGG': 310647, 'TGA': 556172, 'GGA': 403467, 'TGG': 479804, 'GGC': 285218, 'TAC': 338345, 'TTC': 572415, 'TCG': 52363, 'TTA': 653679, 'TTG': 551344, 'TCC': 402091, 'ACC': 306416, 'TAA': 650744, 'GCA': 380165, 'GTA': 339118, 'GCC': 284110, 'GTC': 251551, 'GCG': 50561, 'GTG': 389988, 'GAG': 434060, 'GTT': 426786, 'GCT': 365314, 'GAC': 249625, 'CGT': 59720, 'GAA': 571097, 'TCA': 557273, 'ATG': 531010, 'CGC': 50561}
Tokenize and Count in 594 ms
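
For comparison, a pure-Python baseline for the same measurement might look like the sketch below. It assumes the counter splits the input into non-overlapping triplets, which the PR text does not confirm, and it uses a short placeholder sequence rather than cow.seq. `count_codons` is a hypothetical name.

```python
from collections import Counter
from timeit import default_timer


def count_codons(sequence):
    """Split the sequence into non-overlapping triplets and count them."""
    usable = len(sequence) - len(sequence) % 3  # drop a trailing partial codon
    return Counter(sequence[i:i + 3] for i in range(0, usable, 3))


sequence = "ATGAAAATGCCC" * 1000  # placeholder data, not cow.seq

start = default_timer()
counts = count_codons(sequence)
elapsed_ms = (default_timer() - start) * 1000.0

print(dict(counts))
print("Tokenize and Count in %d ms" % elapsed_ms)
```

Running both versions on the same input gives a rough idea of how much the C implementation buys over the standard library.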

@ptseng ptseng assigned blkperl and rkonell and unassigned blkperl and rkonell Jun 11, 2014
main.py Outdated
Contributor

Pretty sure sort_keys makes it slower, and the order won't matter when we send it to Postgres.

EDIT: indent also makes the resulting STDOUT bigger.

Member Author

I have turned pretty-printing (PP) and key sorting on for debugging purposes. They will be turned off when we put this into production.

Contributor

ok
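
The trade-off discussed in this thread can be seen in a small sketch with the standard `json` module. The record below reuses three counts from the sample output; the exact arguments used in main.py are not shown in this conversation, and the `separators` argument is an addition here, not something the PR mentions.

```python
import json

record = {"codoncount": {"AAC": 52, "AAA": 361, "AAG": 183}}

# Debug form: stable key order and indentation, easier to read and diff.
pretty = json.dumps(record, sort_keys=True, indent=2)

# Compact form: no sorting pass, no whitespace between tokens.
compact = json.dumps(record, separators=(",", ":"))

print(len(pretty), len(compact))

# Both forms decode to the same data, so the consumer is unaffected.
assert json.loads(pretty) == json.loads(compact)
```

Since both strings decode to the same object, switching to the compact form for production should not change what ends up in the database.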

blkperl commented Jun 11, 2014

Please fix the Travis tests.

ptseng commented Jun 11, 2014

I'll give it a shot in a few.

ptseng commented Jun 12, 2014

FINALLY

ptseng commented Jun 14, 2014

Some late night hacking here... I think I may have found a way to speed up GenBank parsing with BioPython: switch from the current CPython interpreter to the PyPy JIT interpreter. Major speed improvements, no code changes required. Just a switch in interpreters.
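
Because the switch happens outside the code, it can be useful to confirm which interpreter actually ran a benchmark. A minimal sketch using the standard `platform` module follows; `interpreter_summary` is a hypothetical helper, not part of this PR.

```python
import platform
import sys


def interpreter_summary():
    """Return the implementation name and the Python language version."""
    return "%s, Python %s" % (
        platform.python_implementation(),  # e.g. "CPython" or "PyPy"
        platform.python_version(),
    )


# Write to stderr so the JSON written to stdout stays clean.
sys.stderr.write(interpreter_summary() + "\n")
```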

ptseng commented Jun 14, 2014

time gzip -cd complete.1.genomic.gbff.gz | python main.py -f genbank

On Python 2.7.7

real    0m12.059s
user    0m11.994s
sys     0m0.111s

On Python 2.7.6 (32f35069a16d, Jun 06 2014, 20:12:47)
[PyPy 2.3.1 with GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)]

time gzip -cd complete.1.genomic.gbff.gz | python main.py -f genbank
real    0m4.990s
user    0m4.429s
sys     0m0.253s

blkperl commented Jun 16, 2014

There are similar speedups on CentOS with PyPy.

PyPy 2.3.1 on CentOS 6

(.venv-pypy)$ python --version
Python 2.7.6 (32f35069a16d819b58c1b6efb17c44e3e53397b2, Jun 16 2014, 18:15:00)
[PyPy 2.3.1 with GCC 4.9.0]

(.venv-pypy)bash-4.1$ time gzip -cd ~/refseq/release/complete/complete.1.genomic.gbff.gz | python main.py -f genbank > /disk/scratch/metadata1.json

real    0m6.557s
user    0m6.540s
sys     0m0.305s

Python 2.7.7 on CentOS 6

(.venv277)$ python --version
Python 2.7.7


(.venv277)$ time gzip -cd ~/refseq/release/complete/complete.1.genomic.gbff.gz | python main.py -f genbank > /disk/scratch/metadata2.json

real    0m14.416s
user    0m14.524s
sys     0m0.218s
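
As a quick check of the numbers reported in this thread, the ratio of the "real" times works out to roughly 2.4x on ptseng's machine and roughly 2.2x on CentOS 6. These are single runs, so the figures are indicative only. The dictionary labels below are just names for the two runs.

```python
# Wall-clock ("real") times in seconds, copied from the runs in this thread.
timings = {
    "ptseng run": {"cpython": 12.059, "pypy": 4.990},
    "blkperl CentOS 6 run": {"cpython": 14.416, "pypy": 6.557},
}

speedups = {}
for host, t in sorted(timings.items()):
    # Speedup is CPython time divided by PyPy time for the same command.
    speedups[host] = t["cpython"] / t["pypy"]
    print("%s: %.2fx faster under PyPy" % (host, speedups[host]))
```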

blkperl added a commit that referenced this pull request Jun 16, 2014
Exposed Fast C counter to python
@blkperl blkperl merged commit f67757b into master Jun 16, 2014
@blkperl blkperl deleted the counterc branch June 16, 2014 20:50