main.py
Pretty sure sort_keys makes it slower, and the key order won't matter once we send it to Postgres.
EDIT: indent also makes the resulting output on STDOUT bigger.
I have turned pretty-printing and key sorting on for debugging purposes. Both will be turned off when we put this into production.
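For reference, a minimal sketch of the size difference being discussed. The record below is made up for illustration and is not taken from main.py; only `json.dumps` and its `sort_keys`, `indent`, and `separators` arguments are real.

```python
import json

# Hypothetical metadata record, invented for this example.
record = {
    "organism": "Example organism",
    "accession": "EXAMPLE_0001",
    "features": [{"type": "gene", "start": 1, "end": 90}],
}

# Debug form: sorted keys and indentation, easy to read and diff.
debug = json.dumps(record, sort_keys=True, indent=4)

# Production form: no sorting, no indentation, tightest separators.
compact = json.dumps(record, separators=(",", ":"))

print(len(debug), len(compact))
```

Both strings decode to the same object, so the choice only affects output size and the time spent serializing.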
Please fix the Travis tests.
I'll give it a shot in a few.
FINALLY
Some late night hacking here... I think I may have found a way to speed up GenBank parsing with Biopython: switch from the current CPython interpreter to PyPy and its JIT compiler. Major speed improvements, no code changes required. Just a switch in interpreters.
time gzip -cd complete.1.genomic.gbff.gz | python main.py -f genbank
On Python 2.7.7
On Python 2.7.6 (32f35069a16d, Jun 06 2014, 20:12:47)
time gzip -cd complete.1.genomic.gbff.gz | python main.py -f genbank
real 0m4.990s
user 0m4.429s
sys 0m0.253s
There are similar speedups on CentOS with PyPy.
PyPy 2.3.1 on CentOS 6
(.venv-pypy)$ python --version
Python 2.7.6 (32f35069a16d819b58c1b6efb17c44e3e53397b2, Jun 16 2014, 18:15:00)
[PyPy 2.3.1 with GCC 4.9.0]
(.venv-pypy)bash-4.1$ time gzip -cd ~/refseq/release/complete/complete.1.genomic.gbff.gz | python main.py -f genbank > /disk/scratch/metadata1.json
real 0m6.557s
user 0m6.540s
sys 0m0.305s
Python 2.7.7 on CentOS 6
(.venv277)$ python --version
Python 2.7.7
(.venv277)$ time gzip -cd ~/refseq/release/complete/complete.1.genomic.gbff.gz | python main.py -f genbank > /disk/scratch/metadata2.json
real 0m14.416s
user 0m14.524s
sys 0m0.218s
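To put a number on the CentOS 6 runs above, here is a small sketch that converts the two `real` times into a speedup factor and reports which interpreter is executing it. The helper name `real_seconds` is invented for this example; `platform.python_implementation()` is the standard-library call and returns a name such as "CPython" or "PyPy".

```python
import platform


def real_seconds(stamp):
    """Convert a `time` stamp such as '0m14.416s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# `real` times quoted in the CentOS 6 runs above.
cpython = real_seconds("0m14.416s")
pypy = real_seconds("0m6.557s")
speedup = cpython / pypy

print("running on", platform.python_implementation())
print("speedup: %.1fx" % speedup)
```

On those two measurements the ratio works out to roughly 2.2x. It is a single run of each, so treat it as indicative rather than a benchmark.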
Exposed fast C counter to Python
The counter was written in C.
It was compiled to a shared library, 'counterc.so'.
The shared library was exposed to Python using ctypes.
The C function returns a C struct which is eventually made into a Python dictionary.
main.py has been refactored to use the new counter.
bench.py is provided to test the counter's speed.
The Makefile has not been tested on Linux; I would appreciate some testing there.
C-algorithms is required, installed with --prefix="$HOME".
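As a rough illustration of the struct-to-dictionary step described above, here is a minimal ctypes sketch. The struct layout, the field names, and the function name `count_bases` are assumptions made for the example and may not match what counterc.so actually exports. The call into the shared library is left as a comment so the snippet runs without the compiled library.

```python
import ctypes


class Counts(ctypes.Structure):
    # Hypothetical layout; the real struct in counterc.so may differ.
    _fields_ = [
        ("a", ctypes.c_ulong),
        ("c", ctypes.c_ulong),
        ("g", ctypes.c_ulong),
        ("t", ctypes.c_ulong),
    ]


def struct_to_dict(struct):
    """Copy every declared field of a ctypes Structure into a dict."""
    return {field[0]: getattr(struct, field[0]) for field in struct._fields_}


# With the compiled library it would look roughly like this
# (count_bases is a made-up name):
#   lib = ctypes.CDLL("./counterc.so")
#   lib.count_bases.argtypes = [ctypes.c_char_p]
#   lib.count_bases.restype = Counts
#   counts = struct_to_dict(lib.count_bases(b"AACG"))

# Stand-in for a struct returned from C.
counts = struct_to_dict(Counts(a=2, c=1, g=1, t=0))
print(counts)
```

Setting `restype` to the Structure subclass is what lets ctypes hand back a struct by value, and declaring `argtypes` catches mismatched arguments before they reach C.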
Here is some sample output.