ADT Parser uses excessive memory on medium-large files #3
The current ADT parser implementation in `bap.bir.loads()`, used by default by `bap.run()`, simply does a Python `eval()`. There are severe performance problems with CPython's parser which make this fail on long or "deep" output from BAP. Depending on paging and memory availability a user may get by, but the "no-eval" implementation in this PR does much better.

Even just trying to use the Python `adt` module to parse a long string from BAP hits this issue, so the best solution seemed to be to write a parser for the string that comes from BAP, which is really a subset of what Python's `eval()` accepts. I go out of my way to avoid `eval()` even on substrings, since it is generally slower and uses more memory. I've tried a few different implementations, landing on this final one as the best performer. I have not changed any existing interfaces, but it is also worth considering updating the `bap.adt.ADT.__repr__()` implementation at some point to be more in sync with BAP's output, for easier testing and inspection. My tests monkey-patch it so I can evaluate large files and compare the resulting Project with what comes in from BAP.
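Very roughly, the technique amounts to tokenizing BAP's ADT text and building values with an explicit stack instead of handing the whole string to CPython's compiler. The snippet below is only a sketch of that idea under simplifying assumptions (no escape sequences, no negative numbers, minimal error checking); `parse_adt` and its `constructors` argument are illustrative names, not the interface added in this PR.

```python
import re

# One alternation per token kind: constructor names, (hex) integers,
# double-quoted strings, and punctuation.
_TOKENS = re.compile(
    r'\s*(?:([A-Za-z_]\w*)|(0x[0-9a-fA-F]+|\d+)|("(?:[^"\\]|\\.)*")|([()\[\],]))')

def parse_adt(text, constructors=None):
    """Parse one ADT expression without eval(). `constructors` maps
    constructor names to callables (e.g. classes from bap.adt);
    unknown names are kept as (name, args) tuples."""
    constructors = constructors or {}
    stack = []                      # frames: [kind, name, collected items]
    result = None
    text = text.strip()
    pos = 0
    while pos < len(text):
        m = _TOKENS.match(text, pos)
        if not m:
            raise SyntaxError('unexpected input at offset %d' % pos)
        pos = m.end()
        name, number, string, punct = m.groups()
        value, have_value = None, False
        if name is not None:
            # In BAP's output a name is always a constructor application,
            # so it must be followed by '(' (even for no-arg constructors).
            stack.append(['call', name, []])
            m2 = _TOKENS.match(text, pos)
            if not m2 or m2.group(4) != '(':
                raise SyntaxError('expected "(" after %s' % name)
            pos = m2.end()
        elif number is not None:
            value, have_value = int(number, 0), True
        elif string is not None:
            value, have_value = string[1:-1], True   # escape handling omitted
        elif punct == '(':
            stack.append(['tuple', None, []])
        elif punct == '[':
            stack.append(['list', None, []])
        elif punct == ',':
            pass
        else:                                        # ')' or ']'
            kind, cname, items = stack.pop()
            if kind == 'call':
                ctor = constructors.get(cname)
                value = ctor(*items) if ctor else (cname, tuple(items))
            elif kind == 'tuple':
                value = tuple(items)
            else:
                value = items
            have_value = True
        if have_value:
            if stack:
                stack[-1][2].append(value)
            else:
                result = value
    return result
```

Because the nesting is tracked in a plain Python list rather than on the C call stack, depth only costs a frame object per level, which is what keeps memory flat on large, deeply nested ADT dumps.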
You can easily reproduce the problem by compiling a minimal C program statically (i.e. just `int main() { return 0; }` compiled with `gcc -static main.c`, and then `bap.run('a.out')`). In my tests the text output from BAP is on the order of 100 MB and has tuples or function calls nested more than 80 levels deep. Memory usage gets over 4 GB pretty quickly, and both the size and the depth can make the CPython parser fail. The new parser succeeds, and its extra overhead should be more or less minimal. It can't do anything about the size of the Python objects themselves, but at least the parsing doesn't add any more bloat. Tests can be run with pytest or tox, and the `--slow` argument for pytest runs the 100 MB test described above. Everything works on Python 2 and 3. A future optimization might be to optionally use Cython, but I didn't have time to consider that.
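For reference, the reproduction boils down to roughly the following (this is just a sketch; it assumes `gcc` and a working BAP installation are available, and the file names are arbitrary):

```python
# Reproduce the memory blow-up: build a statically linked binary, then load it.
import subprocess
import bap

with open('main.c', 'w') as f:
    f.write('int main() { return 0; }\n')
subprocess.check_call(['gcc', '-static', 'main.c', '-o', 'a.out'])

# With the old eval()-based loader this consumed gigabytes of memory and could
# fail outright; with the parser in this PR it completes with minimal overhead.
proj = bap.run('a.out')
print(type(proj))
```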
I also made some fixes to the directory layout and `.gitignore` to make development and testing easier and in line with other open-source projects. If you want me to put those in another PR, that's easy, since the commits are separate. I thought I'd put this all in one place to start, since testing this relies on those commits anyway and GitHub doesn't do dependencies between pull requests.