
ADT Parser uses excessive memory on medium-large files #3


Merged
6 commits merged into BinaryAnalysisPlatform:master on Jun 12, 2017

Conversation

mikea1729
Contributor

The current ADT parser implementation in bap.bir.loads(), used by default by bap.run(), simply does a Python eval(). CPython's parser has severe performance problems that make this fail on long or deeply nested output from BAP. Depending on paging and memory availability a user may get by, but the "no-eval" implementation in this PR does much better.
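For context, here is a minimal sketch of what the eval()-based path amounts to. This is an approximation for illustration, not the exact bap.bir.loads() code; in particular the way the constructor environment is built here is an assumption.

```python
# Sketch of the eval()-based loading being replaced (assumption, not the
# exact bap.bir.loads() implementation).
import bap.adt as adt

def loads_with_eval(text):
    # Collect the ADT classes defined in bap.adt so names like Project(...)
    # or Call(...) resolve when the string is evaluated.
    env = {name: obj for name, obj in vars(adt).items() if isinstance(obj, type)}
    # CPython has to tokenize, parse, and compile the entire string before a
    # single object is constructed, which is where the memory goes.
    return eval(text, env)
```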

Even just trying to use the Python adt module to parse a long string from BAP hits this issue, so the best solution seemed to be to write a parser for the string that comes from BAP, which is really a subset of what Python's eval() accepts. I go out of my way to avoid eval even on substrings, since it is generally slower and uses more memory. I've tried a few different implementations and landed on this final one as the best performer. I have not changed any existing interfaces, but at some point it's also worth considering updating the bap.adt.ADT.__repr__() implementation to match BAP's output more closely, for easier testing and inspection. My tests monkey-patch it so I can evaluate large files and compare the resulting Project with what comes in from BAP.
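To make the shape of the approach concrete, here is a rough sketch of the no-eval idea. It is illustrative only, not the code in this PR: hex literals, lists, escape sequences, and error reporting are omitted, and the constructors mapping (names like "Project" or "Call" to bap.adt classes) is assumed to exist.

```python
import re

# Illustrative no-eval parser for the Constructor(arg, ...) subset BAP emits.
# A sketch under simplifying assumptions, not the PR's implementation.
TOKEN = re.compile(r'\d+|"(?:[^"\\]|\\.)*"|\w+|\S')

def parse_adt(text, constructors):
    toks = iter(TOKEN.findall(text))

    def value(tok):
        if tok[0] == '"':
            return tok[1:-1]            # string literal (escapes left as-is)
        if tok.isdigit():
            return int(tok)             # integer literal
        name = tok                      # constructor application: Name(...)
        if next(toks) != '(':
            raise ValueError('expected "(" after %s' % name)
        args, tok = [], next(toks)
        while tok != ')':
            args.append(value(tok))
            tok = next(toks)
            if tok == ',':
                tok = next(toks)
        return constructors[name](*args)

    return value(next(toks))
```

The point is only that the objects are built directly while scanning the string, so nothing beyond the objects themselves has to stay in memory.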

You can easily reproduce the problem by compiling a minimal C program statically (i.e., just int main() { return 0; } compiled with gcc -static main.c) and then calling bap.run('a.out'). In my tests the text output from BAP is on the order of 100 MB and has tuples or function calls nested more than 80 levels deep. Memory usage climbs past 4 GB pretty quickly, and both the size and the depth can make the CPython parser fail. The new parser succeeds, and its overhead should be more or less minimal: it can't do anything about the size of the Python objects themselves, but at least the parsing doesn't add any more bloat. Tests can be run with pytest or tox; pass --slow to pytest to run the 100 MB test described above. Everything works on Python 2 and 3. A future optimization might be to optionally use Cython, but I didn't have time to consider that.
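For anyone who wants to try the reproduction, the steps are roughly the following; file names and the exact output size will vary by toolchain.

```python
# Reproduction sketch for the memory blow-up described above.
#
#   $ echo 'int main() { return 0; }' > main.c
#   $ gcc -static main.c -o a.out
#
import bap

# With the old eval()-based loader this step can consume several GB of RAM
# and may fail outright; with the parser in this PR it completes normally.
proj = bap.run('a.out')
print(type(proj))
```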

I also made some fixes to the directory layout and .gitignore to make development and testing easier and bring them in line with other open-source projects. If you want me to put those in another PR, that's easy since the commits are separate. I thought I'd put this all in one place to start, since testing this relies on those commits anyway and GitHub doesn't do dependencies between pull requests.

Move src/* to src/bap and update setup.py package_dir for workaround
With this, python setup.py develop and pip install -e both work again
Also includes test related files and directories
Much of this comes from Github:karan/joe
Includes tests with both --map-terms-using and --map-terms-with
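As a rough idea of what the setup.py part of those commits amounts to (illustrative, not the exact diff): with the sources moved under src/bap, package_dir tells setuptools where to find the package so that python setup.py develop and pip install -e . work again.

```python
# Sketch of the src/ layout wiring; field values are illustrative.
from setuptools import setup, find_packages

setup(
    name='bap',
    packages=find_packages('src'),   # finds the bap package under src/
    package_dir={'': 'src'},         # maps the root package namespace to src/
)
```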
@ivg
Member

ivg commented Jun 12, 2017

Thanks a lot, this is a great contribution. Sorry for not merging this for so long; I was in fact already using this branch myself :) We will release this version to pip soon.

ivg merged commit 925f2b1 into BinaryAnalysisPlatform:master on Jun 12, 2017