ADT Parser uses excessive memory on medium-large files #3
The current ADT parser implementation in `bap.bir.loads()`, used by default by `bap.run()`, simply does a Python `eval()`. There are severe performance problems with CPython's parser which make this fail on long or "deep" output from BAP. Depending on paging and memory availability a user may get by, but the "no-eval" implementation in this PR does much better.

Even just trying to use the Python `adt` module to parse a long string from BAP hits this issue, so the best solution seemed to be to write a parser for the string that comes from BAP, which is really a subset of what Python's `eval()` accepts. I go out of my way to avoid `eval()` even on substrings, since it is generally slower and uses more memory. I've tried a few different implementations, landing on this final one as the best performer. I have not changed any existing interfaces, but it is also worth considering updating the `bap.adt.ADT.__repr__()` implementation at some point to be more in sync with BAP's output, for easier testing and inspection. My tests monkey-patch it so I can evaluate large files and compare the resulting Project with what comes in from BAP.
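Very roughly, the technique amounts to tokenizing BAP's ADT text and building values with an explicit stack instead of handing the whole string to CPython's compiler. The snippet below is only a sketch of that idea under simplifying assumptions (no escape sequences, no negative numbers, minimal error checking); `parse_adt` and its `constructors` argument are illustrative names, not the interface added in this PR.

```python
import re

# One alternation per token kind: constructor names, (hex) integers,
# double-quoted strings, and punctuation.
_TOKENS = re.compile(
    r'\s*(?:([A-Za-z_]\w*)|(0x[0-9a-fA-F]+|\d+)|("(?:[^"\\]|\\.)*")|([()\[\],]))')

def parse_adt(text, constructors=None):
    """Parse one ADT expression without eval(). `constructors` maps
    constructor names to callables (e.g. classes from bap.adt);
    unknown names are kept as (name, args) tuples."""
    constructors = constructors or {}
    stack = []                      # frames: [kind, name, collected items]
    result = None
    text = text.strip()
    pos = 0
    while pos < len(text):
        m = _TOKENS.match(text, pos)
        if not m:
            raise SyntaxError('unexpected input at offset %d' % pos)
        pos = m.end()
        name, number, string, punct = m.groups()
        value, have_value = None, False
        if name is not None:
            # In BAP's output a name is always a constructor application,
            # so it must be followed by '(' (even for no-arg constructors).
            stack.append(['call', name, []])
            m2 = _TOKENS.match(text, pos)
            if not m2 or m2.group(4) != '(':
                raise SyntaxError('expected "(" after %s' % name)
            pos = m2.end()
        elif number is not None:
            value, have_value = int(number, 0), True
        elif string is not None:
            value, have_value = string[1:-1], True   # escape handling omitted
        elif punct == '(':
            stack.append(['tuple', None, []])
        elif punct == '[':
            stack.append(['list', None, []])
        elif punct == ',':
            pass
        else:                                        # ')' or ']'
            kind, cname, items = stack.pop()
            if kind == 'call':
                ctor = constructors.get(cname)
                value = ctor(*items) if ctor else (cname, tuple(items))
            elif kind == 'tuple':
                value = tuple(items)
            else:
                value = items
            have_value = True
        if have_value:
            if stack:
                stack[-1][2].append(value)
            else:
                result = value
    return result
```

Because the nesting is tracked in a plain Python list rather than on the C call stack, depth only costs a frame object per level, which is what keeps memory flat on large, deeply nested ADT dumps.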
You can easily reproduce the problem by compiling a minimal C program statically (i.e. just `int main() { return 0; }` compiled with `gcc -static main.c`, and then `bap.run('a.out')`). In my tests the text output from BAP is on the order of 100 MB and has tuples or function calls nested more than 80 levels deep. Memory usage gets over 4 GB pretty quickly, and both the size and the depth can make the CPython parser fail. The new parser succeeds, and its extra overhead should be more or less minimal. It can't do anything about the size of the Python objects themselves, but at least the parsing doesn't add any more bloat. Tests can be run with pytest or tox, and the `--slow` argument for pytest runs the 100 MB test described above. Everything works on Python 2 and 3. A future optimization might be to optionally use Cython, but I didn't have time to consider that.
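For reference, the reproduction boils down to roughly the following (this is just a sketch; it assumes `gcc` and a working BAP installation are available, and the file names are arbitrary):

```python
# Reproduce the memory blow-up: build a statically linked binary, then load it.
import subprocess
import bap

with open('main.c', 'w') as f:
    f.write('int main() { return 0; }\n')
subprocess.check_call(['gcc', '-static', 'main.c', '-o', 'a.out'])

# With the old eval()-based loader this consumed gigabytes of memory and could
# fail outright; with the parser in this PR it completes with minimal overhead.
proj = bap.run('a.out')
print(type(proj))
```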
I also made some fixes to the directory layout and `.gitignore` to make development and testing easier and in line with other open-source projects. If you want me to put those in another PR, that's easy, since the commits are separate. I thought I'd put this all in one place to start, since testing this relies on those commits anyway and GitHub doesn't do dependencies between pull requests.