Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Greatly improve performance #1002

Closed
wants to merge 4 commits into from
Closed

Greatly improve performance #1002

wants to merge 4 commits into from

Conversation

rsBNT
Copy link
Contributor

@rsBNT rsBNT commented Nov 21, 2017

by not checking file tokens but instead using the filecontent directly to compute the checksum.

Tokenizing is expensive so by checking if the file itself changed (alongside the toolinfo data) it can be quickly evaluated if the file needs rechecking.

This of course only works if --cppcheck-build-dir is used and a previous run generated an analyzerinfo file.

This basically implements incremental/differential file checking, so in codebases with few files changing, cppcheck will be incredibly fast.

Caveats i see are:

  • reading the whole file possibly twice
  • the checksum might produce collisions (maybe change to SHA1?)
  • more sophisticated algorithm could be implemented

@Dmitry-Me
Copy link
Collaborator

What are the numbers for the performance improvement?

@rsBNT
Copy link
Contributor Author

rsBNT commented Nov 22, 2017

Running in Release|x64 against a solution with 9 projects and 398 files to check i see the following behaviour (real value from time):

main

initial run (clean build directory): 7m40.298s
first rerun (no changes): 7m22.985s
second rerun (no changes): 7m16.549s

checkDiffOnly

initial run (clean build directory): 7m30.506s
first rerun (no changes): 0m3.551s
second rerun (no changes): 0m3.731s

There might be a bug in how cppcheck treats the files in the build dir currently, i need to investigate further.

@rsBNT
Copy link
Contributor Author

rsBNT commented Nov 23, 2017

I investigated the behaviour and created PR #1005

@gj12
Copy link

gj12 commented Nov 23, 2017

You can use the linux kernel source to test the performance of cppcheck. linux 4.14.1 has 61258 files and 23050598 lines of code.

@rsBNT
Copy link
Contributor Author

rsBNT commented Nov 23, 2017

Do you think it would make any difference? Feel free to check the performance against any codebase you want. Do you have values to compare against?

@danmar
Copy link
Owner

danmar commented Nov 29, 2017

Somehow you did not run old cppcheck correctly. We have incremental analysis already so the second run should be much faster than the first run.

@danmar
Copy link
Owner

danmar commented Nov 29, 2017

numbers I get for "main":

first run: 3m29,520s
second run: 0m2,604s

@rsBNT
Copy link
Contributor Author

rsBNT commented Nov 30, 2017

You are right, when scanning the cppcheck source on branch main (Release|x64, 223 files) via

$ mkdir -p /tmp/cppcheckCache
$ time ./bin/cppcheck.exe -j 8 --inline-suppr --xml-version=2 --cppcheck-build-dir="/tmp/cppcheckCache" . 2>/tmp/cppcheck_result.xml 1>/tmp/cppcheck_output.txt

I get

  1. run: 0m21.169s 0m21.063s 0m20.891s -> avg: 0m21.041s
  2. run: 0m3.474s 0m3.231s 0m3.271s -> avg: 0m3.325s

Running the same on branch checkDiffOnly:

  1. run: 0m23.073s 0m20.917s 0m20.883s -> avg: 0m21.624s
  2. run: 0m2.852s 0m2.846s 0m2.701s -> avg: 0m2.799s

So obviously for the performance improvement (not the incremental issue) a bigger codebase is needed. Running it against linux kernel showed weird behaviour (the checking was done but the process never returned, bug?) so i ran it against the LLVM source (2309 files) which resulted in (measured 3x, wiped the caches in between)

main

  1. run: 4m24.997s 4m37.194s 4m32.455s -> avg: 4m31.549s
  2. run: 0m15.463s 0m12.244s 0m12.403s -> avg: 0m13.370s

checkDiffOnly

  1. run: 4m39.316s 4m35.806s 4m41.700s -> avg: 4m38.941s
  2. run: 0m10.839s 0m12.244s 0m12.464s -> avg: 0m11.849s

To check whether the penalty for the first run (reading file twice) is gone when not using a builddir i ran the same command without --cppcheck-build-dir:

4m33.046s 4m23.384s 4m27.744s -> avg: 4m28.058s

For the incremental issue, i will investigate further, i suspect --project is the culprit.

@rsBNT
Copy link
Contributor Author

rsBNT commented Dec 9, 2017

After some more investigation it is clear that the call to preprocessor.loadFiles(tokens1, files); is the problem when running a scan via --project where lots of include paths were defined and many files need to be loaded. In my testproject i have many include paths (Qt, boost, GTest), in order to generate tokens, it traverses around 33 files from GTest alone. So the actual bottleneck is indeed the filesystem and it boils down to simplecpp::realFileName(). Since the included files are part of the tokens, my initial statement is still valid: moving the do i need to recheck this file-check before the tokenizing greatly improves re-scan times.

If i find time, i will try to create a MWE.

After writing this, i noticed: if the headers are not part of the condition but did change, cppcheck would probably have to rescan the file, right? This would not be the case when only looking at the hash of the file and not incorporate the included files somehow. @danmar what do you think?

@danmar
Copy link
Owner

danmar commented Dec 9, 2017

yes if a header is changed then it is preferable to rescan the source files that include it.

@rsBNT
Copy link
Contributor Author

rsBNT commented Dec 9, 2017

Then this PR is invalid.

I see two more approaches to solving the problem:

  • Be able to mark specific headers as system headers so they are assumed to be constant.
  • Cache the loading of header files. This will increase performance but also increase memory demand.

Both approaches are somewhat more work and will need some research and time.

@rsBNT rsBNT closed this Dec 9, 2017
@rsBNT
Copy link
Contributor Author

rsBNT commented Dec 9, 2017

For future investigations, this MWE contains a stripped down Google Test (to reduce the size, only headers and sources are included), a Property Sheet which references Google Test (adds two include paths) and three minimal files (to resemble a valid Google Test project). There are four configurations (Release/Debug, Win32/x64) and the project is scanned via (using Git Bash shell (cppcheck main built for Release|x64):

time ./bin/cppcheck --cppcheck-build-dir="/tmp/cppcheck/" --project="/tmp/checkDiffOnly/checkDiffOnly.sln" 2>"/tmp/cppcheck/result.xml" 1>"/tmp/cppcheck/output.txt"

First run with empty build dir and filesystem caches:

real 0m8.579s

Second run with build dir filled but filesystem caches wiped:

real 0m4.117s

Third run without wiping filesystem caches:

real 0m3.051s

One can deduct that tokenizing (including 39 headers from GTest) takes around 50% of the execution time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants