Description
Current index feels much slower than it ought to be. I am creating this issue to track work on this topic.
See this message by Franek for his observations (I/O wait bottleneck, database caching, OOM issues).
I've got proofs of concept for the following topics:
- Rework of `update.py`. It is suboptimal: having to take locks to write into the databases makes everything really slow and (I believe) explains most of the performance issues, which are caused by I/O wait. This can be confirmed by running the same commands without doing any processing on the output: it is much faster. I have a PoC for solving this; it does all database accesses in the main thread (and uses `multiprocessing.Pool` to spawn sub-processes).
- Improve individual commands of `script.sh`. Some commands are more wasteful than needed.
  - The `sed(1)` call in `list-blobs` is a big bottleneck for no specific reason. Fixing it won't be a massive time saver, as we are talking about a second per tag.
  - `find-file-doc-comments.pl` in `parse-docs` is really expensive. We could avoid calling it on files that we know cannot contain any doc comment.
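To illustrate the `update.py` rework idea (workers only compute, the main process does every database write, so no locks are needed), here is a minimal sketch. It is not the actual PoC: `extract_tokens`, `store` and `index_blobs` are hypothetical names, the worker is a stand-in for what `script.sh` does, and a plain `dict` stands in for the real database.

```python
import multiprocessing


def extract_tokens(blob):
    """Worker: pure computation, no database access.

    Hypothetical stand-in for the per-blob processing done via script.sh.
    """
    blob_id, text = blob
    return blob_id, sorted(set(text.split()))


def store(db, results):
    """Write results into db. Only ever called from the main process,
    so the writes need no locking."""
    for blob_id, tokens in results:
        for token in tokens:
            db.setdefault(token, []).append(blob_id)
    return db


def index_blobs(blobs, processes=None):
    db = {}  # assumption: a dict standing in for the real database
    with multiprocessing.Pool(processes) as pool:
        # Results stream back as workers finish; the iterator is consumed
        # here, inside the "with" block, before the pool is torn down.
        store(db, pool.imap_unordered(extract_tokens, blobs))
    return db
```

Calling `index_blobs([(1, "foo bar"), (2, "bar baz")], processes=2)` should be done under an `if __name__ == "__main__":` guard so it also works with start methods that re-import the main module. Since `imap_unordered` yields results in completion order, the order of blob ids within each value is not deterministic in this sketch.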
With those combined, on the first 5 Linux tags, I get wallclock/usr/sys of 126s/1017s/395s versus 1009s/1341s/490s with the old `update.py`, which is roughly 8x faster in wall-clock time. For the old `update.py`, I passed my CPU count (i.e. 20) as the argument.
Those changes will require a way to compare databases; see this message for the reasoning behind it. Solutions to this are either a custom Python script or a shell script that uses `db_dump -p` and `diff`, as recommended here.
There could, however, be other ways to improve performance. Whether those are worth it is the question. Probably not.
- We might want to change the overall structure: calling into a shell script for each blob, spawning multiple processes, is not the fastest way to solve the problem. We could have `script.sh` commands take multiple blobs.
- Or we could avoid `script.sh` and call `ctags` or tokenize by ourselves.
- We could change the database structure. The current database compresses well (14G becomes 5.2G after `zstd -1`), which means there is superfluous information. The value format could be optimized, possibly made binary.
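The first idea in the list (commands taking multiple blobs) boils down to paying the process start-up cost once per batch instead of once per blob. A minimal sketch of that, assuming a hypothetical command that accepts several blob ids as trailing arguments (`script.sh` does not necessarily work that way today):

```python
import itertools
import subprocess


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(itertools.islice(it, size)):
        yield batch


def run_batched(command, blob_ids, size=256):
    """Invoke `command` once per batch of blob ids, not once per blob.

    `command` is a list such as ["./some-script", "subcommand"]; the ids of
    each batch are appended as extra arguments (a hypothetical interface).
    """
    outputs = []
    for batch in batches(blob_ids, size):
        proc = subprocess.run([*command, *batch], check=True,
                              capture_output=True, text=True)
        outputs.append(proc.stdout)
    return outputs
```

The batch size of 256 is an arbitrary placeholder. In practice it would need to stay under the system's argument length limit, or the ids could be fed on stdin instead.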