
Improve indexing performance (update.py) #289

@tleb

Description

Indexing currently feels much slower than it ought to be. I am creating this issue to track work on this topic.

See this message by Franek for his observations (I/O wait bottleneck, database caching, OOM issues).

I've got proof-of-concepts for the following topics:

  • Rework of update.py. The current design is suboptimal: having to take locks to write into the databases makes everything really slow and (I believe) explains most of the performance issues (the I/O wait). This can be confirmed by running the same commands without doing any processing on their output: it is much faster. I have a PoC that solves this by doing all database accesses in the main process and using multiprocessing.Pool to spawn worker processes.
  • Improve individual commands of script.sh. Some commands are more wasteful than needed.
    • The sed(1) call in list-blobs is a big bottleneck for no specific reason. This won't be a massive time saver as we are talking about a second per tag.
    • find-file-doc-comments.pl in parse-docs is really expensive. We could avoid calling it on files that we know cannot contain any doc comments.
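
The update.py rework above can be sketched as follows. This is a minimal sketch, not the actual PoC: the dict stands in for the Berkeley DB handles, and process_blob stands in for the real per-blob work (ctags, tokenizing, doc parsing).

```python
import multiprocessing

def process_blob(blob):
    # CPU-heavy work (running ctags, tokenizing, ...) happens in the
    # worker; this stand-in just "parses" the blob into definitions.
    return blob, ["def_" + blob]

def index_blobs(blobs, nprocs=20):
    # All database writes happen in the main process, so no locks are
    # needed around the DB handles and workers never stall on them.
    db = {}  # stand-in for the Berkeley DB files update.py writes to
    with multiprocessing.Pool(nprocs) as pool:
        for blob, defs in pool.imap_unordered(process_blob, blobs):
            db[blob] = defs  # single writer: the main process
    return db
```

imap_unordered lets the main process consume results as soon as any worker finishes, which keeps the single writer busy instead of blocking on the slowest worker.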

Those combined, for the first 5 Linux tags, I get wallclock/usr/sys of 126s/1017s/395s versus 1009s/1341s/490s. For the old update.py, I passed my CPU count (20) as the argument.

Those changes will require a way to compare databases; see this message for the reasoning behind that. Solutions are either a custom Python script or a shell script that uses db_dump -p and diff, as recommended here.
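
The db_dump -p plus diff approach could be wrapped in a small Python script along these lines (a sketch; dump_db and diff_dumps are hypothetical helper names, and it assumes Berkeley DB's db_dump(1) is on PATH):

```python
import difflib
import subprocess

def dump_db(path):
    # db_dump -p prints keys and values as printable text,
    # which makes the output line-oriented and diff-friendly.
    return subprocess.run(["db_dump", "-p", path],
                          check=True, capture_output=True, text=True).stdout

def diff_dumps(old, new):
    # Unified diff of two textual dumps; an empty result means the
    # databases hold identical key/value pairs.
    return "".join(difflib.unified_diff(old.splitlines(keepends=True),
                                        new.splitlines(keepends=True),
                                        "old", "new"))
```

A comparison run would then be diff_dumps(dump_db("old.db"), dump_db("new.db")), failing the check whenever the result is non-empty.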

There could however be other ways to improve performance. The question is whether they are worth it. Probably not.

  • We might want to change the overall structure: spawning multiple processes and calling into a shell script for each blob is not the fastest approach. We could have script.sh commands take multiple blobs.
  • Or we could bypass script.sh and call ctags or tokenize ourselves.
  • We could change the database structure. The current database compresses well (14G becomes 5.2G after zstd -1), which suggests it stores redundant information. The value format could be optimized, possibly made binary.
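
To illustrate the last point, a binary value format could pack fixed-width integers instead of textual lists. This is a sketch with a made-up layout (a u32 count followed by u32 blob-id/line pairs), not Elixir's actual schema:

```python
import struct

def pack_refs(refs):
    # Hypothetical binary value: u32 count, then one (u32 blob id,
    # u32 line number) pair per reference, all little-endian.
    out = struct.pack("<I", len(refs))
    for blob_id, line in refs:
        out += struct.pack("<II", blob_id, line)
    return out

def unpack_refs(data):
    (count,) = struct.unpack_from("<I", data, 0)
    return [struct.unpack_from("<II", data, 4 + 8 * i)
            for i in range(count)]
```

Each reference costs a fixed 8 bytes here, versus a variable-length decimal string plus separators in a textual format, which is where the zstd-visible redundancy would go away.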


    Labels

    indexing — Related to the index content: missing definitions/references, lexer bugs, new ctags features...
