Improve indexing performance (`update.py`) #289

Indexing currently *feels* much slower than it ought to be. I am creating this issue to track work on this topic.

See [this message](https://github.com/bootlin/elixir/pull/288#issuecomment-2176775615) by Franek for his observations (I/O wait bottleneck, database caching, OOM issues).

I've got proof-of-concepts for the following topics:

 - Rework of `update.py`. It is suboptimal: having to take locks to write into the databases makes everything really slow and (I believe) explains most of the performance issues (caused by I/O wait). This can be confirmed by running the same commands without doing any processing on the output: it is much faster. I have a PoC that solves this by doing all database accesses in the main thread and using `multiprocessing.Pool` to spawn sub-processes.
 - Improve individual commands of `script.sh`. Some commands are more wasteful than needed.
    - The `sed(1)` call in `list-blobs` is a needless bottleneck. Fixing it won't be a massive time saver, as we are talking about a second per tag.
    - `find-file-doc-comments.pl` in `parse-docs` is really expensive. We could avoid calling it on files that we know cannot contain any doc comment.
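
To illustrate the first point, here is a minimal sketch of the "workers parse, main process writes" pattern. It is not the actual PoC: `process_blob`, `apply_results`, and `index_blobs` are hypothetical names, the tokenizing is a stand-in for running a `script.sh` command, and a plain `dict` stands in for the real databases.

```python
from multiprocessing import Pool


def process_blob(blob):
    """Worker side: parse one blob and return plain data. No database access here."""
    blob_id, content = blob
    # Hypothetical stand-in for running a script.sh command on the blob.
    tokens = sorted(set(content.split()))
    return blob_id, tokens


def apply_results(db, results):
    """Main-process side: the only place that writes, so no locks are needed."""
    for blob_id, tokens in results:
        for token in tokens:
            db.setdefault(token, []).append(blob_id)
    return db


def index_blobs(blobs, db, workers=4):
    """Fan parsing out to sub-processes and funnel every write through the parent."""
    with Pool(processes=workers) as pool:
        # imap_unordered yields results as soon as any worker finishes one.
        results = pool.imap_unordered(process_blob, blobs, chunksize=16)
        return apply_results(db, results)


if __name__ == "__main__":
    blobs = [(1, "int main void"), (2, "void helper int")]
    db = index_blobs(blobs, {})
    print(sorted(db["int"]))
```

Because only the parent process touches `db`, workers never contend for a database lock; they only send their results back over the pool's result queue.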

With those changes combined, for the first 5 Linux tags, I get wallclock/usr/sys of 126s/1017s/395s versus 1009s/1341s/490s with the old `update.py`, which is about 8x faster in wallclock time. For the old `update.py`, I passed my CPU count as argument, i.e. `20`.

Those changes will require a way to compare databases; see [this message](https://github.com/bootlin/elixir/pull/288#issuecomment-2176609589) for the reasoning behind this. Possible solutions are either a custom Python script or a shell script that uses `db_dump -p` and `diff`, as recommended [here](https://github.com/bootlin/elixir/pull/288#issuecomment-2176751408).
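
As a rough idea of what the custom Python script could look like, the sketch below compares two dumps that were already produced with `db_dump -p`. It assumes the usual print-format layout: header lines up to `HEADER=END`, then alternating key and value lines each prefixed by one space, then `DATA=END`. The function names are hypothetical, and escaped bytes are compared as raw text rather than decoded.

```python
def parse_db_dump(text):
    """Parse `db_dump -p` output into a dict of key -> value (escaped strings)."""
    records = {}
    in_data = False
    pending_key = None
    for line in text.splitlines():
        if line == "HEADER=END":
            in_data = True
            continue
        if line == "DATA=END":
            in_data = False
            pending_key = None
            continue
        if not in_data:
            continue  # skip header fields such as VERSION= or type=
        entry = line[1:] if line.startswith(" ") else line
        if pending_key is None:
            pending_key = entry  # first line of a pair is the key
        else:
            records[pending_key] = entry  # second line is its value
            pending_key = None
    return records


def compare_dumps(old_text, new_text):
    """Return (missing keys, extra keys, keys whose value changed)."""
    old, new = parse_db_dump(old_text), parse_db_dump(new_text)
    missing = sorted(old.keys() - new.keys())
    extra = sorted(new.keys() - old.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return missing, extra, changed
```

Compared with a plain `diff`, this reports differences per key, which may be easier to read when only a few entries change in a large database.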

There are other possible avenues for improving performance. Whether they are worth it is the question; probably not.

 - We might want to change the overall structure: calling into a shell script for each blob, spawning multiple processes, is not the fastest way to solve the problem. We could have `script.sh` commands take multiple blobs.
 - Or we could avoid `script.sh` and call `ctags` or tokenize by ourselves.
 - We could change the database structure. The current database compresses well (14G becomes 5.2G after `zstd -1`), which means it contains redundant information. The value format could be optimized, possibly made binary.
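
To make the last point concrete, here is a small sketch comparing a textual value with a fixed-width binary one. The record layout (a blob id and a line number per reference) and the `id:line` text form are assumptions made up for illustration; they are not the actual Elixir value format.

```python
import struct

# Hypothetical layout: two little-endian unsigned 32-bit ints (blob id, line).
RECORD = struct.Struct("<II")


def encode_text(refs):
    """Text form, e.g. b"123456:1024,123457:88"."""
    return ",".join(f"{blob_id}:{line}" for blob_id, line in refs).encode("ascii")


def encode_binary(refs):
    """Binary form: 8 bytes per reference, no separators."""
    return b"".join(RECORD.pack(blob_id, line) for blob_id, line in refs)


def decode_binary(data):
    """Inverse of encode_binary: returns a list of (blob id, line) tuples."""
    return [RECORD.unpack_from(data, offset)
            for offset in range(0, len(data), RECORD.size)]


refs = [(123456, 1024), (123457, 88), (200001, 31337)]
print(len(encode_text(refs)), len(encode_binary(refs)))
```

Fixed-width records only win when the numbers are large; for small ids and line numbers, a variable-length integer encoding would likely do better, so the gain would have to be measured on real data.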
