Partitions with similar size
Description
Right now, our partition implementation uses one partition per repository. This means partitions can differ widely in size, which causes problems such as very long processing times for 5% of the total number of repositories (a single thread parsing a huge repository).
With this proposal, I want to normalize partition sizes by changing the way we split the data across threads.
Proposal
Change the current partition ID:
[REPOSITORY_ID]
To:
[REPOSITORY_ID]+[PACKFILE_HASH]+[OFFSET_FROM]+[OFFSET_TO]
Values description
- [REPOSITORY_ID]: gitbase ID identifying the repository. Examples: gitbase, kubernetes, spark.
- [PACKFILE_HASH]: packfile hash. Example: ccbe18d4ba50a14a28855a7b53880d93d87dc498
- [OFFSET_FROM]: packfile offset of the first object in the partition. Example: 23144
- [OFFSET_TO]: packfile offset of the last object in the partition. Example: 7009211
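As a minimal sketch of how such an ID could be built and parsed, assuming the "+" characters in the format above are literal separators: the PartitionKey type and ParsePartitionKey function are hypothetical names used only for illustration, not existing gitbase code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PartitionKey is a hypothetical type holding the four parts of the proposed ID.
type PartitionKey struct {
	RepositoryID string
	PackfileHash string
	OffsetFrom   uint64
	OffsetTo     uint64
}

// String renders the key as [REPOSITORY_ID]+[PACKFILE_HASH]+[OFFSET_FROM]+[OFFSET_TO].
// Treating "+" as a literal separator is an assumption of this sketch.
func (k PartitionKey) String() string {
	return fmt.Sprintf("%s+%s+%d+%d", k.RepositoryID, k.PackfileHash, k.OffsetFrom, k.OffsetTo)
}

// ParsePartitionKey reads the last three fields from the right, so a
// repository ID that itself contains "+" still parses.
func ParsePartitionKey(s string) (PartitionKey, error) {
	parts := strings.Split(s, "+")
	n := len(parts)
	if n < 4 {
		return PartitionKey{}, fmt.Errorf("malformed partition key %q", s)
	}
	from, err := strconv.ParseUint(parts[n-2], 10, 64)
	if err != nil {
		return PartitionKey{}, fmt.Errorf("invalid OFFSET_FROM: %w", err)
	}
	to, err := strconv.ParseUint(parts[n-1], 10, 64)
	if err != nil {
		return PartitionKey{}, fmt.Errorf("invalid OFFSET_TO: %w", err)
	}
	if from > to {
		return PartitionKey{}, fmt.Errorf("OFFSET_FROM %d is after OFFSET_TO %d", from, to)
	}
	return PartitionKey{
		RepositoryID: strings.Join(parts[:n-3], "+"),
		PackfileHash: parts[n-3],
		OffsetFrom:   from,
		OffsetTo:     to,
	}, nil
}

func main() {
	k := PartitionKey{
		RepositoryID: "kubernetes",
		PackfileHash: "ccbe18d4ba50a14a28855a7b53880d93d87dc498",
		OffsetFrom:   23144,
		OffsetTo:     7009211,
	}
	fmt.Println(k.String())
}
```

Parsing from the right is a design choice of the sketch: the hash and the offsets cannot contain "+", while nothing in the proposal rules it out for repository IDs.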
How to repartition
We need to decide how many objects each partition will contain in order to generate the offsets. That object count will be a fixed constant and must not change. If it were configurable, indexes and other internal modules would stop working correctly.
Partitions will be computed by consulting the packfile indexes when PartitionIter is requested. All partitions will contain the same number of objects, except the last one. This operation should be fast enough to run on every query. We can cache the partition names if necessary.
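The splitting step could look roughly like the sketch below. It assumes the object offsets have already been read from a packfile index; the SplitOffsets function, the OffsetRange type, and the value of objectsPerPartition are illustrative assumptions, since the proposal does not fix the object count.

```go
package main

import (
	"fmt"
	"sort"
)

// objectsPerPartition is an illustrative value only; the proposal leaves
// the actual number to be decided.
const objectsPerPartition = 3

// OffsetRange is a hypothetical type: the inclusive packfile offsets of the
// first and last object in one partition.
type OffsetRange struct {
	From uint64
	To   uint64
}

// SplitOffsets sorts the object offsets of one packfile and cuts them into
// consecutive ranges of `size` objects. Only the last range may be smaller.
func SplitOffsets(offsets []uint64, size int) []OffsetRange {
	if size <= 0 || len(offsets) == 0 {
		return nil
	}
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]uint64(nil), offsets...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	ranges := make([]OffsetRange, 0, (len(sorted)+size-1)/size)
	for start := 0; start < len(sorted); start += size {
		end := start + size
		if end > len(sorted) {
			end = len(sorted)
		}
		ranges = append(ranges, OffsetRange{From: sorted[start], To: sorted[end-1]})
	}
	return ranges
}

func main() {
	// Made-up offsets standing in for the entries of a packfile index.
	offsets := []uint64{12, 340, 95, 1200, 777, 2048, 4100}
	for _, r := range SplitOffsets(offsets, objectsPerPartition) {
		fmt.Printf("%d-%d\n", r.From, r.To)
	}
}
```

Because the ranges depend only on the packfile content and the constant, recomputing them per query gives the same partition names each time, which is what makes caching them safe.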
Taking repository updates into account, this approach is still valid. Partitions are tied to the packfile hash, so their content cannot change; packfiles can only be deleted or newly created.
How to handle these partitions in each main table
Object tables (Commits, Blobs, Trees, Tree entries)
Each partition will iterate only the objects within its range, so no object will be duplicated in the output.
References
Only references pointing to an object inside the partition range will be output for that partition. We can use the packfile indexes to check this, which should be fast.
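The reference filter can be sketched as follows. The map from object hash to offset stands in for a packfile index lookup; Ref, OffsetRange, and RefsInPartition are hypothetical names, and the short hashes are placeholders.

```go
package main

import "fmt"

// Ref is a hypothetical, simplified reference: a name and the hash it points to.
type Ref struct {
	Name   string
	Target string
}

// OffsetRange is a hypothetical type: the inclusive offsets covered by a partition.
type OffsetRange struct {
	From uint64
	To   uint64
}

// Contains reports whether a packfile offset falls inside the partition.
func (r OffsetRange) Contains(off uint64) bool {
	return off >= r.From && off <= r.To
}

// RefsInPartition keeps the references whose target object lives inside the
// given range. `index` maps object hash to packfile offset and is an
// assumption standing in for the real packfile index.
func RefsInPartition(refs []Ref, index map[string]uint64, r OffsetRange) []Ref {
	var out []Ref
	for _, ref := range refs {
		// Targets missing from this packfile's index belong to another packfile.
		if off, ok := index[ref.Target]; ok && r.Contains(off) {
			out = append(out, ref)
		}
	}
	return out
}

func main() {
	index := map[string]uint64{"aaa": 100, "bbb": 5000} // placeholder hashes
	refs := []Ref{
		{Name: "refs/heads/master", Target: "aaa"},
		{Name: "refs/heads/dev", Target: "bbb"},
		{Name: "refs/tags/v1", Target: "ccc"},
	}
	for _, ref := range RefsInPartition(refs, index, OffsetRange{From: 0, To: 1000}) {
		fmt.Println(ref.Name)
	}
}
```

Since every object offset falls in exactly one range of one packfile, each reference is emitted by at most one partition.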
Files
Files will appear only in the partition where the root tree entry is present. It might be necessary to fetch tree_entries from other packfile ranges, using the repository, to generate the path.
Caveats
Problems
- We need to check how go-git behaves when several threads are using it.
Benefits
- Partitions will be more homogeneous and will take roughly the same time to process, giving the user a good approximation of the size of the query and the remaining time.