
Create partitions with more similar size #619

Open
@ajnavarro

Partitions with similar size

Description

Right now, our partition implementation uses one partition per repository. That means partitions can have very different sizes, which causes problems such as very long processing times for 5% of the total number of repositories (only one thread parsing a huge repo).

With this proposal, I want to normalize partition sizes by changing the way we split the data across threads.

Proposal

Change the current partition ID:

[REPOSITORY_ID]

To:

[REPOSITORY_ID]+[PACKFILE_HASH]+[OFFSET_FROM]+[OFFSET_TO]

Values description

  • [REPOSITORY_ID]: gitbase ID identifying the repository. Examples: gitbase, kubernetes, spark.
  • [PACKFILE_HASH]: packfile hash. Example: ccbe18d4ba50a14a28855a7b53880d93d87dc498
  • [OFFSET_FROM]: packfile offset of the first object in that partition. Example: 23144
  • [OFFSET_TO]: packfile offset of the last object in that partition. Example: 7009211
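To make the ID format concrete, here is a minimal Go sketch of how such an ID could be built and parsed. The `PartitionKey` type and `ParsePartitionKey` function are hypothetical names for illustration, not existing gitbase code, and using `+` as the field separator is an assumption taken from the format above. Parsing works from the right so that a repository ID containing `+` would still be handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PartitionKey is a hypothetical type mirroring the four proposed ID fields.
type PartitionKey struct {
	RepositoryID string
	PackfileHash string
	OffsetFrom   uint64
	OffsetTo     uint64
}

// String renders [REPOSITORY_ID]+[PACKFILE_HASH]+[OFFSET_FROM]+[OFFSET_TO].
func (k PartitionKey) String() string {
	return fmt.Sprintf("%s+%s+%d+%d",
		k.RepositoryID, k.PackfileHash, k.OffsetFrom, k.OffsetTo)
}

// ParsePartitionKey reads the last three fields from the right and treats
// everything before them as the repository ID.
func ParsePartitionKey(s string) (PartitionKey, error) {
	parts := strings.Split(s, "+")
	n := len(parts)
	if n < 4 {
		return PartitionKey{}, fmt.Errorf("invalid partition key %q", s)
	}
	from, err := strconv.ParseUint(parts[n-2], 10, 64)
	if err != nil {
		return PartitionKey{}, fmt.Errorf("invalid OFFSET_FROM: %w", err)
	}
	to, err := strconv.ParseUint(parts[n-1], 10, 64)
	if err != nil {
		return PartitionKey{}, fmt.Errorf("invalid OFFSET_TO: %w", err)
	}
	if from > to {
		return PartitionKey{}, fmt.Errorf("OFFSET_FROM %d is after OFFSET_TO %d", from, to)
	}
	return PartitionKey{
		RepositoryID: strings.Join(parts[:n-3], "+"),
		PackfileHash: parts[n-3],
		OffsetFrom:   from,
		OffsetTo:     to,
	}, nil
}
```

With the example values above, the rendered ID would be `gitbase+ccbe18d4ba50a14a28855a7b53880d93d87dc498+23144+7009211`.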

How to do repartition

We need to decide how many objects go into each partition in order to generate the offsets. That object count will be a hard-coded magic number and will never change. If we made it configurable, indexes and other internal modules would stop working correctly.

Partitions will be computed by consulting the packfile indexes when a PartitionIter is requested. All partitions will contain the same number of objects, except the last one. This operation should be fast enough to run on every query, and we can cache the partition names if necessary.
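The splitting step could look roughly like the sketch below. It assumes the object offsets have already been read from a packfile index into a plain slice; `OffsetRange` and `SplitOffsets` are hypothetical names, and nothing here calls go-git. Every range holds `perPartition` objects except possibly the last one.

```go
package main

import "sort"

// OffsetRange is a hypothetical pair of packfile offsets: the first and
// last object of one partition, both inclusive.
type OffsetRange struct {
	From uint64
	To   uint64
}

// SplitOffsets groups object offsets into consecutive ranges of
// perPartition objects each. The input slice is not modified.
func SplitOffsets(offsets []uint64, perPartition int) []OffsetRange {
	if perPartition <= 0 || len(offsets) == 0 {
		return nil
	}

	// Work on a sorted copy so ranges follow packfile order.
	sorted := append([]uint64(nil), offsets...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	ranges := make([]OffsetRange, 0, (len(sorted)+perPartition-1)/perPartition)
	for start := 0; start < len(sorted); start += perPartition {
		end := start + perPartition
		if end > len(sorted) {
			end = len(sorted) // the last partition may be smaller
		}
		ranges = append(ranges, OffsetRange{From: sorted[start], To: sorted[end-1]})
	}
	return ranges
}
```

Because the result depends only on the packfile's offsets and the fixed object count, the same packfile should always yield the same ranges, which is what makes caching the partition names possible.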

This approach remains valid when repositories are updated. Partitions are tied to the packfile hash, so the content of a partition cannot change; packfiles can only be deleted or newly created.

How to handle these partitions in each main table

Object tables (Commits, Blobs, Trees, Tree entries)

Each partition will only iterate over the objects in its own range, so no objects will be duplicated in the output.

References

Only references that point to an object inside the partition range will be output by that partition. We can use the packfile indexes to check this, and it should be fast.
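As a rough illustration of the reference filter, the sketch below keeps only the references whose target object falls inside a partition's offset range. The `Reference` type and the `offsetOf` map are assumptions standing in for a real packfile index lookup (hash to offset); they are not go-git or gitbase APIs.

```go
package main

// Reference is a hypothetical, simplified reference: a name and the hash
// of the object it points to.
type Reference struct {
	Name string
	Hash string
}

// RefsInRange returns the references whose target object sits at a packfile
// offset within [from, to], both inclusive. offsetOf stands in for the
// packfile index and maps an object hash to its offset in that packfile.
func RefsInRange(refs []Reference, offsetOf map[string]uint64, from, to uint64) []Reference {
	var out []Reference
	for _, r := range refs {
		off, ok := offsetOf[r.Hash]
		if !ok {
			continue // target object is not stored in this packfile
		}
		if off >= from && off <= to {
			out = append(out, r)
		}
	}
	return out
}
```

Since every object has exactly one offset in a packfile and the ranges do not overlap, each reference would be emitted by at most one partition of that packfile.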

Files

Files will appear only in the partition where the root tree entry is present. It might be necessary to fetch tree_entries from other packfile ranges, using the repository, to generate the path.

Caveats

Problems

  • We need to check how go-git behaves when several threads use it at the same time.

Benefits

  • Partitions will be more homogeneous and will take roughly the same time to process, giving the user a good estimate of the size of the query and the remaining time.

Labels: performance, proposal, research
