Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build: simplify _config attribute data modelling #10586

Open
humitos opened this issue Jul 31, 2023 · 4 comments
Open

Build: simplify _config attribute data modelling #10586

humitos opened this issue Jul 31, 2023 · 4 comments
Labels
Accepted Accepted issue on our roadmap Feature New feature

Comments

@humitos
Copy link
Member

humitos commented Jul 31, 2023

We are currently using a models.JSONField for the Build._config attribute. This field can be used in two different ways:

  1. store the whole YAML configuration file
  2. save a build ID where the full/whole YAML configuration file is stored

This is basically to avoid duplicating the data on our database. However, it only avoids duplicated config on the same projects and only for consecutive builds.

I found this approach pretty confusing when exploring Build objects because we have to do extra operations to find out the real YAML used for that build (see for example the config @property at

@property
def config(self):
"""
Proxy to the configuration of the build.
:returns: The configuration used in the last successful build.
:rtype: dict
"""
last_build = (
self.builds(manager=INTERNAL).filter(
state=BUILD_STATE_FINISHED,
success=True,
).order_by('-date')
.only('_config')
.first()
)
if last_build:
return last_build.config
return None
)

Also, it makes hard to answer questions like "What are the projects using build.os: ubuntu-22.04?" in an easy way and many other similar questions that require querying the config file.

New approach proposal

Due to these limitations and complications, I want to propose creating a new model called BuildConfig that's used as a foreign key from the Build model:

class Build(models.Model):
    # ...
    readthedocs_yaml_data = models.ForeignKey(
        "BuildConfig"
        null=True,
        blank=True,
    )
    # ...

Then the new model would be:

class BuildConfig(TimeStampedModel):
    # ...
    data = models.JSONField(unique=True)
    # ...

Benefits

  • the "avoid duplicated" data is solved automatically by the database at a table level among all the projects, instead of per-project (e.g. BuildConfig.objects.get_or_create(data=...))
  • the previous point will reduce the size of the table considerably, since there won't be any duplicated config
  • getting the YAML for a particular build is going to be pretty fast since we have direct access to it
  • we can reduce the number of queries performed by using .select_related("readthedocs_yaml_data")
  • answer interesting/complex questions pretty easily using Django ORM (e.g. BuildConfig.objects.filter(data__build__os="ubuntu-22.04").projects.count() to answer my previous example question)
  • remove the requirement of a helper @property to get the YAML for a particular build
  • remove the requirement for a @config.setter helper
  • remove the requirement of having extra logic at Build.save
  • remove the need of using self._config_changed
  • remove having two different meanings for the same field
  • allows us to quickly show badges on "Build details page" as Anthony suggested for ext-theme

Summarizing, this approach keeps the same features but simplifies the application and modelling and gives us more features to analyze in an easier way platform usage and make better decisions based on them.

Rollout proposal

  1. add a Build.readthedocs_yaml_data field and create the new BuildConfig model
  2. start saving both fields Build.readthedocs_yaml_data and Build._config while doing the migration
  3. make a data migration to convert Build._config into BuildConfig and link Build.readthedocs_yaml_data to them
  4. make the application to use the new Build.readthedocs_yaml_data field
  5. remove the old code that uses Build._config field
@humitos humitos added Feature New feature Needed: design decision A core team decision is required labels Jul 31, 2023
@humitos
Copy link
Member Author

humitos commented Aug 3, 2023

Another important benefit that I'm seeing here about making querying our database easier: while working on the deprecation of config file and build.image I had to make multiple queries and iterate over all the builds returned to get the real value of Build._config

# Only check for the default version because if the project is using tags
# they won't be able to update those and we will send them emails forever.
# We can update this query if we consider later.
version = (
project.versions.filter(slug=project.default_version).only("id").first()
)
if version:
# Use a fixed date here to avoid changing the date on each run
years_ago = timezone.datetime(2022, 6, 1)
build = (
version.builds.filter(success=True, date__gt=years_ago)
.only("_config")
.order_by("-date")
.first()
)
if build and build.deprecated_config_used():
projects.add(project.slug)

Instead, we would be able to do all that in just one query:

Project.objects.filter(
  builds__version__slug=F("default_version"),
  builds__date__gt=timezone.now() - timezone.timedelta(days=365),
  builds__success=True,
  builds___config__build__image="readthedocs/build:latest",
)
.distinct()

That query will give me "projects building with build.image: readthedocs/build:latest on their default version with a successful build in the last year", which is a lot easier to read, maintain and communicate to the rest of the team.

Note that with the current structure, we won't get the accurate data because those builds where the config file is identical to their previous build will have _config defined as {"__config": 12345678}"

@ericholscher
Copy link
Member

ericholscher commented Aug 8, 2023

Yeah, it seems like our implementation is just a hacky way of doing a m2m, without any of the nice benefits. This seems like an obvious thing to move forward on, I don't really see a downside. I think the biggest issue will be performance on the query, since we have to do a join, but should be fine and similar to our existing approach.

@humitos
Copy link
Member Author

humitos commented Aug 9, 2023

I think the biggest issue will be performance on the query, since we have to do a join, but should be fine and similar to our existing approach.

This is true. The performance will be similar to the telemetry database which is not super fast, but we could create some basic indexes based on JSON fields if we know what are the queries we will be executing more often. We have some examples of these indexes on the telemetry database that we could eventually use as a starting point.

@ericholscher
Copy link
Member

Plan: Move forward with the refactor, store both the raw user config, and the rendered config. The rendered config is the most important, because it captures the defaults that were defined at the time of the build, which can change over time.

@humitos humitos added Accepted Accepted issue on our roadmap and removed Needed: design decision A core team decision is required labels Aug 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accepted Accepted issue on our roadmap Feature New feature
Projects
Status: Planned
Development

No branches or pull requests

2 participants