Build: simplify `_config` attribute data modelling #10586

humitos · 2023-07-31T16:16:28Z

We are currently using a models.JSONField for the Build._config attribute. This field can be used in two different ways:

store the whole YAML configuration file
save a build ID where the full/whole YAML configuration file is stored

This is basically to avoid duplicating the data on our database. However, it only avoids duplicated config on the same projects and only for consecutive builds.

I found this approach pretty confusing when exploring Build objects because we have to do extra operations to find out the real YAML used for that build (see for example the config @property at

readthedocs.org/readthedocs/builds/models.py

Lines 313 to 331 in 12cbb3d

    
               @property 
        
               def config(self): 
        
                   """ 
        
                   Proxy to the configuration of the build. 
        
                   :returns: The configuration used in the last successful build. 
        
                   :rtype: dict 
        
                   """ 
        
                   last_build = ( 
        
                       self.builds(manager=INTERNAL).filter( 
        
                           state=BUILD_STATE_FINISHED, 
        
                           success=True, 
        
                       ).order_by('-date') 
        
                       .only('_config') 
        
                       .first() 
        
                   ) 
        
                   if last_build: 
        
                       return last_build.config 
        
                   return None

)

Also, it makes hard to answer questions like "What are the projects using build.os: ubuntu-22.04?" in an easy way and many other similar questions that require querying the config file.

New approach proposal

Due to these limitations and complications, I want to propose creating a new model called BuildConfig that's used as a foreign key from the Build model:

class Build(models.Model):
    # ...
    readthedocs_yaml_data = models.ForeignKey(
        "BuildConfig"
        null=True,
        blank=True,
    )
    # ...

Then the new model would be:

class BuildConfig(TimeStampedModel):
    # ...
    data = models.JSONField(unique=True)
    # ...

Benefits

the "avoid duplicated" data is solved automatically by the database at a table level among all the projects, instead of per-project (e.g. BuildConfig.objects.get_or_create(data=...))
the previous point will reduce the size of the table considerably, since there won't be any duplicated config
getting the YAML for a particular build is going to be pretty fast since we have direct access to it
we can reduce the number of queries performed by using .select_related("readthedocs_yaml_data")
answer interesting/complex questions pretty easily using Django ORM (e.g. BuildConfig.objects.filter(data__build__os="ubuntu-22.04").projects.count() to answer my previous example question)
remove the requirement of a helper @property to get the YAML for a particular build
remove the requirement for a @config.setter helper
remove the requirement of having extra logic at Build.save
remove the need of using self._config_changed
remove having two different meanings for the same field
allows us to quickly show badges on "Build details page" as Anthony suggested for ext-theme

Summarizing, this approach keeps the same features but simplifies the application and modelling and gives us more features to analyze in an easier way platform usage and make better decisions based on them.

Rollout proposal

add a Build.readthedocs_yaml_data field and create the new BuildConfig model
start saving both fields Build.readthedocs_yaml_data and Build._config while doing the migration
make a data migration to convert Build._config into BuildConfig and link Build.readthedocs_yaml_data to them
make the application to use the new Build.readthedocs_yaml_data field
remove the old code that uses Build._config field

The text was updated successfully, but these errors were encountered:

humitos · 2023-08-03T12:19:33Z

Another important benefit that I'm seeing here about making querying our database easier: while working on the deprecation of config file and build.image I had to make multiple queries and iterate over all the builds returned to get the real value of Build._config

readthedocs.org/readthedocs/projects/tasks/utils.py

Lines 235 to 251 in 1b74999

    
           # Only check for the default version because if the project is using tags 
        
           # they won't be able to update those and we will send them emails forever. 
        
           # We can update this query if we consider later. 
        
           version = ( 
        
               project.versions.filter(slug=project.default_version).only("id").first() 
        
           ) 
        
           if version: 
        
               # Use a fixed date here to avoid changing the date on each run 
        
               years_ago = timezone.datetime(2022, 6, 1) 
        
               build = ( 
        
                   version.builds.filter(success=True, date__gt=years_ago) 
        
                   .only("_config") 
        
                   .order_by("-date") 
        
                   .first() 
        
               ) 
        
               if build and build.deprecated_config_used(): 
        
                   projects.add(project.slug)

Instead, we would be able to do all that in just one query:

Project.objects.filter(
  builds__version__slug=F("default_version"),
  builds__date__gt=timezone.now() - timezone.timedelta(days=365),
  builds__success=True,
  builds___config__build__image="readthedocs/build:latest",
)
.distinct()

That query will give me "projects building with build.image: readthedocs/build:latest on their default version with a successful build in the last year", which is a lot easier to read, maintain and communicate to the rest of the team.

Note that with the current structure, we won't get the accurate data because those builds where the config file is identical to their previous build will have _config defined as {"__config": 12345678}"

ericholscher · 2023-08-08T22:43:56Z

Yeah, it seems like our implementation is just a hacky way of doing a m2m, without any of the nice benefits. This seems like an obvious thing to move forward on, I don't really see a downside. I think the biggest issue will be performance on the query, since we have to do a join, but should be fine and similar to our existing approach.

humitos · 2023-08-09T07:49:24Z

I think the biggest issue will be performance on the query, since we have to do a join, but should be fine and similar to our existing approach.

This is true. The performance will be similar to the telemetry database which is not super fast, but we could create some basic indexes based on JSON fields if we know what are the queries we will be executing more often. We have some examples of these indexes on the telemetry database that we could eventually use as a starting point.

ericholscher · 2023-08-09T16:36:22Z

Plan: Move forward with the refactor, store both the raw user config, and the rendered config. The rendered config is the most important, because it captures the defaults that were defined at the time of the build, which can change over time.

humitos added Feature New feature Needed: design decision A core team decision is required labels Jul 31, 2023

humitos added Accepted Accepted issue on our roadmap and removed Needed: design decision A core team decision is required labels Aug 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Build: simplify `_config` attribute data modelling #10586

Build: simplify `_config` attribute data modelling #10586

humitos commented Jul 31, 2023 •

edited

Loading

humitos commented Aug 3, 2023

ericholscher commented Aug 8, 2023 •

edited

Loading

humitos commented Aug 9, 2023

ericholscher commented Aug 9, 2023

Build: simplify _config attribute data modelling #10586

Build: simplify _config attribute data modelling #10586

Comments

humitos commented Jul 31, 2023 • edited Loading

New approach proposal

Benefits

Rollout proposal

humitos commented Aug 3, 2023

ericholscher commented Aug 8, 2023 • edited Loading

humitos commented Aug 9, 2023

ericholscher commented Aug 9, 2023

Build: simplify `_config` attribute data modelling #10586

Build: simplify `_config` attribute data modelling #10586

humitos commented Jul 31, 2023 •

edited

Loading

ericholscher commented Aug 8, 2023 •

edited

Loading