Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean arttifacts before and after builds #5409

Closed

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Mar 6, 2019

We were doing this for some builders like html and pdf, but not for all ot them.
We are going to save some space after this.

This doesn't affect the builders so much, due that we use several builders, not just one. So depending of the load balancer a user could be using an empty or new build.

A next step should be trying to remove the whole project after building, but this is first step to test how it goes.

@@ -845,9 +845,11 @@ def build_docs_html(self):
if self.build_force:
html_builder.force()
html_builder.append_conf()
html_builder.clean()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should refactor build_docs_html to use build_docs_class, but not so trivial code needs to be moved around, so keeping simple this for now. I'm creating a new issue to refactor this later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stsewd stsewd requested a review from a team March 6, 2019 22:37
Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

It does not seem to save too much space, but considering that we were cleaning before building, it doesn't make sense to keep that useless data until the next build.

A next step should be trying to remove the whole project after building, but this is first step to test how it goes.

If we go in this direction, this should be configurable since we can't do this in the corporate site.

readthedocs/doc_builder/backends/sphinx.py Show resolved Hide resolved
Copy link
Contributor

@agjohnson agjohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think it might be easier that this approach. I believe we can avoid mounting a volume into docker and it will be the same effect as explicitly cleaning the files -- also, it treats storage as truly ephemeral if we don't mount a volume into docker.

Perhaps under feature flag as well, we probably want to test this approach more

Also also, it can't affect our commercial hosting, where some repos are very large.

@stsewd
Copy link
Member Author

stsewd commented Mar 11, 2019

I believe we can avoid mounting a volume into docker and it will be the same effect as explicitly cleaning the files

We need that to do the clone step, at least you mean generating the artifacts inside the container, which I believe will complicate the process of copying the artifacts to the servers, also, the docker container will be running for more time, and maybe it afffects the performance.

https://docs.docker.com/storage/

  • The data doesn’t persist when that container no longer exists, and it can be difficult to get the data out of the container if another process needs it.
  • A container’s writable layer is tightly coupled to the host machine where the container is running. You can’t easily move the data somewhere else.
  • Writing into a container’s writable layer requires a storage driver to manage the filesystem. The storage driver provides a union filesystem, using the Linux kernel. This extra abstraction reduces performance as compared to using data volumes, which write directly to the host filesystem.

@stsewd
Copy link
Member Author

stsewd commented Mar 11, 2019

Also also, it can't affect our commercial hosting, where some repos are very large.

Do you mean with this approach? I don't think so, we already clean the builders for html and pdf (the main/bigger ones). There isn't much to lose there.

We'll need to add a feature flag in the case we want to remove the whole repo after each build.

@agjohnson
Copy link
Contributor

Ah this is true, we still need the volume mount so the build servers can sync artifacts. I suppose we could alter the container creation to capture only the build output in a volume if we wanted. What opinions do you have on that? Is it worth doing that on top of this PR?

I think the work here makes sense on it's own, if we get more aggressive we can consider a feature flag that changes the volume mount to just an output volume only.

You bring up a good point about storage vs build time though. If we move too far away from using storage, we're just going to have longer build times, which means more build servers, and the savings from less disk space will slowly be lost.

Archiving to blob storage, if performant enough, is probably the most balanced approach that we'll strike. This will greatly depend on the speed of VM <-> blob storage transfer of course.

@stsewd
Copy link
Member Author

stsewd commented Mar 11, 2019

I suppose we could alter the container creation to capture only the build output in a volume if we wanted. What opinions do you have on that? Is it worth doing that on top of this PR?

For me looks like the same, basically we are removing that data from inside the container, mounting another volume we still need to remove the data from outside. Or maybe do you mean only generate the build inside the container and the copy that to a volume? Because we do some kind of copy first of the data from the _build dir to another directory. But still, we will need to change the _build dir location (not a volume), and then do the copy to a volume. So, at the end we'll still have to do a copy operation, which can reduce performance because it's inside the container.

So, the next step that I wanted to take is refactor our builders to not do that extra copy from the _build directory to the other location, but just put the output to the other location. That's a little more bigger than this (I'm trying to refactor our builders in little steps #5418)

With that I don't see any harm in this PR, and we can keep doing the things above in a later step ^

@humitos
Copy link
Member

humitos commented Mar 12, 2019

In case we want to change the volumes mounted on the container, we could mount the ones that we need using one per purpose (instead of one:

  1. git clone
  2. output generated by sphinx (html, pdf, etc)

The rest of the operation can be done in any other directory inside the container and will be completely removed after the container gets killed.

This approach, also gives us the flexibility of mounting 1) only for corporate site for example --where we need to keep the repos longer.

@stsewd
Copy link
Member Author

stsewd commented Mar 12, 2019

If we do #5409 (comment), we still need to remove the generated file, which this PR is about :)

@ericholscher
Copy link
Member

We don't want to remove artifacts, we want to remove all of the build data (checkout, environment, artifacts) In fact, removing the artifacts is probably the last thing we want to do, since they need to be synced from the webs, and that should be done after syncing, not in the build logic.

@stsewd
Copy link
Member Author

stsewd commented Mar 14, 2019

We copy those to another directory before doing the cleaning, so they are synced anyway.

@ericholscher
Copy link
Member

ericholscher commented Mar 14, 2019

This feels like wasted effort if we're going to delete all the files from the server after the build. We should work on deleting all the checkout and environment, not trying to optimize small things that won't matter after doing that work.

@stsewd
Copy link
Member Author

stsewd commented Mar 28, 2019

ok, I'm closing this PR, I think we are going for another path now.

@stsewd stsewd closed this Mar 28, 2019
@stsewd stsewd deleted the remove-arttifacts-after-build branch March 28, 2019 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants