Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make site generation use multiprocessing #1709

Draft
wants to merge 3 commits into
base: 0.4.x
Choose a base branch
from

Conversation

bartfeenstra
Copy link
Owner

@bartfeenstra bartfeenstra commented Jul 14, 2024

This fixes #1516

To do

  • Add betty.typing.processsafe() which implies betty.typing.threadsafe()
  • Allow reduced (pickleable) values to implement a factory interface so they can he rehydrated after unpickling
  • Do all pickling in one go so we don't risk pickling the same things twice
  • Remove caches from job contexts (because memory caches cannot be multiprocess-safe), but introduce a new lightweight alternative, some kind of registry, just to do bookkeeping? This is also partly why performance right now is sub-optimal, as the image filters use this duplicates memory cache and therefore end up duplicating work between processes.
  • ACTUALLY caches were designed so cache item values MUST be pickleable.... This should be documented.
  • We should allow extensions to be 'pickled' as well, in order to share services between processes. For the Wikipedia extension, this would be its Wikipedia API rate limiter. Document this on the Extension class.
  • Allow KeyboardInterrupt to gracefully shutdown the executor
  • Should Jinja2 filters use the job pool?
  • do we still need App's process pool then?
  • Can we extract the pool functionality into a separate API/submodule? Even if we don't reuse this now, splitting the pool code from site generation reduces complexity and makes for easier testing. Update: we've now run into this in practice, where we want to mock something (testing third-party entity types), but obviously those mocks don't carry over to the processes in the pool. Separating the pool from the generation API means that we can test the pool fully (not needing any Project or Ancestry related mocking), site generation fully (by running the relevant tasks in the main process), and integration (by fully testing integration points and nothing more, because this is the part where conflicts occur)
  • if we embed the pool in the job context, the context can do bookkeeping (prevent duplicates) for every task if needed (opt-in?)

@bartfeenstra bartfeenstra added enhancement New feature or request python Pull requests that update Python code performance labels Jul 14, 2024
@bartfeenstra bartfeenstra added this to the 0.4.x milestone Jul 14, 2024
betty/cache/_base.py Outdated Show resolved Hide resolved
betty/generate/pool.py Outdated Show resolved Hide resolved
@@ -81,6 +82,13 @@ async def acquire(self, *, wait: bool = True) -> bool:
def release(self) -> None:
self._lock.release()

@classmethod
def multiprocessing(cls) -> Self:
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test this

@bartfeenstra bartfeenstra force-pushed the generate-multiprocessing branch 4 times, most recently from ce27783 to cb25ac4 Compare July 17, 2024 20:01
@bartfeenstra bartfeenstra added the BC break Breaks backwards compatibility with existing integrations label Jul 25, 2024
@bartfeenstra bartfeenstra force-pushed the generate-multiprocessing branch 2 times, most recently from 3a62124 to 3d18899 Compare July 29, 2024 16:40
@bartfeenstra bartfeenstra force-pushed the generate-multiprocessing branch 2 times, most recently from 557db42 to ffc8d13 Compare August 13, 2024 14:25

class _GenerationProcessPool:
"""
Set up a worker process, before the worker starts performing tasks.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect doc?

@bartfeenstra bartfeenstra force-pushed the generate-multiprocessing branch 2 times, most recently from 5c0b866 to b66f670 Compare August 13, 2024 22:27
betty/__init__.py Outdated Show resolved Hide resolved
betty/_patch.py Outdated Show resolved Hide resolved
betty/cache/_base.py Outdated Show resolved Hide resolved
betty/concurrent.py Outdated Show resolved Hide resolved
@bartfeenstra bartfeenstra force-pushed the generate-multiprocessing branch 2 times, most recently from 7e4347b to a55f201 Compare August 26, 2024 10:31
@bartfeenstra bartfeenstra force-pushed the generate-multiprocessing branch 5 times, most recently from 88f3376 to 7edea59 Compare August 26, 2024 12:19
@bartfeenstra bartfeenstra modified the milestones: 0.4.x, 0.4.0 Sep 1, 2024
@@ -41,9 +41,6 @@ class EntityCollection(Generic[_TargetT], ABC):

__slots__ = ()

def __init__(self):
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean this and similar initializers in a separate PR

@bartfeenstra bartfeenstra modified the milestones: 0.4.0, 0.5.x Oct 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BC break Breaks backwards compatibility with existing integrations enhancement New feature or request performance python Pull requests that update Python code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove entity association finalization
1 participant