
Conversation

@Askir (Contributor) commented Jun 10, 2025

This doesn't make use of multiple CPU cores, but it does allow other vectorizers to continue running concurrently. If we also push down e.g. a processing time limit, it would let us stop somewhat gracefully (although you can't really kill a running future).

@Askir Askir temporarily deployed to internal-contributors June 10, 2025 15:25 — with GitHub Actions Inactive
@Askir Askir marked this pull request as ready for review June 10, 2025 16:08
@Askir Askir requested a review from a team as a code owner June 10, 2025 16:08
@smoya smoya requested a review from Copilot June 11, 2025 09:52
Copilot AI left a comment
Pull Request Overview

This PR offloads document parsing to a thread pool to improve concurrency across vectorizers by making parsing methods asynchronous and running CPU-bound work off the event loop.

  • Changed the parsing.parse interface to async and awaited it in the embedding generator.
  • Introduced a global ThreadPoolExecutor and updated parse_doc implementations to use run_in_executor.
  • Deferred parsing imports and wrapped blocking logic in synchronous helper methods (a sketch of the pattern follows below).
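
A minimal sketch of the pattern the overview describes, assuming a module-level pool and a synchronous helper; the class and helper names here are illustrative, not necessarily the PR's:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any

# Shared module-level pool (the _PARSING_EXECUTOR name appears in the diff below).
_PARSING_EXECUTOR = ThreadPoolExecutor(max_workers=4, thread_name_prefix="parsing")

class ParsingAuto:
    def _parse_sync(self, payload: str) -> str:
        # CPU-bound parsing work runs here, off the event loop.
        return payload.strip()

    async def parse(self, _1: dict[str, Any], payload: str) -> str:
        loop = asyncio.get_running_loop()
        # Offload the blocking helper to the pool so other vectorizers
        # keep making progress while this document is parsed.
        return await loop.run_in_executor(_PARSING_EXECUTOR, self._parse_sync, payload)
```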

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
---- | -----------
projects/pgai/pgai/vectorizer/vectorizer.py | Await the now-async parsing.parse call in _generate_embeddings.
projects/pgai/pgai/vectorizer/parsing.py | Converted parse methods to async, added _PARSING_EXECUTOR, and offloaded blocking work to threads.
Comments suppressed due to low confidence (2)

projects/pgai/pgai/vectorizer/parsing.py:20

  • Please add or update unit tests to cover the new async parse methods and validate that blocking logic is correctly executed in the thread pool (a hedged test sketch follows below).
```python
async def parse(self, _1: dict[str, Any], payload: str | LoadedDocument) -> str:
```
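
A hedged sketch of such a test, assuming pytest-asyncio is available; it probes the shared pool rather than a specific parse method:

```python
import asyncio
import threading

import pytest

from pgai.vectorizer.parsing import _PARSING_EXECUTOR

@pytest.mark.asyncio
async def test_work_runs_on_parsing_thread():
    # Work submitted through the shared pool should run on a thread
    # carrying the "parsing" name prefix configured on the executor.
    loop = asyncio.get_running_loop()
    name = await loop.run_in_executor(
        _PARSING_EXECUTOR, lambda: threading.current_thread().name
    )
    assert name.startswith("parsing")
```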

projects/pgai/pgai/vectorizer/vectorizer.py:1191

  • Since parse is now async, confirm all other callers of parsing.parse have been updated to await this method and update any related documentation to reflect this breaking change.
```python
payload = await self.vectorizer.config.parsing.parse(item, payload)
```

```python
from pgai.vectorizer.loading import LoadedDocument

# Thread pool for CPU-intensive parsing operations
_PARSING_EXECUTOR = ThreadPoolExecutor(max_workers=4, thread_name_prefix="parsing")
```
Copilot AI commented Jun 11, 2025

[nitpick] Hardcoding max_workers=4 may not scale across environments; consider making this configurable or using os.cpu_count() to align with available cores.

Suggested change:

```diff
-_PARSING_EXECUTOR = ThreadPoolExecutor(max_workers=4, thread_name_prefix="parsing")
+max_workers = int(os.getenv("PARSING_MAX_WORKERS", os.cpu_count() or 4))
+_PARSING_EXECUTOR = ThreadPoolExecutor(max_workers=max_workers, thread_name_prefix="parsing")
```


@smoya (Contributor) commented Jun 11, 2025

That's actually a good suggestion. Good bot.

@Askir (Contributor, Author) replied:

The os.cpu_count() doesn't make a lot of sense since Python is single-threaded. This just configures how many documents can be parsed in parallel. But I'll make it configurable.

@alejandrodnm (Contributor) commented Jun 18, 2025

If the parsing is CPU-bound, why not use a ProcessPoolExecutor instead?
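
For reference, a minimal sketch of the ProcessPoolExecutor variant being asked about (names are illustrative; the sync helper must be a top-level function because process pools pickle their work items):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

_PARSING_EXECUTOR = ProcessPoolExecutor(max_workers=4)

def _parse_sync(payload: str) -> str:
    # Top-level on purpose: the callable and its arguments are pickled
    # and shipped to a worker process.
    return payload.strip()

async def parse(payload: str) -> str:
    loop = asyncio.get_running_loop()
    # Runs in a separate process, so CPU-bound work can use another core.
    return await loop.run_in_executor(_PARSING_EXECUTOR, _parse_sync, payload)
```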

A contributor commented:

NIT: Python is not single-threaded; it uses a lock (the GIL) to keep threads from executing in parallel. If you have IO-bound tasks (reading files, network requests), the GIL is released, so threading gives concurrency benefits.
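
A small illustration of that point (hypothetical, not from the PR): blocking IO releases the GIL, so the waits overlap even though only one thread executes Python bytecode at a time:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(_: int) -> None:
    # time.sleep releases the GIL, like a blocking read or network call.
    time.sleep(0.5)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(fake_io, range(4)))
# Prints roughly 0.5s rather than 2s, because the four sleeps overlap.
print(f"elapsed: {time.perf_counter() - start:.2f}s")
```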

@Askir (Contributor, Author) replied:

Okay, so more precisely: Python won't utilize more than one core to execute that parsing, so I think the argument still stands. The cloud lambda also runs with only one core, so multiprocessing instead of threading is just additional overhead. I think.

A member replied:

The cloud lambda is configured with 2560 MiB of RAM, which according to this S/O post should correspond to two cores.

@Askir Askir temporarily deployed to internal-contributors June 17, 2025 15:47 — with GitHub Actions Inactive
@Askir Askir force-pushed the jascha/use-threadpool-executor branch from 5e54e3b to a997ed6 Compare June 17, 2025 16:04
@Askir Askir temporarily deployed to internal-contributors June 17, 2025 16:04 — with GitHub Actions Inactive
@Askir Askir requested a review from smoya June 17, 2025 16:13

```python
# Thread pool for CPU-intensive parsing operations
max_workers = int(os.getenv("PARSING_MAX_WORKERS", 4))
_PARSING_EXECUTOR = ThreadPoolExecutor(
    max_workers=max_workers, thread_name_prefix="parsing"
)
```
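
A hedged usage note: because the pool is created at import time, the variable has to be set before pgai.vectorizer.parsing is first imported, e.g.:

```python
import os

# Must happen before the first import of pgai.vectorizer.parsing,
# since _PARSING_EXECUTOR is created when the module is imported.
os.environ["PARSING_MAX_WORKERS"] = "8"

from pgai.vectorizer import parsing  # pool now uses 8 worker threads
```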
@alejandrodnm (Contributor) commented:

I mentioned here, but why not use a ProcessPoolExecutor?
