This is a great investigation @wjddn279! It's great to see my "hypothesis" confirmed as the cause of the unexpected behaviour. @ashb @kaxil @amoghrajesh are likely those who should be looking at this, and we should agree on an approach. I also love the idea of actually documenting memory behaviour somewhere. My thoughts on your discussion points:
I think we should be able to use it. The gc.freeze() documentation recommends this pattern for forking, and gc.freeze() has been available since Python 3.7, so I don't see why we could not use it. We could also check other usages of fork() in our code and see whether it might help with memory usage there as well. DagFileProcessor might benefit a bit from it, though disabling and freezing gc continuously there might have more side effects. I am quite sure, however, that the gc.freeze approach for the Task SDK and task execution could be very helpful in managing memory duplication for the forked process that actually executes the task. Probably worth testing the effect. https://docs.python.org/3/library/gc.html#gc.freeze
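Something like this minimal sketch of the documented pattern (run_task() is a hypothetical stand-in for whatever the forked child executes):

```python
import gc
import os

# Pattern recommended by the gc docs for fork-based workers: disable
# collection early in the parent, freeze right before fork(), and
# re-enable collection early in the child.
gc.disable()   # avoid punching freed "holes" into parent memory pages
# ... heavy parent initialisation (imports, config, DAG parsing, ...) ...
gc.freeze()    # move every tracked object into the permanent generation

pid = os.fork()
if pid == 0:
    # Child: frozen objects are excluded from collection, so the GC never
    # writes to their headers and the shared pages stay copy-on-write clean.
    gc.enable()
    run_task()     # hypothetical task-execution entry point
    os._exit(0)
else:
    # Parent: release objects from the permanent generation and resume GC.
    gc.unfreeze()
    gc.enable()
    os.waitpid(pid, 0)
```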
I agree that if we manage memory in forks more efficiently with gc.freeze() than with the current lazy loading, that might indeed be a better approach.
Yeah. I remember refactoring it in Airflow 2 and trying to get rid of the Executor class from it, because essentially Job is a database model and keeping the executor as a field there never felt right. Back then (1.5 years ago or so) I succeeded only partially, but a lot of the code in this area has changed since for Airflow 3, so refactoring it out seems like a good idea.
With the above work, the COW issue has been resolved, and we no longer observe cases of sudden large memory increases. However, I did notice a gradual increase in heap memory per worker. In a high-load environment where 100 DAGs with 5 tasks each run every minute, memory increased by about 20MB over 12 hours. Using memray, I found that this phenomenon was related to worker log file creation. A memory leak can be observed in the section where worker log files are generated. This issue occurs identically in the current version 3.1.2, confirming it's not related to gc.freeze. However, if log file creation doesn't happen frequently, it's likely not a significant problem. I think it would be good to resolve this COW issue first, then have discussions to address this separate issue.
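For anyone who wants to take a similar capture, a minimal sketch using memray's Tracker API (worker_loop is a hypothetical stand-in for the worker body; this is not necessarily how the profile above was produced):

```python
import memray

# Wrap the suspected code path in a memray Tracker and inspect the capture
# afterwards (e.g. `memray flamegraph worker.bin`) to see which allocations
# from the log-file setup are never freed.
with memray.Tracker("worker.bin"):
    worker_loop()   # hypothetical stand-in for the LocalExecutor worker body
```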
Very good point. I was looking at that issue earlier, and it does look like gc.freeze() should help there without any MySQL-specific solution and without adding unnecessary complexity to connection handling.
This is a great catch, a fantastic find. Yes, what happens in that case is that we create a new logger every time we start a new task, and my guess is that the loggers are never deleted and hold on to all the related resources. It's the first time I've looked at the way we use structlog, but I think for performance reasons we use "cache_logger_on_first_use", and that holds the loggers we create in memory, held by a global cache dictionary. For whatever reason (likely configuration: a different file name is used for each task), each such logger for a different task is a new cached logger that is only reused from the cache while the task is running and then never deleted after the task completes. It needs a bit deeper knowledge of how structlog is used in the LocalExecutor, though. Thanks for being so diligent! @ashb @amoghrajesh -> WDYT? I think those two finds are quite actionable and we should be able to apply some memory optimisations with them.
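If that theory is right, the leak would have roughly this shape. A purely hypothetical illustration (the cache dict and get_task_logger are made up for the example; this is not our actual logging code):

```python
import structlog

structlog.configure(cache_logger_on_first_use=True)

# Hypothetical illustration of the suspected pattern: one logger is created
# (and cached) per task log file, and no entry is ever evicted, so the cache
# grows by one logger plus its bound resources for every task that ever ran.
_logger_cache: dict[str, object] = {}

def get_task_logger(log_path: str):
    if log_path not in _logger_cache:
        _logger_cache[log_path] = structlog.get_logger().bind(log_file=log_path)
    return _logger_cache[log_path]
```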
I investigated the final part of this issue. kaxil's work resolved the first two problems in version 3.1.2.
However, LocalExecutor still shows a continuous memory increase. Using smem, I observed worker processes starting at 20-30MB grow to over 100MB after 1-2 hours: in one case memory rose steadily to ~50MB, and in another it suddenly jumped to ~100MB.
As Jarek mentioned, I confirmed this is due to COW from the parent process by examining the worker processes' pmap output, which showed increased PSS for memory addresses shared with the parent process.
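For reference, PSS (the same signal smem reports) can be scripted straight from /proc; a small Linux-only helper, assuming /proc/&lt;pid&gt;/smaps_rollup is available:

```python
# A COW fault shows up as PSS growth on address ranges that pmap shows
# as shared with the parent process.
def read_pss_kb(pid: int) -> int:
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            if line.startswith("Pss:"):
                return int(line.split()[1])   # reported in kB
    raise RuntimeError(f"no Pss line found for pid {pid}")
```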
Initially I tried minimizing pre-fork memory allocation, but this was ineffective since initialization loads heavy libraries/objects (sqlalchemy, fastapi, LazyDeserializedDAG, etc.) consuming at least 150MB.
I found an article describing how gc.freeze prevents COW by moving objects to a permanent generation, and I confirmed that this prevents COW in our workers. (Benchmark results will be attached with the PR.)
Before sharing the solution code, I'd like to discuss:
Are there side effects of gc.freeze?
The approach is simple and works well for us, but I'd like others' opinions on potential issues. It would only apply to LocalExecutor, and by doing gc.freeze -> worker fork -> gc.unfreeze (to release objects from the permanent generation afterwards), it prevents COW without affecting scheduler operations.
Should we maintain LocalExecutor's current worker creation method?
Current lazy loading would require repeating gc.freeze/unfreeze cycles, potentially affecting scheduler performance. Pre-creating workers at parallelism level (like v2.x) would be more efficient, as in the sketch after the next paragraph. Of course, worker processes might die and need to be recreated irregularly, but this would occur much less frequently than with the current method.
Alternatively, to maintain the existing lazy loading approach, we could fork workers from a gc.freeze'd snapshot process (scheduler -> fork (gc-frozen snapshot) -> worker), but this adds complexity.
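A minimal sketch of the pre-created pool variant, assuming the fork start method and a hypothetical worker_loop entry point:

```python
import gc
import multiprocessing as mp

# Sketch (assumed names): create all workers up front at parallelism level
# behind a single freeze/unfreeze cycle, instead of paying a
# gc.freeze()/gc.unfreeze() round-trip on every lazy fork.
def start_worker_pool(parallelism: int, task_queue: mp.SimpleQueue) -> list[mp.Process]:
    gc.disable()
    gc.freeze()
    workers = [
        mp.Process(target=worker_loop, args=(task_queue,))  # worker_loop is assumed
        for _ in range(parallelism)
    ]
    for w in workers:
        w.start()   # fork happens here, while the parent heap is frozen
    gc.unfreeze()
    gc.enable()
    return workers
```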
Should the Job class be coupled with the executor?
The executor is currently loaded and defined as a field in the Job class. The only class that uses this is the scheduler. It would be more intuitive to explicitly separate the executor loading and pass it as an argument to SchedulerJobRunner. This would also prevent unnecessary memory forking in the LocalExecutor case (even if the COW issue is resolved).
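Roughly this shape (ExecutorLoader.get_default_executor() as the loading mechanism is my assumption, and the executor= argument is the proposed signature, not the current one):

```python
from airflow.executors.executor_loader import ExecutorLoader
from airflow.jobs.job import Job
from airflow.jobs.scheduler_job_runner import SchedulerJobRunner

# Proposed decoupling: load the executor explicitly at the call site and
# hand it to the runner, instead of the Job DB model instantiating and
# holding it as a field. executor= is the proposed, not current, signature.
executor = ExecutorLoader.get_default_executor()
runner = SchedulerJobRunner(job=Job(), executor=executor)
```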
PS. I'd like to document memory management insights from this investigation (e.g., which objects are heavy and should be lazy-loaded). Is there existing documentation for this?