Skip to content

Conversation

@olegkachur-e
Copy link
Contributor

Extend cleanup logic for the PythonVirtualenvOperator, as previous implementation cleans up the venv directory itself, but kept the pycache trace.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@VladaZakharova
Copy link
Contributor

Hi @potiuk !
Can you please check the changes here or ping someone who can check? :)
thanks

@olegkachur-e olegkachur-e force-pushed the cleanup_pycache_pythonvitualenvoperator branch from 3ca211d to f9bbb4e Compare July 17, 2025 16:06
@potiuk
Copy link
Member

potiuk commented Jul 19, 2025

There is a potential problem I see here. On Celery worker when you have multiple tasks in parallel, potentially the same __pycache__ folder is used by parallel workers running. I am not sure if we can do what you want to do without getting into some race conditions.

Also I am not sure what problem it solves to be honest - the __pycache__ is generally very well rebuilt when needed. I can think of one case where it can be problematic - when you are using same __pycache__ folders for different Python versions, you will have some version code errors.

But there, I'd say a better solution might be to set PYTHONDONOTWRITEBYTECODE before launching interpreter. Yes it might be slower a bit because parsing and bytecode generation happens every time, but in vast majority of cases when tasks runs for seconds, this will be negligible overhead - unless you have vast amount of code to parse.

Could you please elaborate a bit more what exactly problem we are trying to solve here?

@olegkachur-e
Copy link
Contributor Author

@potiuk Thank you for the attention and detailed review!

Could you please elaborate a bit more what exactly problem we are trying to solve here?

The exact problem - there is as a side effect of running the PythonVirtualenvOperator- some files left, that are not cleaned-up.
My STR in breeze was to set the PYTHONPYCACHEPREFIX="__pycache__" and run operator with simple requirements.
After the run there was a directory left as __pycache__/tmp/<venv_temp_name>, so I suggest cleaning it up.
I think it should reproduce without PYTHONPYCACHEPREFIX as well, it's just a bit harder to trace.

Also I am not sure what problem it solves to be honest - the __pycache__ is generally very well rebuilt when needed. I can think of one case where it can be problematic - when you are using same __pycache__ folders for different Python versions, you will have some version code errors.

With behavior mentioned in the STR above, I see an issue with many operator runs, it may eventually pollute file system, exhaust resources on cloud providers environments.

As for __pycache__ rebuilt when needed - I think here is a bit of a corner case here: venv created, executed in subprocess and deleted, so there is "nobody left" to rebuilt or clean it up (to my best understanding).

But there, I'd say a better solution might be to set PYTHONDONOTWRITEBYTECODE before launching interpreter. Yes it might be slower a bit because parsing and bytecode generation happens every time, but in vast majority of cases when tasks runs for seconds, this will be negligible overhead - unless you have vast amount of code to parse.

I couldn't agree more! There was also my first intention.
Unfortunately with my experiment runs it doesn't completely fix the problem: there was a reproduction with "no uv" execution, so I decided to make an attempt to clean-up on each run, matching exact temp venv directory name that being used.

There is a potential problem I see here. On Celery worker when you have multiple tasks in parallel, potentially the same __pycache__ folder is used by parallel workers running. I am not sure if we can do what you want to do without getting into some race conditions.

So you mean with TemporaryDirectory(prefix="venv") as tmp_dir: potentially could result from the same name on different workers?
Then in the worst case we will fail to delete again so have a log warning message or are there any other risks?

@olegkachur-e olegkachur-e force-pushed the cleanup_pycache_pythonvitualenvoperator branch 2 times, most recently from bd62ed8 to f7e1360 Compare July 21, 2025 16:12
@olegkachur-e olegkachur-e requested a review from potiuk July 21, 2025 16:28
@olegkachur-e olegkachur-e force-pushed the cleanup_pycache_pythonvitualenvoperator branch from f7e1360 to 946456b Compare July 21, 2025 21:39
@VladaZakharova
Copy link
Contributor

hi there!
thank you for checking this change Jarek :)
do you think there is something we can improve in this PR?
@potiuk

@olegkachur-e olegkachur-e force-pushed the cleanup_pycache_pythonvitualenvoperator branch 2 times, most recently from e44956d to 124960b Compare July 23, 2025 17:03
Extend cleanup logic for the PythonVirtualenvOperator,
as previous implementation cleans up the venv directory itself,
but kept the pycache trace.
@olegkachur-e olegkachur-e force-pushed the cleanup_pycache_pythonvitualenvoperator branch from 124960b to ce5d005 Compare July 24, 2025 12:03
@potiuk
Copy link
Member

potiuk commented Jul 24, 2025

So you mean with TemporaryDirectory(prefix="venv") as tmp_dir: potentially could result from the same name on different workers?

Yeah. with temp dir there is no risk.

@potiuk potiuk merged commit 6f1243a into apache:main Jul 24, 2025
102 checks passed
@olegkachur-e olegkachur-e deleted the cleanup_pycache_pythonvitualenvoperator branch July 25, 2025 09:24

with TemporaryDirectory(prefix="venv") as tmp_dir:
tmp_path = Path(tmp_dir)
tmp_dir, temp_venv_dir = tmp_path.relative_to(tmp_path.anchor).parts
Copy link
Member

@gopidesupavan gopidesupavan Aug 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not guarantee always two values :)

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added fix here #54214

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not guarantee always two values :)

image

Oh, didn't expect that 😢 Thank you for the test and fix!

Copy link
Contributor Author

@olegkachur-e olegkachur-e Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, can you please share a way to reproduce it, as for with TemporaryDirectory(prefix="venv") as tmp_dir: I was expecting a path like tmp_dir_path/venv<random_str>, as both parts are 'static'. What is a way to add additional nested dirs there? @gopidesupavan

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you can do it simply in local. your assumption will work inside container. but it may not work if it installed standalone in local.

you can follow these steps:

uv venv --python 3.10
source .venv/bin/activate
mkdir airflow_home
export AIRFLOW_HOME=<pwd>/airflow_home
uv pip install -U apache-airflow==3.0.4rc2 --pre --constraints https://github.com/apache/airflow/blob/constraints-3.0.4rc2/constraints-3.10.txt
airflow standalone

pwd -> repalce full path
And you can run an example with PythonVirtualenvOperator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants