Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ansible fails with too many open files on Fedora #280

Closed
opoplawski opened this issue Jun 21, 2018 · 10 comments
Closed

ansible fails with too many open files on Fedora #280

opoplawski opened this issue Jun 21, 2018 · 10 comments

Comments

@opoplawski
Copy link
Contributor

python(mitogen): ansible-stat Error Executing CMD:/bin/file -i /etc/conda/condarc.d/nwra.yml Exception:[Errno 24] Too many open files

Not sure where the ultimate fault lies - possibly in ansilble or dnf - but because mitogen produces a long running process unclosed file descriptors build up. I'm seeing lots of:

python  7383 root  259u  a_inode               0,13         0     13160 [eventpoll]
python  7383 root  260r      REG              253,0      1895   1845601 /var/cache/dnf/google-talkplugin-filenames.solvx
python  7383 root  261r      REG              253,0      8538   1843503 /var/cache/dnf/google-talkplugin.solv
python  7383 root  262r      REG              253,0  46645814   1839594 /var/cache/dnf/fedora-filenames.solvx
python  7383 root  263r      REG              253,0  20187911   1856174 /var/cache/dnf/fedora.solv
python  7383 root  264r      REG              253,0    359442   1857900 /var/cache/dnf/rpmfusion-free-updates-filenames.solvx
python  7383 root  265r      REG              253,0    152382   1857265 /var/cache/dnf/rpmfusion-free-updates.solv
python  7383 root  266r      REG              253,0       676   1839631 /var/cache/dnf/adobe-linux-x86_64-filenames.solvx
python  7383 root  267r      REG              253,0      8768   1930812 /var/cache/dnf/adobe-linux-x86_64.solv
python  7383 root  268r      REG              253,0    523778   1839617 /var/cache/dnf/rpmfusion-free-filenames.solvx
python  7383 root  269r      REG              253,0    341320   1857606 /var/cache/dnf/rpmfusion-free.solv
python  7383 root  270r      REG              253,0   1213511   1930814 /var/cache/dnf/nwrpms-filenames.solvx
python  7383 root  271r      REG              253,0     61700   1839452 /var/cache/dnf/nwrpms.solv
python  7383 root  272r      REG              253,0  19680652   1851466 /var/cache/dnf/updates-filenames.solvx
python  7383 root  273r      REG              253,0   5551752   1841845 /var/cache/dnf/updates.solv
python  7383 root  274r      REG              253,0   9018430   1845622 /var/cache/dnf/updates-updateinfo.solvx
python  7383 root  275r      REG              253,0     52275   1860547 /var/cache/dnf/intel-psxe-2018.2-filenames.solvx
python  7383 root  276r      REG              253,0     33255   1857619 /var/cache/dnf/intel-psxe-2018.2.solv
python  7383 root  277r      REG              253,0      2170   1845605 /var/cache/dnf/petersen-pandoc-filenames.solvx
python  7383 root  278r      REG              253,0     11608   1857624 /var/cache/dnf/petersen-pandoc.solv
python  7383 root  279r      REG              253,0      1840   1930821 /var/cache/dnf/google-chrome-filenames.solvx
python  7383 root  280r      REG              253,0      9449   1857853 /var/cache/dnf/google-chrome.solv
python  7383 root  281r      REG              253,0      1536   1930815 /var/cache/dnf/skype-stable-filenames.solvx
python  7383 root  282r      REG              253,0     11185   1846262 /var/cache/dnf/skype-stable.solv

Now to try to figure out a simple reproducer...

@dw
Copy link
Member

dw commented Jun 21, 2018

I guess this is happening on the target machine? If you set "mitogen_task_isolation=fork" I presume it will abate. Just setting up a CentOS VM now. Would love an example task that lead to ^ and e.g. how many times it was executed.

@dw
Copy link
Member

dw commented Jun 21, 2018

It looks like either Ansible's dnf.py isn't calling a finalization method of dnf, or dnf isn't calling a finalization method of the underlying rpm library. DNF doesn't appear to directly manipulate those solvx files, it's (seemingly?) handled only by the RPM library.

@dw
Copy link
Member

dw commented Jun 21, 2018

Ansible dnf.py is definitely missing a call to dnf.Base.close(), it's relying on del to call it, which is wrong to begin with, but easily defeated if /anything/ else holds a reference to the Base instance.

99% sure mitogen_task_isolation=fork on all your dnf tasks will cure this, but that's not a real fix

@opoplawski
Copy link
Contributor Author

Yeah, it happens on the target machine. I suspect something like:

- dnf: name="{{ item }}" state=present
  with_items:
  - A LONG LIST OF ITEMS

Would do the trick to reproduce it. I suspect we're initializing a new dnf state of some kind each time, and getting a new set of the above open files. Target machines are Fedora 27 and 28. Worth reporting to ansible, or do they not care about module cleanup at this point?

@opoplawski
Copy link
Contributor Author

Okay, I added a base.close() in dnf.py before every module.{exit,fail}_json() call and that helps. Still leaking some file descriptors, but not as bad:

python  46996 root   47u  a_inode               0,13         0     13160 [eventpoll]
python  46996 root   48u     unix 0x0000000004b6e5b0       0t0 118520056 type=STREAM
python  46996 root   49u     unix 0x000000002994b310       0t0 118520057 type=STREAM
python  46996 root   50w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   51u     unix 0x000000009ae77b60       0t0 118520058 type=STREAM
python  46996 root   52u     unix 0x00000000045eb707       0t0 118520059 type=STREAM
python  46996 root   53w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   54u     unix 0x0000000085fbe64d       0t0 118520060 type=STREAM
python  46996 root   55u     unix 0x0000000089e7dd5f       0t0 118520061 type=STREAM
python  46996 root   56u  a_inode               0,13         0     13160 [eventpoll]
python  46996 root   57u     unix 0x000000005d9c8f9a       0t0 118512508 type=STREAM
python  46996 root   58u     unix 0x000000003efc2ef9       0t0 118512509 type=STREAM
python  46996 root   59w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   60u     unix 0x00000000aca4149b       0t0 118517224 type=STREAM
python  46996 root   61u     unix 0x00000000a07992d7       0t0 118517225 type=STREAM
python  46996 root   62u  a_inode               0,13         0     13160 [eventpoll]
python  46996 root   63u     unix 0x0000000001e045d6       0t0 118520062 type=STREAM
python  46996 root   64u     unix 0x000000001b59a0f9       0t0 118520063 type=STREAM
python  46996 root   65w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   66w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   67w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   68w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   69w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   70w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   71u  a_inode               0,13         0     13160 [eventpoll]
python  46996 root   72w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   73w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   74w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   75w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   76w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   77w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   78w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   79w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   80w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   81w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   82w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   83w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   84w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   85r      REG              253,0   6406312   1857598 /var/lib/sss/mc/group
python  46996 root   86w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   87w      REG              253,0  12356876   1839618 /var/log/hawkey.log
python  46996 root   88w      REG              253,0  12356876   1839618 /var/log/hawkey.log

lots of modules involved above...

@dw
Copy link
Member

dw commented Jun 21, 2018

hawkey.log looks like it belongs to https://github.com/rpm-software-management/hawkey which is a binding for managing .solv files. Looks like DNF itself might be a little suspect

@opoplawski
Copy link
Contributor Author

Filed ansible/ansible#41810

@dw
Copy link
Member

dw commented Jun 22, 2018

/var/log/hawkey.log is closed by hy_sack_free(), which is called by the hawkey.Sac's tp_dealloc in the Python extension. This means the log file will never be closed until the last reference to the sack is destroyed.

Looking at dnf/base.py we see def reset clearing out the reference to the sack, but clearly it persists elsewhere, and doesn't seem to be due to a reference cycle since forcing a GC doesn't close it.

I'm going to have another pass to find where those references leak, but even if we find them and can work around it, there is no promise they won't sneak back later -- this library's design is somewhat lacking in places.

A real fix to this likely requires API changes in hawkey, meanwhile forking will need to be enabled automatically for existing deployed versions when this module is executed.

@opoplawski
Copy link
Contributor Author

You've dug much deeper than I was willing to. FWIW - I've filed https://bugzilla.redhat.com/show_bug.cgi?id=1594016 against dnf if you want to pass along your findings.

dw added a commit that referenced this issue Jun 22, 2018
It just generates far too much spam, and its final decision is obvious
since a followup load_module() will exist for positive matches.
@dw
Copy link
Member

dw commented Jun 22, 2018

Thanks for all your help with this! As it impacts existing versions of Ansible, the solution I've committed will be needed until the relevant versions of hawkey and DNF disappear (i.e. circa 2028 ;)). It's the first case where always-fork is definitely required, been expecting it for some time.

@dw dw closed this as completed Jun 22, 2018
PatrickCoakley23 pushed a commit to cyara/mitogen that referenced this issue Nov 10, 2023
adding new server configration files for El Salvador
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants