systemd segfault on switch boot #7179
Comments
Issue triage meeting 3/31:
I was not able to reproduce this on another platform. However, I do not think that indicates the problem is platform specific, given how rarely we see it. I managed to compile the systemd symbols, load the core dump, and do a little more debugging. My findings are below. Here is the backtrace from gdb:
The cause of the segfault is that the job being queued does not have a valid unit associated with it. The system attempted to get the manager of the unit. Looking through the code, I cannot find a good reason for this to occur. What is even more troubling is that the job in question has been installed and queued to start (as can be seen by the existence of
Dump of the job...
Dump of the job's unit...
Perhaps someone better versed in systemd could take a look here. Overall, I'm not optimistic about locating the root cause without reproducing it and attaching gdb to a live system (which we have found very difficult to do). My suspicion is a memory issue, due to the invalid return on the job. There are also a few locations in the core dump that show as invalid. Either systemd has a bug where it is overwriting its own memory, or we may have had an error in the hardware causing an overwrite.
Aforementioned errors...
Description
Systemd was observed to crash during boot of a Mellanox MSN2100 switch running SONiC 202012. During boot it reports a segfault, and from that point on no further SONiC services or containers are able to start.
It is worth noting that the segfault was immediately preceded by starting the what-just-happened.timer unit.
Here are the adjacent syslogs...
I have attached the systemd core dump and show techsupport to this issue.
systemd.1616793731.21064.core.gz
sonic_dump_r-bulldog-02_20210329_195332.tar.gz
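For anyone who wants to pull the same kind of context out of their own logs, here is a minimal sketch that returns the lines immediately preceding the first line that mentions a segfault. The function name, the default `needle` string, and the sample lines are all made up for illustration; the sample lines are placeholders and not real syslog output from this switch.

```python
from typing import List


def lines_before_match(lines: List[str], needle: str = "segfault",
                       context: int = 5) -> List[str]:
    """Return up to `context` lines before the first line containing
    `needle`, followed by the matching line itself.

    Returns an empty list if no line contains `needle`.
    """
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - context)
            return lines[start:i + 1]
    return []


if __name__ == "__main__":
    # Placeholder lines only (hypothetical, not taken from the attached dump).
    sample = [
        "placeholder: unrelated message",
        "placeholder: Started what-just-happened.timer",
        "placeholder: systemd segfault",
        "placeholder: later message",
    ]
    for entry in lines_before_match(sample, context=1):
        print(entry)
```

To run it against a real log, the list could be read from a file with `open(path).read().splitlines()`, where the path depends on where the syslog was extracted.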
Steps to reproduce the issue:
This issue seems to occur with a very small probability on switch boot. Initially observed on a Mellanox MSN2100. No particular conditions have been observed to trigger this.
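As a rough illustration of why a failed reproduction attempt says little when the per-boot crash probability is small, the sketch below computes the chance of seeing no crash at all in a given number of boots, and the number of boots needed to see at least one crash with a chosen confidence. It assumes each boot is independent with the same crash probability; the probability values used are hypothetical, since the actual rate has not been measured.

```python
import math


def prob_no_crash(p: float, boots: int) -> float:
    """Probability of observing zero crashes in `boots` independent boots,
    where each boot crashes with probability `p`."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return (1.0 - p) ** boots


def boots_needed(p: float, confidence: float) -> int:
    """Smallest number of boots such that at least one crash is seen
    with probability >= `confidence`, assuming independent boots."""
    if not 0.0 < p < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    # Hypothetical per-boot crash probability, for illustration only.
    p = 0.01
    print(prob_no_crash(p, 100))
    print(boots_needed(p, 0.95))
```

With a low per-boot probability, even a long run of clean boots on another platform is weak evidence that the platform is unaffected.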
Describe the results you received:
Systemd segfaults with:
This leaves multiple services critical to SONiC, such as swss, unable to start.
Describe the results you expected:
No segfault.
Output of show version: