xrootd file open on the grid sometimes fail with status code 139 #6948

Closed
rdschaffer opened this issue Dec 14, 2020 · 117 comments

@rdschaffer

Hi there,

Running root-based reading analysis jobs in ATLAS, we are having problems trying to understand why some jobs fail on certain sites at file open when reading remote files with xrootd. We are using ROOT version 6.18/04. (I don't think that we have problems with 6.16/00, and a few tests indicate that 6.20/06 also had this problem.)

What we see is that for a file open:

    std::unique_ptr< TFile > ifile( TFile::Open( file.c_str(), "READ" ) );

on a grid site node, the job exits with status code 139, which I believe is SIGURG - Urgent condition on socket (4.2BSD).
The status code from TApplication::HandleException is 128 + root enum, and 11 is kSigUrgent.
See:
https://root.cern.ch/doc/master/TApplication_8cxx_source.html#l00602
https://root.cern.ch/doc/master/TSysEvtHandler_8h_source.html#l00107
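
For comparison, here is a minimal sketch of another way to read the status, assuming the usual shell convention that a process terminated by signal N is reported as 128 + N. Under that reading, 139 maps to signal 11, which on Linux is SIGSEGV rather than SIGURG. Whether the ROOT enum or the OS signal number is the relevant one here is not established by this snippet, and the helper names are made up for illustration.

```cpp
#include <cassert>
#include <signal.h>
#include <string.h>
#include <string>

// Returns the signal number encoded in a shell-style exit status (128 + N),
// or 0 if the status does not look like "terminated by a signal".
int signal_from_exit_status(int status) {
    assert(status >= 0 && status <= 255);
    return status > 128 ? status - 128 : 0;
}

// Human-readable description of a signal number, as reported by the C library.
std::string describe_signal(int signo) {
    const char* text = strsignal(signo);
    return text ? text : "unknown signal";
}
```

Calling `describe_signal(signal_from_exit_status(139))` on the worker node shows how the local C library names signal 11.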

Running the same program interactively on the same file works fine. And it seems that only some sites with remote reading are failing. So we would like to ask for help in trying to track this down.

Currently, there is no stack trace to help understand things, and a simple 'print' just after TFile::Open is not printed.

I tried to add:

gApplication->ExitOnException( TApplication::kDontExit );

thinking that TApplication::HandleException (https://root.cern.ch/doc/master/TApplication_8cxx_source.html#l00602) might throw an exception, but this does not work.

So suggestions would be welcome. Is there a way to get a stack trace or more information on what is going on in the I/O part of this file open?
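
One possible way to get at least a raw stack trace from a batch job is sketched below, assuming a glibc system where `backtrace` from `<execinfo.h>` is available. The helper names are hypothetical. ROOT may install its own signal handlers, so the order of installation matters and is not verified here. A crash caused by stack overflow would additionally need an alternate signal stack (`sigaltstack`), which this sketch leaves out.

```cpp
#include <cassert>
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

// Writes the current call stack to stderr and returns the number of frames.
int dump_backtrace() {
    void* frames[64];
    const int count = backtrace(frames, 64);
    assert(count >= 0);
    backtrace_symbols_fd(frames, count, STDERR_FILENO);
    return count;
}

// Signal handler: print the stack, then restore the default action and
// re-raise so that the job still ends with the original signal.
extern "C" void crash_handler(int signo) {
    dump_backtrace();
    signal(signo, SIG_DFL);
    raise(signo);
}

// Hypothetical helper: call once early in main(), before the first file open.
bool install_crash_handler() {
    struct sigaction action {};
    action.sa_handler = crash_handler;
    sigemptyset(&action.sa_mask);
    return sigaction(SIGSEGV, &action, nullptr) == 0;
}
```

Without `-rdynamic` the output is mostly raw addresses, which can still be matched against the libraries afterwards with tools such as addr2line.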

I don't know how to add in watchers for people in ATLAS, or a mailing list. But I did find @krasznaa.

                          thanks much, RD
@krasznaa
Contributor

Unfortunately this is the sort of issue that could have been easier to track/discuss on JIRA. But since ROOT doesn't use that anymore, here we go...

My suspicion is that the grid nodes in question put some locally installed XRootD version high up in the library search path of the jobs. I don't know how they would do that, but that's my educated guess.

ATLAS analysis releases using ROOT 6.18/04 (https://gitlab.cern.ch/atlas/atlasexternals/-/blob/1.0.65/External/ROOT/CMakeLists.txt) use XRootD 4.10.0 (https://gitlab.cern.ch/atlas/atlasexternals/-/blob/1.0.65/External/XRootD/CMakeLists.txt), while releases using ROOT 6.16/00 (https://gitlab.cern.ch/atlas/atlasexternals/-/blob/1.0.60/External/ROOT/CMakeLists.txt) used XRootD 4.8.4 (https://gitlab.cern.ch/atlas/atlasexternals/-/blob/1.0.60/External/XRootD/CMakeLists.txt). My educated guess is that the XRootD version force-fed into your jobs @rdschaffer is binary compatible with XRootD 4.8.4, but not with 4.10.0 (or newer).

However we definitely need some follow up from our grid experts on this. @rodwalker would it be possible to look at the problematic jobs / grid nodes for this?

Cheers,
Attila

@rodwalker

rodwalker commented Dec 14, 2020 via email

@krasznaa
Contributor

Hi Rod,

What does

LD_PRELOAD=/srv/workDir/96340ef3-75b1-46cf-8910-8a2f76b7068c/$LIB/wrapper.so

do? That would be my first suspect. Since $LD_LIBRARY_PATH lists our software directories in the correct order, based on just that XRootD should be found under:

[bash][thor]:~ > ls -l /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrd*
lrwxrwxrwx 1 cvmfs cvmfs      19 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdAppUtils.so -> libXrdAppUtils.so.1
lrwxrwxrwx 1 cvmfs cvmfs      23 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdAppUtils.so.1 -> libXrdAppUtils.so.1.0.0
-rwxr-xr-x 1 cvmfs cvmfs   74512 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdAppUtils.so.1.0.0
-rwxr-xr-x 1 cvmfs cvmfs   18432 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdBlacklistDecision-4.so
-rwxr-xr-x 1 cvmfs cvmfs   82136 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdBwm-4.so
-rwxr-xr-x 1 cvmfs cvmfs   13552 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdCksCalczcrc32-4.so
lrwxrwxrwx 1 cvmfs cvmfs      17 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdClient.so -> libXrdClient.so.2
lrwxrwxrwx 1 cvmfs cvmfs      21 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdClient.so.2 -> libXrdClient.so.2.0.0
-rwxr-xr-x 1 cvmfs cvmfs  663320 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdClient.so.2.0.0
-rwxr-xr-x 1 cvmfs cvmfs   42096 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdClProxyPlugin-4.so
lrwxrwxrwx 1 cvmfs cvmfs      13 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdCl.so -> libXrdCl.so.2
lrwxrwxrwx 1 cvmfs cvmfs      17 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdCl.so.2 -> libXrdCl.so.2.0.0
-rwxr-xr-x 1 cvmfs cvmfs 1416944 Sep 10 03:20 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdCl.so.2.0.0
lrwxrwxrwx 1 cvmfs cvmfs      21 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdCryptoLite.so -> libXrdCryptoLite.so.1
lrwxrwxrwx 1 cvmfs cvmfs      25 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdCryptoLite.so.1 -> libXrdCryptoLite.so.1.0.0
-rwxr-xr-x 1 cvmfs cvmfs   13632 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdCryptoLite.so.1.0.0
lrwxrwxrwx 1 cvmfs cvmfs      17 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdCrypto.so -> libXrdCrypto.so.1
lrwxrwxrwx 1 cvmfs cvmfs      21 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdCrypto.so.1 -> libXrdCrypto.so.1.0.0
-rwxr-xr-x 1 cvmfs cvmfs  129112 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdCrypto.so.1.0.0
-rwxr-xr-x 1 cvmfs cvmfs  222064 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdCryptossl-4.so
lrwxrwxrwx 1 cvmfs cvmfs      14 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdFfs.so -> libXrdFfs.so.2
lrwxrwxrwx 1 cvmfs cvmfs      18 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdFfs.so.2 -> libXrdFfs.so.2.0.0
-rwxr-xr-x 1 cvmfs cvmfs   65152 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdFfs.so.2.0.0
-rwxr-xr-x 1 cvmfs cvmfs  271416 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdFileCache-4.so
-rwxr-xr-x 1 cvmfs cvmfs   13104 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdHttp-4.so
-rwxr-xr-x 1 cvmfs cvmfs  115880 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdHttpTPC-4.so
lrwxrwxrwx 1 cvmfs cvmfs      20 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdHttpUtils.so -> libXrdHttpUtils.so.1
lrwxrwxrwx 1 cvmfs cvmfs      24 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdHttpUtils.so.1 -> libXrdHttpUtils.so.1.0.0
-rwxr-xr-x 1 cvmfs cvmfs  206640 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdHttpUtils.so.1.0.0
-rwxr-xr-x 1 cvmfs cvmfs   18824 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdN2No2p-4.so
-rwxr-xr-x 1 cvmfs cvmfs   13304 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdOssSIgpfsT-4.so
lrwxrwxrwx 1 cvmfs cvmfs      23 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdPosixPreload.so -> libXrdPosixPreload.so.1
lrwxrwxrwx 1 cvmfs cvmfs      27 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdPosixPreload.so.1 -> libXrdPosixPreload.so.1.0.0
-rwxr-xr-x 1 cvmfs cvmfs   87568 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdPosixPreload.so.1.0.0
lrwxrwxrwx 1 cvmfs cvmfs      16 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdPosix.so -> libXrdPosix.so.2
lrwxrwxrwx 1 cvmfs cvmfs      20 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdPosix.so.2 -> libXrdPosix.so.2.0.0
-rwxr-xr-x 1 cvmfs cvmfs  195944 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdPosix.so.2.0.0
-rwxr-xr-x 1 cvmfs cvmfs 1001552 Sep 10 03:26 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdProofd.so
-rwxr-xr-x 1 cvmfs cvmfs   83216 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdPss-4.so
-rwxr-xr-x 1 cvmfs cvmfs   70544 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSec-4.so
-rwxr-xr-x 1 cvmfs cvmfs  220600 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSecgsi-4.so
-rwxr-xr-x 1 cvmfs cvmfs   19480 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSecgsiAUTHZVO-4.so
-rwxr-xr-x 1 cvmfs cvmfs   23808 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSecgsiGMAPDN-4.so
-rwxr-xr-x 1 cvmfs cvmfs   53384 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSeckrb5-4.so
-rwxr-xr-x 1 cvmfs cvmfs   25152 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSecProt-4.so
-rwxr-xr-x 1 cvmfs cvmfs  142864 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSecpwd-4.so
-rwxr-xr-x 1 cvmfs cvmfs   45192 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSecsss-4.so
-rwxr-xr-x 1 cvmfs cvmfs   19320 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSecunix-4.so
lrwxrwxrwx 1 cvmfs cvmfs      17 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdServer.so -> libXrdServer.so.2
lrwxrwxrwx 1 cvmfs cvmfs      21 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdServer.so.2 -> libXrdServer.so.2.0.0
-rwxr-xr-x 1 cvmfs cvmfs 1040472 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdServer.so.2.0.0
-rwxr-xr-x 1 cvmfs cvmfs  134808 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSsi-4.so
lrwxrwxrwx 1 cvmfs cvmfs      17 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSsiLib.so -> libXrdSsiLib.so.1
lrwxrwxrwx 1 cvmfs cvmfs      21 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSsiLib.so.1 -> libXrdSsiLib.so.1.0.0
-rwxr-xr-x 1 cvmfs cvmfs  161352 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSsiLib.so.1.0.0
-rwxr-xr-x 1 cvmfs cvmfs   18544 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSsiLog-4.so
lrwxrwxrwx 1 cvmfs cvmfs      19 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSsiShMap.so -> libXrdSsiShMap.so.1
lrwxrwxrwx 1 cvmfs cvmfs      23 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSsiShMap.so.1 -> libXrdSsiShMap.so.1.0.0
-rwxr-xr-x 1 cvmfs cvmfs   39624 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdSsiShMap.so.1.0.0
-rwxr-xr-x 1 cvmfs cvmfs   76664 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdThrottle-4.so
lrwxrwxrwx 1 cvmfs cvmfs      16 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdUtils.so -> libXrdUtils.so.2
lrwxrwxrwx 1 cvmfs cvmfs      20 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdUtils.so.2 -> libXrdUtils.so.2.0.0
-rwxr-xr-x 1 cvmfs cvmfs  763032 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdUtils.so.2.0.0
lrwxrwxrwx 1 cvmfs cvmfs      14 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdXml.so -> libXrdXml.so.2
lrwxrwxrwx 1 cvmfs cvmfs      18 Sep 10 13:12 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdXml.so.2 -> libXrdXml.so.2.0.0
-rwxr-xr-x 1 cvmfs cvmfs  122928 Sep 10 03:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdXml.so.2.0.0
-rwxr-xr-x 1 cvmfs cvmfs   13104 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdXrootd-4.so
[bash][thor]:~ >

Do you know what that preload is (supposed to be) doing exactly?

Cheers,
Attila

@rodwalker

rodwalker commented Dec 14, 2020 via email

@rdschaffer
Author

Hey @Axel-Naumann,

Have you found a moment to have a look at this?

    see you, RD

@Axel-Naumann
Member

@simonmichal would you have a recommendation what to look at?

@simonmichal

Well, I doubt there is any out-of-band data being sent or received. @rodwalker, @rdschaffer would it be possible to reproduce the problem with xrootd client logs switched on (XRD_LOGLEVEL=Dump)?

Regarding ABI compatibility, we ensure ABI forward compatibility, meaning that it is safe to run an application built against an older version of xrootd with a newer version of the library (e.g. one can build an application with, say, 4.11.0 and then link with 4.12.0). The opposite is not possible. This applies to all releases in the 4.x.x series; the ABI was broken when we moved to XRootD5.
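
The rule stated here can be written down as a small check. This is only a sketch of the rule as described in this comment: the type and function names are hypothetical and not part of the XRootD API, and applying the same rule within other major series is an assumption.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical value type; not part of the XRootD API.
struct ClientVersion {
    int series;   // e.g. 4 for 4.10.0
    int release;  // e.g. 10
    int patch;    // e.g. 0
};

// True if an application built against `built` may run with library `runtime`:
// same major series, and the runtime library is not older than the build one.
bool can_run_against(const ClientVersion& built, const ClientVersion& runtime) {
    assert(built.series >= 0 && runtime.series >= 0);
    if (built.series != runtime.series) return false;  // ABI broken across series
    return std::tie(runtime.release, runtime.patch) >=
           std::tie(built.release, built.patch);
}
```

With the versions mentioned earlier in the thread, code built against 4.10.0 but run with a 4.8.4 library would fail this check.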

@rdschaffer
Author

OK, I ran with XRD_LOGLEVEL=Dump, and you can see the response after

=== stderr ===

saying:

Unable to process directory /alrb/.xrootd/client.plugins.d: [ERROR] OS Error: No such file or directory

Log file:

xrootd_error_on_grid.pdf

The file:

root://marsedpm.in2p3.fr:1094//dpm/in2p3.fr/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/ab/DAOD_HIGG2D1.23315577._000001.pool.root.1

of course opens correctly with a simple TFile::Open in any interactive ROOT session.

          see you, RD

@rdschaffer
Author

The above is running in Marseilles: CCIN2P3-CCPM.

Another for reading from eos from the CERN-T0 facility:

xrootd_error_on_grid_CERN_T0.pdf

@simonmichal

Is it possible to determine the exact version of xrootd client that is being used? Unfortunately, the crash happens before the client logs in so I cannot see it from logs. The server reported protocol version 500 in the xrootd handshake.

@rdschaffer
Author

Does this help:

2020-12-16 12:22:18,612 | INFO | Thread-1 | gfal2 | connect | [gfal_module_load] plugin /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/emi/4.0.2-1_200423.fix1/usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success

@rdschaffer
Author

Marseilles job logs are in:

marseilles

and Cern jobs logs are in:
Cern

@rodwalker

rodwalker commented Dec 16, 2020 via email

@krasznaa
Contributor

Hi Rod,

😕 So, how did you compile that code exactly? Just g++ main.cxx, right?

In that case XRootD would be picked up from /usr, which doesn't tell us much about our problem, since RD's test job will pick up XRootD from:

/cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/

This is why I said at the beginning, that I'm suspicious about the LD_PRELOAD setting. If that library wants to use XRootD, but it was compiled against a different version of XRootD than what the analysis release comes with, then we're in trouble. Note that all ATLAS releases come with their own version of XRootD, not just the analysis releases. So any grid node setup that wants to force one particular version of XRootD on the job, will give us a really bad time...

Best,
Attila

@rodwalker

rodwalker commented Dec 16, 2020 via email

@krasznaa
Contributor

Hmm... That in principle looks fine... So okay, your test job is relevant.

Unfortunately I'm running out of ideas. The XRootD build in AnalysisBaseExternals does depend on a couple of libraries from the OS. But these should only be things that are part of HEP_OSlibs. So the worker nodes should not really have different versions of them...

[bash][lxplus730]:~ > ldd -r /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrd*.so | grep " /lib" | sed "s/\(.*\) (0x.*)/\1/g" | sort | uniq 
	libc.so.6 => /lib64/libc.so.6
	libcom_err.so.2 => /lib64/libcom_err.so.2
	libcrypt.so.1 => /lib64/libcrypt.so.1
	libcrypto.so.10 => /lib64/libcrypto.so.10
	libcurl.so.4 => /lib64/libcurl.so.4
	libdl.so.2 => /lib64/libdl.so.2
	libfreebl3.so => /lib64/libfreebl3.so
	libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2
	libidn.so.11 => /lib64/libidn.so.11
	libk5crypto.so.3 => /lib64/libk5crypto.so.3
	libkeyutils.so.1 => /lib64/libkeyutils.so.1
	libkrb5.so.3 => /lib64/libkrb5.so.3
	libkrb5support.so.0 => /lib64/libkrb5support.so.0
	liblber-2.4.so.2 => /lib64/liblber-2.4.so.2
	libldap-2.4.so.2 => /lib64/libldap-2.4.so.2
	libm.so.6 => /lib64/libm.so.6
	libnspr4.so => /lib64/libnspr4.so
	libnss3.so => /lib64/libnss3.so
	libnssutil3.so => /lib64/libnssutil3.so
	libpcre.so.1 => /lib64/libpcre.so.1
	libplc4.so => /lib64/libplc4.so
	libplds4.so => /lib64/libplds4.so
	libpthread.so.0 => /lib64/libpthread.so.0
	libresolv.so.2 => /lib64/libresolv.so.2
	librt.so.1 => /lib64/librt.so.1
	libsasl2.so.3 => /lib64/libsasl2.so.3
	libselinux.so.1 => /lib64/libselinux.so.1
	libsmime3.so => /lib64/libsmime3.so
	libssh2.so.1 => /lib64/libssh2.so.1
	libssl.so.10 => /lib64/libssl.so.10
	libssl3.so => /lib64/libssl3.so
	libz.so.1 => /lib64/libz.so.1
[bash][lxplus730]:~ >

Could the version of some of these not be "well defined" on the grid nodes?

@rodwalker

rodwalker commented Dec 16, 2020 via email

@rdschaffer
Author

rdschaffer commented Dec 16, 2020

Hi Rod,

Well, I added a 'print' before and after 1244 in the current jobs - didn't check it in. So this looks like:

      ATH_MSG_INFO( "processEvents: try to open file: " << file );

      std::unique_ptr< TFile > ifile( TFile::Open( file.c_str(), "READ" ) );

      ATH_MSG_INFO( "processEvents: called TFile Open " );

and in the log, one sees:

H4lAnalRun2 INFO processEvents: try to open file: root://eosatlas.cern.ch:1094//eos/atlas/atlasdatadisk/rucio/mc16_13TeV/25/31/DAOD_HIGG2D1.23315648._000001.pool.root.1

=== stderr ===
[2020-12-16 13:29:01.003032 +0100][Debug ][Utility ] Unable to process user config file: [ERROR] OS Error: No such file or directory
[2020-12-16 13:29:01.018152 +0100][Debug ][PlugInMgr ] Initializing plug-in manager...
[2020-12-16 13:29:01.018254 +0100][Debug ][PlugInMgr ] No default plug-in, loading plug-in configs...
[2020-12-16 13:29:01.018302 +0100][Debug ][PlugInMgr ] Processing plug-in definitions in /etc/xrootd/client.plugins.d...
[2020-12-16 13:29:01.020375 +0100][Debug ][PlugInMgr ] Processing plug-in definitions in /alrb/.xrootd/client.plugins.d...
[2020-12-16 13:29:01.020433 +0100][Debug ][PlugInMgr ] Unable to process directory /alrb/.xrootd/client.plugins.d: [ERROR] OS Error: No such file or directory
[2020-12-16 13:29:02.298776 +0100][Dump ][Utility ] URL: root://eosatlas.cern.ch//eos/atlas/atlasdatadisk/rucio/mc16_13TeV/25/31/DAOD_HIGG2D1.23315648._000001.pool.root.1

So one sees the 'try to open file', then there is the TFile::Open, and nothing else. So I conclude that this is coming from the Open.

     see you, RD

@rdschaffer
Author

rdschaffer commented Dec 17, 2020

Well, one thing that is clear is that this problem seems to be associated with specific sites. For my 'test' job:

test job

The sites that are successful either have local reading, or they use xrootd without problems. The latter are:
SWT2_CPB
IN2P3-LPSC_LAKE
RAL

For the failures, these are all just xrootd problems, at sites:
IN2P3-CPPM
CERN-T0
TOKYO
BNL

So I would suspect some difference in the xrootd installation between these two groups of sites. (I personally have no idea how to check this.)

@simonmichal

simonmichal commented Dec 17, 2020

@rdschaffer : could you add the following code to your job:

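#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* dl_iterate_phdr is a GNU extension */
#endif
#include <link.h>
#include <stdlib.h>
#include <stdio.h>

/* Called once per loaded object: the main program and each shared library. */
static int
callback(struct dl_phdr_info *info, size_t size, void *data)
{
    int j;

    (void) size;   /* unused */
    (void) data;   /* unused */

    printf("name=%s (%d segments)\n", info->dlpi_name,
           info->dlpi_phnum);

    /* Print the load address of every program header of this object. */
    for (j = 0; j < info->dlpi_phnum; j++)
        printf("\t\t header %2d: address=%10p\n", j,
               (void *) (info->dlpi_addr + info->dlpi_phdr[j].p_vaddr));
    return 0;   /* returning non-zero would stop the iteration */
}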

and then at the beginning of your main:

dl_iterate_phdr(callback, NULL);

This will print the paths of all shared libraries loaded at that point to stdout.
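
A variant that may be easier to scan in a long job log is sketched below: it collects only the objects whose path contains a given substring (for example "Xrd"). The struct and function names are hypothetical. Note that `dl_iterate_phdr` only reports objects loaded at the time of the call, so libraries that are pulled in later via `dlopen` show up only if the call is made after they have been loaded. Whether that applies to the XRootD client in these jobs is an assumption, not something checked here.

```cpp
#include <cassert>
#include <link.h>
#include <string>
#include <vector>

namespace {
// State passed through dl_iterate_phdr's opaque data pointer.
struct LibraryQuery {
    std::string needle;
    std::vector<std::string> hits;
};

// Called once per loaded object; records names that contain the needle.
int collect_matching(struct dl_phdr_info* info, size_t, void* data) {
    assert(data != nullptr);
    LibraryQuery* query = static_cast<LibraryQuery*>(data);
    const std::string name = info->dlpi_name ? info->dlpi_name : "";
    if (name.find(query->needle) != std::string::npos) {
        query->hits.push_back(name);
    }
    return 0;  // keep iterating
}
}  // namespace

// Returns the paths of all currently loaded objects containing `needle`.
// The main executable is reported with an empty name.
std::vector<std::string> loaded_objects_matching(const std::string& needle) {
    LibraryQuery query{needle, {}};
    dl_iterate_phdr(collect_matching, &query);
    return query.hits;
}
```

Calling `loaded_objects_matching("Xrd")` once at the start of main and once right before the file open would show whether the XRootD libraries arrive in between.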

@rdschaffer
Author

Hi @simonmichal,

Jobs are running. For a "failed" job at our CERN T0 reading from eos, have a look here. Let me know if you don't have access, and I'll make a pdf file.

        see you, RD

@rdschaffer
Author

I don't see libXrxxx in the list. Would this appear later after a request to xrootd has been made?

I put the call at the very beginning, as suggested:

int main( int argc, char* argv[] ) {
    // set up callback for debugging the xrootd problem - RDS 2020/12
    dl_iterate_phdr(callback, NULL);

 see you, RD

@simonmichal

@rodwalker : hmm, let me dwell on this for a minute ...

@rdschaffer
Author

And here is a job output from RAL where xrootd seems to work.

More generally, here is a jobset with jobs that succeed at some sites (with xrootd or file staging) and fail with xrootd at other sites.

@simonmichal

I'm a bit puzzled here: if I link a dummy main with libNetxNG.so and print all loaded shared libs with dl_iterate_phdr, the output includes the xrootd libs. @rdschaffer : could you try moving the dl_iterate_phdr call to just before the open request gets issued? Maybe this will help (fingers crossed).

@rodwalker

rodwalker commented Jan 14, 2021 via email

@rodwalker

rodwalker commented Jan 14, 2021 via email

@simonmichal

simonmichal commented Jan 14, 2021

@rodwalker : before examining the core dump, could you install xrootd-debuginfo? Then, in gdb, go to frame 4 (type: f 4); you should then be in this source file:

https://github.com/xrootd/xrootd/blob/93871f8241e478a308c8e722fd99aeaa08ff6459/src/XrdNet/XrdNetAddr.cc#L268

could you then print the value of iP variable?

@Axel-Naumann
Member

@simonmichal XrdNetUtils::MyHostName() does getaddrinfo() for the local iface? Looks like that fails here? Okay stepping back to the sideline to watch ;-)

@rodwalker

rodwalker commented Jan 14, 2021 via email

@simonmichal

@rodwalker : yes, it should be enough, thanks a lot

@simonmichal

@rodwalker : against which version of xrootd was the executable built? the one on lxplus or some version from cvmfs?

@rdschaffer
Author

Hi @simonmichal,

We build against cvmfs. Attila gave the versions above in his Dec 14th entry: we use XRootD 4.10.0.

@rdschaffer
Author

-rwxr-xr-x 1 cvmfs cvmfs 13104 Sep 10 03:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.2/AnalysisBaseExternals/21.2.139/InstallArea/x86_64-centos7-gcc8-opt/lib/libXrdXrootd-4.so

(from above)

@rdschaffer
Author

Could it be that the iP obtained from the 'host name' which in the log seems to be:

Attempting connection to [::ffff:10.42.38.55]:1096

will have iP == '[' ?? Looks like the name before the first ':' is iP, if I am not mistaken...

@simonmichal

It could also be that the call to gethostname failed and there's simply garbage:
https://github.com/xrootd/xrootd/blob/master/src/XrdNet/XrdNetUtils.cc#L622

@Axel-Naumann : are there debug symbols on cvmfs? Were the libs rebuilt or installed from EPEL?

@rdschaffer
Author

Right. It is trying to get the local host name, not for the file...

@rdschaffer
Author

Rob found: acas1035.usatlas.bnl.gov for gethostname in some C code... Should he also try to call getaddrinfo? Not sure if hostHints is default or not.

@simonmichal

Well, it segvs in here: https://github.com/xrootd/xrootd/blob/stable-4.12.x/src/XrdNet/XrdNetAddr.cc#L268, right?
iP is either hSpec or a substring of hSpec, which is the buffer filled in by gethostname in here: https://github.com/xrootd/xrootd/blob/master/src/XrdNet/XrdNetUtils.cc#L622
hostHints is a global static defined here: https://github.com/xrootd/xrootd/blob/stable-4.12.x/src/XrdNet/XrdNetAddr.cc#L79; it must be a valid pointer.

For now, the only scenario I see where something could go wrong is when gethostname fails.

That said, there are not so many reasons for gethostname to fail:

       EFAULT name is an invalid address.

       EINVAL len is negative or, for sethostname(), len is larger than
              the maximum allowed size.

       ENAMETOOLONG
              (glibc gethostname()) len is smaller than the actual size.
              (Before version 2.1, glibc uses EINVAL for this case.)

       EPERM  For sethostname(), the caller did not have the
              CAP_SYS_ADMIN capability in the user namespace associated
              with its UTS namespace (see namespaces(7)).
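
To test this theory on a worker node outside of XRootD, something along these lines could be run. It is a sketch, not a copy of what XrdNetUtils does: the function names are hypothetical, and the hints used here are an assumption that need not match XRootD's hostHints.

```cpp
#include <cassert>
#include <netdb.h>
#include <string>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Stores the local host name in `out`; returns false if gethostname fails.
bool local_host_name(std::string& out) {
    char buffer[256] = {};
    if (gethostname(buffer, sizeof(buffer)) != 0) return false;
    buffer[sizeof(buffer) - 1] = '\0';  // POSIX does not guarantee termination on truncation
    out = buffer;
    return true;
}

// Resolves `host`; returns 0 on success, otherwise the getaddrinfo error
// code, which can be turned into text with gai_strerror().
int resolve_host(const std::string& host) {
    assert(!host.empty());
    addrinfo hints{};
    hints.ai_family = AF_UNSPEC;      // assumption: accept IPv4 and IPv6
    hints.ai_socktype = SOCK_STREAM;  // assumption: TCP
    addrinfo* result = nullptr;
    const int rc = getaddrinfo(host.c_str(), nullptr, &hints, &result);
    if (rc == 0) freeaddrinfo(result);
    return rc;
}
```

Feeding the result of `local_host_name` into `resolve_host` exercises the same gethostname plus getaddrinfo sequence discussed above, with explicit error checks on both steps.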

@rdschaffer
Author

Is there a way to see the variable values in the core with gdb? I don't think that we can understand this without seeing them. A simple gethostname works properly, as expected.

@rodwalker

rodwalker commented Jan 14, 2021 via email

@simonmichal

To inspect the variables we need debug symbols; they need to come from the same build because gdb validates the CRC.

@rdschaffer
Author

@krasznaa, is it possible to make a dbg build for xrootd that would be available on cvmfs? Not sure if this is easy.

@simonmichal

Normally the release build is stripped of debug symbols and they are installed in a separate location (e.g. /usr/lib/debug). You guys don't do this for cvmfs builds?

@Axel-Naumann
Member

@gganis @peremato would you know whether the xrootd libraries have their symbols stripped, or who might know?

@rodwalker

rodwalker commented Jan 15, 2021 via email

@rdschaffer
Author

Hi there,

OK, I think that I have found the culprit, but I don't understand the reasons. The main difference of our current 'minitree production' jobs this round is that we included a build of the MCFM physics generator with the build of our analysis code and we use it while running to calculate physics matrix elements for each event.

So to test this, I rebuilt our analysis code without building MCFM, and of course no longer calculate the matrix elements. I submitted this to the BNL site, and we read the files fine:

H4lAnalRun2 INFO processEvents: try to open file: root://dcgftp.usatlas.bnl.gov:1096//pnfs/usatlas.bnl.gov/BNLT0D1/rucio/mc16_13TeV/84/1f/DAOD_HIGG2D1.23315636._000001.pool.root.1
H4lAnalRun2 INFO processEvents: called TFile Open
H4lAnalRun2 INFO processEvents - opened file 0 root://dcgftp.usatlas.bnl.gov:1096//pnfs/usatlas.bnl.gov/BNLT0D1/rucio/mc16_13TeV/84/1f/DAOD_HIGG2D1.23315636._000001.pool.root.1
H4lAnalRun2 INFO notifyNewFile: Entering

which as you may remember is the TFile::Open on a file which would use xrootd access. The job continues fine, reading 6 files, as one would expect.

Now there is no matrix element calculated before we start reading the events. So it must be that somehow linking in the MCFM libraries causes problems for calling the gethostname. I must admit that I have no idea how/why this would 'interfere', since MCFM is not run at all before the TFile::Open.

So I think that we can let this bug report rest for now. If anyone might have ideas on how to check or fix the MCFM problem, suggestions are welcome. But I no longer think that xrootd has a problem. This is clearly a problem in how we have set up our client code.

Thanks all for your time spent on this!

     see you, RD

@Axel-Naumann
Member

@simonmichal just FYI - it's not that _nss_dns_gethostbyname4_r (gethostbyname) fails - it crashes!

@rdschaffer my first guess would be stack exhaustion. You can check by changing the gdb -ex invocation from thread apply all bt to print (char *)_environ - (char *)$sp; or experimentally by setting the thread stack size to something very high, ulimit -s 67108864. Another cause might be symbols that the MCFM library exports but shouldn't, and that interfere with glibc; you can check with nm --defined-only -g libMCFM.so | grep -v ' [Wug] ' | grep -v ' _Z' | less (or whatever the MCFM library is called).
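
To see which stack limit a grid job actually runs with, the limit can also be read from inside the process. A minimal sketch, assuming a POSIX system; the struct and function names are hypothetical. Keep in mind that bash's ulimit -s counts in units of 1024 bytes, while getrlimit reports bytes.

```cpp
#include <cassert>
#include <sys/resource.h>

// Hypothetical holder for the soft stack limit of the current process.
struct StackLimit {
    bool unlimited;
    unsigned long long soft_bytes;  // meaningful only if !unlimited
};

// Reads the soft RLIMIT_STACK value; returns false if getrlimit fails.
bool query_stack_limit(StackLimit& out) {
    struct rlimit limit {};
    if (getrlimit(RLIMIT_STACK, &limit) != 0) return false;
    out.unlimited = (limit.rlim_cur == RLIM_INFINITY);
    out.soft_bytes = out.unlimited
                         ? 0ULL
                         : static_cast<unsigned long long>(limit.rlim_cur);
    return true;
}
```

Logging this value at job start on a failing and on a working site would show whether the two environments differ in this respect.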

@simonmichal

@Axel-Naumann : correct, just to clarify my theory, _nss_dns_gethostbyname4_r is called internally by getaddrinfo and the buffer handed over to getaddrinfo is initialised by gethostname. I was suspecting that the latter might have failed and as a result a buffer full of junk has been passed to getaddrinfo causing a segv (though, I realise the theory was rather vague ;-).

Anyway, I'm glad it has been sorted out :-)

@krasznaa
Contributor

Re-joining the discussion a bit late...

Installing debug symbols for our analysis releases on CVMFS would be pretty difficult. Our builds do produce a separate RPM for the debug symbols of our own code. (Though we didn't even use that machinery for the analysis releases yet.) But when we build XRootD for our standalone analysis release, we don't bother with the "RelWithDebInfo" CMake build mode.

https://gitlab.cern.ch/atlas/atlasexternals/-/blob/1.0/External/XRootD/CMakeLists.txt#L55-60

This is because every "external project" has a different implementation for this. And coding up how we would produce just one "ATLAS RPM" that contains just the debug symbol files for all the externals seemed way too much trouble. For such debugging we would use a full-on Debug build instead.

But the more relevant thing: Does XRootD, or any of the I/O libraries that it uses, make use of OpenMP? Putting aside all the weird linking issue possibilities, the one unusual thing that RD's MCFM build does is that it sets the following environment variable for the jobs:

export OMP_STACKSIZE=16000

This is apparently because MCFM does use OpenMP. (This I checked.) Though I don't know why this variable would need to be set manually.
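
For reference, the OpenMP specification treats a bare number in OMP_STACKSIZE as kilobytes and accepts the suffixes B, K, M and G, so 16000 means roughly 16 MB for each thread created by the OpenMP runtime. The sketch below converts such a value to bytes. The function name is hypothetical, overflow is not handled, and this is not how any particular OpenMP runtime parses the variable.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Converts an OMP_STACKSIZE-style value to bytes.
// Returns 0 if the text is not of the form "<number>[B|K|M|G]".
unsigned long long omp_stacksize_bytes(const std::string& text) {
    const char* p = text.c_str();
    while (std::isspace(static_cast<unsigned char>(*p))) ++p;
    if (!std::isdigit(static_cast<unsigned char>(*p))) return 0;

    char* end = nullptr;
    const unsigned long long count = std::strtoull(p, &end, 10);
    while (std::isspace(static_cast<unsigned char>(*end))) ++end;

    unsigned long long unit = 1024ULL;  // no suffix means kilobytes
    switch (std::toupper(static_cast<unsigned char>(*end))) {
        case '\0': break;
        case 'B': unit = 1ULL; ++end; break;
        case 'K': ++end; break;
        case 'M': unit = 1024ULL * 1024ULL; ++end; break;
        case 'G': unit = 1024ULL * 1024ULL * 1024ULL; ++end; break;
        default: return 0;
    }
    while (std::isspace(static_cast<unsigned char>(*end))) ++end;
    return *end == '\0' ? count * unit : 0;
}
```

This setting concerns the stacks of OpenMP worker threads, not the stack of the initial thread.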

With a quick Google search I saw that for instance gfal, at least at one point, used OpenMP. So I wonder if maybe OpenMP is responsible for something here. It shouldn't interfere with ROOT's usage of TBB (at least I don't think so), but maybe with some I/O library?

@simonmichal

@krasznaa : no, we don't use OpenMP in xrootd.

@krasznaa
Contributor

@rdschaffer, it could be interesting to try running your job at BNL with:

  • Removing the setting of OMP_STACKSIZE altogether;
  • Setting it to a different / larger value.

For both of these of course you need to edit

https://gitlab.cern.ch/HZZ/HZZSoftware/HZZTools/-/blob/master/MCFM_MatrixElement/MCFMEnvironmentConfig.cmake#L11

in your submission directory.

It's likely that OpenMP is just a red herring, but this way at least we would know for sure.

@rdschaffer
Author

@krasznaa neither removing OMP_STACKSIZE, nor setting it to 32000 has any effect. So we can rule out this one.
