Description
Hi there,
We run ROOT-based analysis jobs in ATLAS that read remote files with xrootd, and we are having trouble understanding why some jobs fail at file open on certain sites. We are using ROOT version 6.18/04. (I don't think we have this problem with 6.16/00, and a few tests indicate that 6.20/06 also has it.)
What we see is that for a file open:
std::unique_ptr< TFile > ifile( TFile::Open( file.c_str(), "READ" ) );
on a grid site node, the job exits with status code 139, which I believe is SIGURG - Urgent condition on socket (4.2BSD).
The status code from TApplication::HandleException is 128 + the value of the ROOT signal enum (ESignals), in which 11 is kSigUrgent.
See:
https://root.cern.ch/doc/master/TApplication_8cxx_source.html#l00602
https://root.cern.ch/doc/master/TSysEvtHandler_8h_source.html#l00107
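The arithmetic behind that reading is 139 = 128 + 11. As a cross-check, here is a minimal sketch that decodes a status under the usual shell convention of 128 + the OS signal number. The helper names signalFromExitStatus and describeExitStatus are made up for illustration. Note that the OS numbering differs from the ROOT enum: on Linux, signal 11 is SIGSEGV and SIGURG is 23. So which meaning applies to 139 depends on whether the status was produced by HandleException or reported by the shell or job wrapper after the process was killed by a signal.

```cpp
#include <cstring>   // strsignal (POSIX)
#include <string>

// Hypothetical helper: returns N if the status looks like 128 + N, otherwise -1.
int signalFromExitStatus(int status) {
    return (status > 128 && status < 256) ? status - 128 : -1;
}

// Hypothetical helper: describes the status using the OS signal numbering,
// which is not the same numbering as ROOT's ESignals enum.
std::string describeExitStatus(int status) {
    const int sig = signalFromExitStatus(status);
    if (sig < 0) {
        return "not a 128+N status";
    }
    const char* name = strsignal(sig);  // platform-dependent description
    return "signal " + std::to_string(sig) + " (" + (name ? name : "unknown") + ")";
}
```

Calling describeExitStatus(139) gives the OS-level description of signal 11 on the machine where it runs, which can then be compared with the ROOT enum interpretation.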
Running the same program interactively on the same file works fine. And it seems that only some sites with remote reading are failing. So we would like to ask for help in trying to track this down.
Currently, there is no stack trace to help understand things, and a simple 'print' just after TFile::Open is not printed.
I tried to add:
gApplication->ExitOnException( TApplication::kDontExit );
thinking that TApplication::HandleException (https://root.cern.ch/doc/master/TApplication_8cxx_source.html#l00602) might then throw an exception instead of exiting, but this does not work.
So suggestions would be welcome. Is there a way to get a stack trace or more information on what is going on in the I/O part of this file open?
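One generic way to get a raw stack trace at the point of a crash, independent of ROOT, is a backtrace handler built on glibc's execinfo.h. The sketch below is only an illustration under several assumptions: the names dumpBacktraceAndReraise and installCrashHandler are hypothetical, it needs glibc, and backtrace is not strictly async-signal-safe. ROOT also installs its own signal handlers, so whether this handler is still in place when the crash happens has not been verified here.

```cpp
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (glibc)
#include <signal.h>    // sigaction, raise
#include <unistd.h>    // STDERR_FILENO

// Hypothetical handler: writes the current call stack to stderr, then
// re-raises the signal so the default action (e.g. core dump) still happens.
extern "C" void dumpBacktraceAndReraise(int sig) {
    void* frames[64];
    const int n = backtrace(frames, 64);
    // Writes directly to the file descriptor, without going through stdio buffers.
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    // SA_RESETHAND has already restored the default disposition, so this
    // re-raised signal is handled by the default action once we return.
    raise(sig);
}

// Hypothetical installer: registers the handler for one signal, e.g. SIGSEGV.
// Returns true on success.
bool installCrashHandler(int sig) {
    struct sigaction sa {};
    sa.sa_handler = dumpBacktraceAndReraise;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESETHAND;  // one-shot: reset to default on entry
    return sigaction(sig, &sa, nullptr) == 0;
}
```

A call such as installCrashHandler(SIGSEGV) placed just before the TFile::Open line would be the intended use. Symbol names in the output are only readable if the binaries export them (for example when linked with -rdynamic), so the raw addresses may need to be resolved afterwards.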
I don't know how to add in watchers for people in ATLAS, or a mailing list. But I did find @krasznaa.
thanks much, RD