xrootd file open on the grid sometimes fails with status code 139 #6948

@rdschaffer

Hi there,

We run ROOT-based analysis jobs in ATLAS that read remote files with xrootd, and we are trying to understand why some of these jobs fail at file open on certain sites. We are using ROOT version 6.18/04. (I don't think we have this problem with 6.16/00, and a few tests indicate that 6.20/06 is also affected.)

What we see is that for a file open:

    std::unique_ptr< TFile > ifile( TFile::Open( file.c_str(), "READ" ) );

on a grid site node, the job exits with status code 139, which I believe corresponds to SIGURG (urgent condition on socket, 4.2BSD).
The status code set by TApplication::HandleException is 128 + the ROOT signal enum value, and 11 is kSigUrgent.
See:
https://root.cern.ch/doc/master/TApplication_8cxx_source.html#l00602
https://root.cern.ch/doc/master/TSysEvtHandler_8h_source.html#l00107
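
As a cross-check on the 128 + N reading above, here is a minimal sketch that decodes an exit status under the usual shell convention (128 + POSIX signal number). The helper names `signalFromExitStatus` and `describePosixSignal` are hypothetical, not part of ROOT. Under this convention N is the operating system's signal number rather than the ROOT enum, and on Linux signal 11 is SIGSEGV (SIGURG is 23 there), so 139 may also simply indicate a segmentation fault that was not routed through TApplication::HandleException.

```cpp
#include <csignal>
#include <string>
#include <string.h>  // strsignal (POSIX)

// Hypothetical helper: extract N from an exit status of the form 128 + N.
// Returns -1 if the status does not look like a signal-derived code.
int signalFromExitStatus(int status) {
    return (status > 128 && status < 256) ? status - 128 : -1;
}

// Hypothetical helper: describe N using the host's POSIX signal numbering.
// The text returned by strsignal is platform dependent.
std::string describePosixSignal(int signo) {
    const char* text = strsignal(signo);
    return text ? std::string(text) : std::string("unknown signal");
}
```

Calling `describePosixSignal(signalFromExitStatus(139))` on the failing node's platform shows which POSIX signal the number 11 maps to there, which can then be compared with the ROOT enum interpretation.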

Running the same program interactively on the same file works fine. And it seems that only some sites with remote reading are failing. So we would like to ask for help in trying to track this down.

Currently, there is no stack trace to help understand what is happening, and a simple print statement placed just after TFile::Open never produces any output.

I tried to add:

    gApplication->ExitOnException( TApplication::kDontExit );

thinking that TApplication::HandleException (https://root.cern.ch/doc/master/TApplication_8cxx_source.html#l00602) might then throw an exception instead of exiting, but this does not work.

So suggestions would be welcome. Is there a way to get a stack trace or more information on what is going on in the I/O part of this file open?
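
One possible way to get at least a raw stack trace, sketched here under several assumptions: the job runs on a glibc-based Linux node (for `backtrace` from `<execinfo.h>`), the failure really is a SIGSEGV or SIGBUS, and nothing replaces the handler afterwards. ROOT installs its own signal handlers, so whether this handler is actually invoked inside a ROOT job is something to verify. The names `dumpBacktrace`, `crashHandler` and `installCrashHandler` are hypothetical.

```cpp
#include <csignal>
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (glibc extension)
#include <unistd.h>    // STDERR_FILENO

// Hypothetical helper: write the current call stack to file descriptor fd.
// Returns the number of frames captured.
int dumpBacktrace(int fd) {
    constexpr int kMaxFrames = 64;
    void* frames[kMaxFrames];
    const int n = backtrace(frames, kMaxFrames);
    backtrace_symbols_fd(frames, n, fd);  // writes directly to fd, no malloc
    return n;
}

// Signal handler: print the stack, then restore the default action and
// re-raise so the process still terminates with the original signal.
static void crashHandler(int sig) {
    dumpBacktrace(STDERR_FILENO);
    std::signal(sig, SIG_DFL);
    std::raise(sig);
}

// Hypothetical helper: install the handler for SIGSEGV and SIGBUS.
// Returns false if either installation fails.
bool installCrashHandler() {
    // Call backtrace once up front so any lazy library loading it needs
    // happens here rather than inside the signal handler.
    void* warmup[1];
    backtrace(warmup, 1);
    return std::signal(SIGSEGV, crashHandler) != SIG_ERR &&
           std::signal(SIGBUS, crashHandler) != SIG_ERR;
}
```

The idea would be to call `installCrashHandler()` shortly before `TFile::Open`, so that any frames written to stderr end up in the grid job log. Symbol names are only readable if the binary exports them (for example, when linked with `-rdynamic`).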

I don't know how to add in watchers for people in ATLAS, or a mailing list. But I did find @krasznaa.

                          thanks much, RD
