"tmad test crashes for some iconfig (channel/iconfig mapping issues and SIGFPE erroneous arithmetic operation)"
Hi @oliviermattelaer this is a follow up to the discussions in #826 and PR #853.
I prefer to open this as a clean issue and investigate it independently of SUSY, or in any case independently of the zero cross section issue #826.
In the discussions around your patch #853 I realised that we risk having a MAJOR problem not only for BSM but also for SM, namely: all of my 'tmad' tests only test iconfig=1. These were ok so far (in some cases maybe by luck), but they may fail for a different iconfig (i.e. if we put a number different from 1 in the input_app.txt piped to madevent).
Indeed I found a crash on the first test I executed, ggttgg with iconfig=104.
./tmad/madX.sh -ggttgg -iconfig 104
...
On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
Working directory (run): /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg
*** (1) EXECUTE MADEVENT_FORTRAN (create results.dat) ***
[OPENMPTH] omp_get_max_threads/nproc = 1/4
[NGOODHEL] ngoodhel/ncomb = 64/64
[XSECTION] VECSIZE_USED = 8192
[XSECTION] MultiChannel = TRUE
[XSECTION] Configuration = 104
[XSECTION] ChannelId = 112
[XSECTION] Cross section = 0.4632 [0.46320556621222242] fbridge_mode=0
[UNWEIGHT] Wrote 11 events (found 187 events)
[COUNTERS] PROGRAM TOTAL : 4.4430s
[COUNTERS] Fortran Overhead ( 0 ) : 0.2478s
[COUNTERS] Fortran MEs ( 1 ) : 4.1953s for 8192 events => throughput is 1.95E+03 events/s
*** (1) EXECUTE MADEVENT_FORTRAN x1 (create events.lhe) ***
[OPENMPTH] omp_get_max_threads/nproc = 1/4
[NGOODHEL] ngoodhel/ncomb = 64/64
[XSECTION] VECSIZE_USED = 8192
[XSECTION] MultiChannel = TRUE
[XSECTION] Configuration = 104
[XSECTION] ChannelId = 112
[XSECTION] Cross section = 0.4632 [0.46320556621222242] fbridge_mode=0
[UNWEIGHT] Wrote 11 events (found 168 events)
[COUNTERS] PROGRAM TOTAL : 4.4488s
[COUNTERS] Fortran Overhead ( 0 ) : 0.2487s
[COUNTERS] Fortran MEs ( 1 ) : 4.2002s for 8192 events => throughput is 1.95E+03 events/s
*** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7effbd423860 in ???
#1 0x7effbd422a05 in ???
#2 0x7effbd054def in ???
#3 0x44b5ff in ???
#4 0x4087df in ???
#5 0x409848 in ???
#6 0x40bb83 in ???
#7 0x40d1a9 in ???
#8 0x45c804 in ???
#9 0x434269 in ???
#10 0x40371e in ???
#11 0x7effbd03feaf in ???
#12 0x7effbd03ff5f in ???
#13 0x403844 in ???
#14 0xffffffffffffffff in ???
./tmad/madX.sh: line 387: 780951 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttgg_x1_cudacpp > /tmp/avalassi/output_ggttgg_x1_cudacpp' failed
This uses a slightly modified script, I will put it in a PR.
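As a rough illustration of how more than one iconfig could be covered, here is a minimal Python sketch that loops over a few iconfig values and calls the script with the same options as the command above (`-ggttgg -iconfig 104`). The helper names `build_command` and `scan_iconfigs` are hypothetical, and the sketch assumes that the script returns a non-zero exit code when a run fails, which I have not checked.

```python
import subprocess


def build_command(process, iconfig, script="./tmad/madX.sh"):
    """Build the argv for one tmad run, e.g. ./tmad/madX.sh -ggttgg -iconfig 104."""
    return [script, f"-{process}", "-iconfig", str(iconfig)]


def scan_iconfigs(process, iconfigs, runner=subprocess.run):
    """Run one test per iconfig and return the iconfig values whose run failed.

    Assumption: a failed run (such as the SIGFPE crash above) is reported
    through a non-zero exit code of the script.
    """
    failed = []
    for iconfig in iconfigs:
        result = runner(build_command(process, iconfig))
        if result.returncode != 0:
            failed.append(iconfig)
    return failed


# Example call (needs the madgraph4gpu checkout; iconfig values are only examples):
# print("failed iconfig:", scan_iconfigs("ggttgg", [1, 104]))
```

The `runner` argument is only there so that the loop can be exercised without the real script, for instance with a stub that returns a fake exit code.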
I guess that the solution goes through what you proposed in #852 and the additional modifications you and I discussed there.
(Note: the 'tlau' tests that I proposed in July last year just before my absence were supposed to test exactly this (see #711), i.e. test all possible iconfig at the same time in a user-like environment, for all processes, but within a short, manageable time. I continue to think that the possibility to run shorter generate_events tests is necessary for better testing. There was disagreement last year, I hope we can come back to this and agree on it).