@jsquyres
Member

@bwbarrett Note that I updated R for all the .so versions. See the commit message.

Refs #5990.

@jsquyres jsquyres added this to the v3.1.4 milestone Feb 20, 2019
@jsquyres jsquyres requested a review from bwbarrett February 20, 2019 20:27
@jsquyres jsquyres force-pushed the pr/v3.1.x/prepare-for-3.1.4-release branch from 0d4a7bf to f73acda Compare February 23, 2019 13:27
@jsquyres jsquyres force-pushed the pr/v3.1.x/prepare-for-3.1.4-release branch from f73acda to a4fe574 Compare February 26, 2019 17:37
@jsquyres
Member Author

@bwbarrett The NEWS in this PR assumes we get a fix for #6436.

jsquyres added 4 commits March 1, 2019 11:00
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Note that I bumped the R version for all of the libraries because we
updated atomics macros in OPAL, which basically affects everything.
It might be a little overkill to update all the R values, but it's not
harmful.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@jsquyres jsquyres force-pushed the pr/v3.1.x/prepare-for-3.1.4-release branch from a4fe574 to 4e34820 Compare March 1, 2019 19:20
@jsquyres
Member Author

jsquyres commented Mar 1, 2019

@artpol84 @hoopoepg The overlap test is failing on the Mellanox CI.

Output from failing `overlap` test in Mellanox CI
00:22:40 + taskset -c 10,11 timeout -s SIGSEGV 17m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 4 -bind-to none -mca orte_tmpdir_base /tmp/tmp.j10EKNyIZs --report-state-on-timeout --get-stack-traces --timeout 900 -mca coll '^hcoll' -mca btl_openib_if_include mlx4_0:1 -x UCX_NET_DEVICES=mlx4_0:1 -mca btl_openib_allow_ib true -mca btl self -mca pml yalla /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/thread_tests/thread-tests-1.1/overlap
00:22:40 [1551478960.676925] [jenkins03:25039:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.98
00:22:40 [1551478960.681230] [jenkins03:25045:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.98
00:22:40 [1551478960.687256] [jenkins03:25040:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.98
00:22:40 [1551478960.690290] [jenkins03:25048:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.92
00:22:41 Time per iteration on each process (ms)
00:22:41 Time 	 Compute time 	 Comm time
00:22:46 [jenkins03:25045:0:25067] address not mapped to object
00:22:46 [jenkins03:25040:0:25065] address not mapped to object
00:22:46 [jenkins03:25048:0:25068] address not mapped to object
00:22:46 [jenkins03:25039:0:25066] address not mapped to object
00:22:46 
==== backtrace ====
00:22:46  0 0x000000000000b976 _dl_relocate_object()  :0
00:22:46  1 0x0000000000013b3c dl_open_worker()  dl-open.c:0
00:22:46  2 0x000000000000f1b4 _dl_catch_error()  :0
00:22:46  3 0x00000000000131ab _dl_open()  :0
00:22:46  4 0x0000000000130a02 do_dlopen()  dl-libc.c:0
00:22:46  5 0x000000000000f1b4 _dl_catch_error()  :0
00:22:46  6 0x0000000000130ac2 __GI___libc_dlopen_mode()  :0
00:22:46  7 0x000000000000f803 pthread_cancel_init()  :0
00:22:46  8 0x000000000000f9cc _Unwind_ForcedUnwind()  :0
00:22:46  9 0x000000000000dd60 __GI___pthread_unwind()  :0
00:22:46 10 0x0000000000008dd5 __pthread_exit()  :0
00:22:46 11 0x0000000000400d49 threadfunc()  ???:0
00:22:46 12 0x0000000000007dc5 start_thread()  pthread_create.c:0
00:22:46 13 0x00000000000f61cd __clone()  ???:0
00:22:46 ===================
00:22:46 
==== backtrace ====
00:22:46  0 0x000000000000b976 _dl_relocate_object()  :0
00:22:46  1 0x0000000000013b3c dl_open_worker()  dl-open.c:0
00:22:46  2 0x000000000000f1b4 _dl_catch_error()  :0
00:22:46  3 0x00000000000131ab _dl_open()  :0
00:22:46  4 0x0000000000130a02 do_dlopen()  dl-libc.c:0
00:22:46  5 0x000000000000f1b4 _dl_catch_error()  :0
00:22:46  6 0x0000000000130ac2 __GI___libc_dlopen_mode()  :0
00:22:46  7 0x000000000000f803 pthread_cancel_init()  :0
00:22:46  8 0x000000000000f9cc _Unwind_ForcedUnwind()  :0
00:22:46  9 0x000000000000dd60 __GI___pthread_unwind()  :0
00:22:46 10 0x0000000000008dd5 __pthread_exit()  :0
00:22:46 11 0x0000000000400d49 threadfunc()  ???:0
00:22:46 12 0x0000000000007dc5 start_thread()  pthread_create.c:0
00:22:46 13 0x00000000000f61cd __clone()  ???:0
00:22:46 ===================
00:22:46 
==== backtrace ====
00:22:46  0 0x000000000000b976 _dl_relocate_object()  :0
00:22:46  1 0x0000000000013b3c dl_open_worker()  dl-open.c:0
00:22:46  2 0x000000000000f1b4 _dl_catch_error()  :0
00:22:46  3 0x00000000000131ab _dl_open()  :0
00:22:46  4 0x0000000000130a02 do_dlopen()  dl-libc.c:0
00:22:46  5 0x000000000000f1b4 _dl_catch_error()  :0
00:22:46  6 0x0000000000130ac2 __GI___libc_dlopen_mode()  :0
00:22:46  7 0x000000000000f803 pthread_cancel_init()  :0
00:22:46  8 0x000000000000f9cc _Unwind_ForcedUnwind()  :0
00:22:46  9 0x000000000000dd60 __GI___pthread_unwind()  :0
00:22:46 10 0x0000000000008dd5 __pthread_exit()  :0
00:22:46 11 0x0000000000400d49 threadfunc()  ???:0
00:22:46 12 0x0000000000007dc5 start_thread()  pthread_create.c:0
00:22:46 13 0x00000000000f61cd __clone()  ???:0
00:22:46 ===================
00:22:46 
==== backtrace ====
00:22:46  0 0x000000000000b976 _dl_relocate_object()  :0
00:22:46  1 0x0000000000013b3c dl_open_worker()  dl-open.c:0
00:22:46  2 0x000000000000f1b4 _dl_catch_error()  :0
00:22:46  3 0x00000000000131ab _dl_open()  :0
00:22:46  4 0x0000000000130a02 do_dlopen()  dl-libc.c:0
00:22:46  5 0x000000000000f1b4 _dl_catch_error()  :0
00:22:46  6 0x0000000000130ac2 __GI___libc_dlopen_mode()  :0
00:22:46  7 0x000000000000f803 pthread_cancel_init()  :0
00:22:46  8 0x000000000000f9cc _Unwind_ForcedUnwind()  :0
00:22:46  9 0x000000000000dd60 __GI___pthread_unwind()  :0
00:22:46 10 0x0000000000008dd5 __pthread_exit()  :0
00:22:46 11 0x0000000000400d49 threadfunc()  ???:0
00:22:46 12 0x0000000000007dc5 start_thread()  pthread_create.c:0
00:22:46 13 0x00000000000f61cd __clone()  ???:0
00:22:46 ===================
00:22:47 --------------------------------------------------------------------------
00:22:47 Primary job  terminated normally, but 1 process returned
00:22:47 a non-zero exit code. Per user-direction, the job has been aborted.
00:22:47 --------------------------------------------------------------------------
00:22:47 --------------------------------------------------------------------------
00:22:47 mpirun noticed that process rank 2 with PID 0 on node jenkins03 exited on signal 11 (Segmentation fault).
00:22:47 --------------------------------------------------------------------------

It's segv'ing, which suggests it's timing out (the CI runs the test under `timeout -s SIGSEGV 17m`).

The problem is clearly not coming from this PR, but my question is: do we have a problem on the v3.1.x branch? Or is this some kind of local false failure?

@jsquyres
Member Author

jsquyres commented Mar 5, 2019

bot:mellanox:retest

@hoopoepg
Contributor

hoopoepg commented Mar 5, 2019

@yosefe looks like a similar issue to openucx/ucx#3303 - it crashes on Jenkins

@jsquyres jsquyres mentioned this pull request Mar 5, 2019
@jsquyres
Member Author

jsquyres commented Mar 6, 2019

Per mail to the core mailing list, the Mellanox CI is having persistent problems right now, and it may take a few days to fix. This is literally just a README / NEWS / VERSION change, so I'm going to merge.

@jsquyres jsquyres merged commit e6f7f87 into open-mpi:v3.1.x Mar 6, 2019
@jsquyres jsquyres deleted the pr/v3.1.x/prepare-for-3.1.4-release branch March 6, 2019 22:01