Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base.julia_cmd(): correctly forward the --sysimage-native-code=no flag if it is provided #42185

Merged
merged 1 commit into from
Sep 9, 2021

Conversation

DilumAluthge
Copy link
Member

@DilumAluthge DilumAluthge commented Sep 9, 2021

Closes #42173
Ref #41238 (comment)

This is a replacement for #42173.

@DilumAluthge DilumAluthge added ci Continuous integration cmd Relates to calling of external programs labels Sep 9, 2021
@DilumAluthge DilumAluthge added the merge me PR is reviewed. Merge when all tests are passing label Sep 9, 2021
@DilumAluthge
Copy link
Member Author

@vtjnash @staticfloat We don't need to backport this, right? Since the motivation here is to fix the code coverage, and we only run code coverage on master.

Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, it was even already documented

@DilumAluthge
Copy link
Member Author

Or do we need to backport this because it is a bug fix? Since it was in the docstring.

@DilumAluthge DilumAluthge added the bugfix This change fixes an existing bug label Sep 9, 2021
@DilumAluthge DilumAluthge merged commit f6b38a6 into master Sep 9, 2021
@DilumAluthge DilumAluthge deleted the dpa/julia_cmd-sysimage-native-code branch September 9, 2021 22:29
@DilumAluthge
Copy link
Member Author

I'm going to mark this for backport as a bugfix. If that is not correct, please remove the backport labels.

@DilumAluthge DilumAluthge added backport 1.6 Change should be backported to release-1.6 backport 1.7 and removed merge me PR is reviewed. Merge when all tests are passing backport 1.6 Change should be backported to release-1.6 backport 1.7 labels Sep 9, 2021
@DilumAluthge
Copy link
Member Author

DilumAluthge commented Sep 9, 2021

@staticfloat @vtjnash This PR seems to have broken the coverage job. See e.g. https://buildkite.com/julialang/julia-master-scheduled/builds/225

Run Julia tests in parallel with code coverage enabled
┌ Info: 
└   Sys.CPU_THREADS = 16
┌ Info: 
│   tests = "all"
└   ncores = 16
      From worker 15:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 15:	exiting.
Worker 15 terminated.
Unhandled Task ERROR: Version read failed. Connection closed by peer.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] process_hdr(s::TCPSocket, validate_cookie::Bool)
   @ Distributed /cache/build/amdci5-2/julialang/julia-master-scheduled/usr/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:267
 [3] message_handler_loop(r_stream::TCPSocket, w_stream::TCPSocket, incoming::Bool)
   @ Distributed /cache/build/amdci5-2/julialang/julia-master-scheduled/usr/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:151
 [4] process_tcp_streams(r_stream::TCPSocket, w_stream::TCPSocket, incoming::Bool)
   @ Distributed /cache/build/amdci5-2/julialang/julia-master-scheduled/usr/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:126
 [5] (::Distributed.var"#99#100"{TCPSocket, TCPSocket, Bool})()
   @ Distributed ./task.jl:411
      From worker 9:	ErrorException("Process(1) - Invalid connection credentials sent by remote.")CapturedException(ErrorException("Process(1) - Invalid connection credentials sent by remote."), Any[(error(s::String) at error.jl:33, 1), (process_hdr(s::Sockets.TCPSocket, validate_cookie::Bool) at process_messages.jl:257, 1), (message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool) at process_messages.jl:151, 1), (process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool) at process_messages.jl:126, 1), ((::Distributed.var"#99#100"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})() at task.jl:411, 1)])
      From worker 9:	Process(1) - Unknown remote, closing connection.
Worker 9 terminated.
Unhandled Task ERROR: Version read failed. Connection closed by peer.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] process_hdr(s::TCPSocket, validate_cookie::Bool)
   @ Distributed /cache/build/amdci5-2/julialang/julia-master-scheduled/usr/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:267
 [3] message_handler_loop(r_stream::TCPSocket, w_stream::TCPSocket, incoming::Bool)
   @ Distributed /cache/build/amdci5-2/julialang/julia-master-scheduled/usr/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:151
 [4] process_tcp_streams(r_stream::TCPSocket, w_stream::TCPSocket, incoming::Bool)
   @ Distributed /cache/build/amdci5-2/julialang/julia-master-scheduled/usr/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:126
 [5] (::Distributed.var"#99#100"{TCPSocket, TCPSocket, Bool})()
   @ Distributed ./task.jl:411
      From worker 9:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 9:	exiting.

@DilumAluthge
Copy link
Member Author

I'm removing the backport labels until we can figure out why this broke the code coverage Buildkite job.

@vtjnash
Copy link
Member

vtjnash commented Sep 10, 2021

looks like buildkite needs to increase the timeout threshold for starting workers, as we launch a lot of them, and these are probably noisy machines

@vtjnash
Copy link
Member

vtjnash commented Sep 10, 2021

Note that it is a 32-core machine, and it looks like buildkite is mis-configured to force the launching of 16*16 (or 256) threads

@DilumAluthge
Copy link
Member Author

looks like buildkite needs to increase the timeout threshold for starting workers, as we launch a lot of them, and these are probably noisy machines

We could increase the timeout. That being said, before this PR was merged, the code coverage Buildkite job had no trouble starting all of its workers within the default timeout. See e.g. the following successful jobs:

Is it possible that this PR specifically would increase the time it takes a worker to start up?

@DilumAluthge
Copy link
Member Author

To increase the timeout, I'm assuming I just need to set the JULIA_WORKER_TIMEOUT environment variable to something greater than 60?

@vtjnash
Copy link
Member

vtjnash commented Sep 10, 2021

Yes, but also, we need to fix that configuration bug too

@DilumAluthge
Copy link
Member Author

Yes, but also, we need to fix that configuration bug too

Ah, the 16 workers * 16 threads per worker?

For that one, I think we can just set JULIA_NUM_THREADS=1 in the code coverage Buildkite job, right?

@DilumAluthge DilumAluthge added backport 1.6 Change should be backported to release-1.6 backport 1.7 labels Sep 10, 2021
@DilumAluthge
Copy link
Member Author

Increasing the worker timeout seems to have resolved the issue with the code coverage job, so I'm adding the backport labels back.

KristofferC pushed a commit that referenced this pull request Sep 11, 2021
… flag if it is provided (#42185)

(cherry picked from commit f6b38a6)
KristofferC pushed a commit that referenced this pull request Sep 15, 2021
… flag if it is provided (#42185)

(cherry picked from commit f6b38a6)
KristofferC pushed a commit that referenced this pull request Nov 11, 2021
… flag if it is provided (#42185)

(cherry picked from commit f6b38a6)
@KristofferC KristofferC removed the backport 1.6 Change should be backported to release-1.6 label Nov 13, 2021
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
staticfloat pushed a commit that referenced this pull request Dec 23, 2022
… flag if it is provided (#42185)

(cherry picked from commit f6b38a6)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix This change fixes an existing bug ci Continuous integration cmd Relates to calling of external programs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants