Skip to content

OWLS-76974: Add more debugging/help when node manager fails to start #2393

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 3, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 35 additions & 4 deletions operator/src/main/resources/scripts/startNodeManager.sh
Original file line number Diff line number Diff line change
Expand Up @@ -337,13 +337,44 @@ while [ 1 -eq 1 ]; do
break
fi
if [ $((SECONDS - $start_secs)) -ge $max_wait_secs ]; then
trace INFO "Trying to put a node manager thread dump in '$nodemgr_out_file'."
kill -3 `jps -l | grep weblogic.NodeManager | awk '{ print $1 }'`
pid=$(jps | grep NodeManager | awk '{ print $1 }')
if [ -z $pid ]; then
trace INFO "Node manager process id not found. Cannot create thread dump."
else
trace INFO "Node manager process id is '$pid'."
trace INFO "Trying to put a node manager thread dump in '$nodemgr_out_file'."
kill -3 $pid
if [ -x "$(command -v $JAVA_HOME/bin/jcmd)" ]; then
trace INFO "Node manager thread dump:"
$JAVA_HOME/bin/jcmd $pid Thread.print
fi
fi
trace INFO "Entropy: "
cat /proc/sys/kernel/random/entropy_avail
trace INFO "Contents of node manager log '$nodemgr_log_file':"
cat ${nodemgr_log_file}
trace INFO "Contents of node manager out '$nodemgr_out_file':"
cat ${nodemgr_out_file}
trace SEVERE "Node manager failed to start within $max_wait_secs seconds."
cat ${NODEMGR_OUT_FILE}

trace SEVERE $(cat << EOF
The node manager failed to start within $max_wait_secs seconds.
To increase this timeout, define the NODE_MANAGER_MAX_WAIT
environment variable in your domain resource, and set it higher
than $max_wait_secs. To diagnose the problem, see the above INFO
messages for node manager log contents, stdout contents, pid,
thread dump, and entropy. If the log and stdout contents are
sparse and reveal no errors, then the node manager may be stalled
while generating entropy -- especially if entropy is below 500.
If entropy is the problem, then for testing purposes you can
temporarily work around this problem by specifying
'-Djava.security.egd=file:/dev/./urandom' in a USER_MEM_ARGS
environment variable defined via your domain resource, but
for production purposes the problem should be solved by following
the guidance in
'https://docs.oracle.com/en/middleware/fusion-middleware/weblogic-server/12.2.1.4/nodem/starting_nodemgr.html#GUID-53961E3A-D8E1-4556-B78A-9A56B676D57E'
(search for keyword 'rngd').
EOF
)
exit 1
fi
done
Expand Down