- cleanup.Jenkinsfile: Jenkinsfile with Declarative Pipeline Multiline sh that cleanups old builds. All the Stages are now visually monitored. It is triggered every saturday night and ends with jenkins restart. These Multi-line bash commands make easier to read Jenkins Projects.
- daily_restart.Jenkinsfile: A script that automatically triggers a daily restart of Jenkins due to performance issues (Jenkins is a Java application). Jenkins with Declarative Pipeline multiline sh that restarts Jenkins every night except on Saturday nights (when cleanup.Jenkinsfile is triggered).
- confluence6-docker-build.Jenkinsfile: Declarative Jenkinsfile for building and uploading a docker image to Openshift-DEV, Dockerhub and Openshift-PROD (Stages are disabled via Conditional Build Steps). Tip: A Docker Plugin for Jenkins can easily replace this Jenkinsfile. This script can be found here.
This cleanup.Jenkinsfile script is supposed to be run in jenkins master nodes, where all the legacy build artifacts are achived ($JENKINS_HOME/jobs/YourProjectName/builds).
Authentication is preferably with an -auth option, which takes a username:apitoken argument. Get your API token from [jenkins_url]/me/configure
The corresponding q-number's password can be used instead of the mentioned "apitoken". The former one is LDAP based, while apitoken is unique to each Jenkins Master node.
Username and password credentials (or username and apitoken) need to be saved in each jenkins master server in $JENKINS_HOME/.jenkins-cli with the following format: "username:password" or "username:apitoken".
The chosen username requires Overall/Administer permission in order to restart Jenkins. Otherwise you will obtain the following error:
ERROR: q-username is missing the Overall/Administer permission
- Jenkins job -> New Item -> "job name" Pipeline -> Pipeline SCM: git , Repository: "this repo", Script Path: cleanup.Jenkinsfile
- Jenkins job -> New Item -> "job name" Pipeline -> Pipeline SCM: git , Repository: "this repo", Script Path: daily_restart.Jenkinsfile
- Trigger a first manual build of the created job to load configuration from Jenkinsfile
- Jenkins Master nodes: jenkins master nodes archive all the legacy build artifacts in $JENKINS_HOME/jobs/YourProjectName/builds.
- Jenkins Slave nodes: jenkins slaves keep the latest build in their corresponding Workspace in /home/cloud/jenkins_slave/workspace/YourJobName.
Once old build folders have been deleted, jenkins' index of builds needs to be updated by reloading its configuration (fast) or by restarting jenkins (slow). This "reload-configuration" of Jenkins is finally not run in Jenkinsfile cleanup pipeline as we are proceeding with an automated jenkins restart in its last stage. This restart needs to be done every night and it makes sense to be included in this cleanup job.
- [jenkins_url]/safeRestart. – This will restart Jenkins after the current builds have completed.
- [jenkins_url]/restart – This will force a restart. Builds will not wait to complete.
daily_restart.Jenkinsfile and cleanup.Jenkinsfile apply a hard restart of Jenkins Master when run in Main Jenkins Master Enterprise. A safe-restart was initially setup for all Jenkins Master nodes but it could easily last for 2-3 hours in Main Jenkins (sometimes requiring the Jenkins Admin to kill the remaining zombie jobs).
The current configuration is the following:
- Main Jenkins Master: restart
- Remaining Jenkins Master nodes: safeRestart
IFS env var is modified within this script in order to avoid errors when find command tries to navigate through a Project's folder with spaces in its name. By setting this up we make sure that all the existing folders related to Jenkins Projects are treated correctly and no errors arise when this script is run.
Item | Parameter | Details |
---|---|---|
IFS env var | IFS=$(echo -en "\n\b") | IFS lines discard any space and meta char in a folder's name |
find command | -xdev | Don't descend directories on other filesystems |
find command | -mtime n | File's data was last modified n*24 hours ago |
find command | -delete | removes a file , not valid with directories |
find command | -mindepth 1 -maxdepth 1 -type d | no recursive search for directories inside the current directory |
find command | -print0 | ASCII NUL character to separate the entries in the file list that it produces (Unusual characters in filenames) |
xargs command | -0 | assume that arguments are separated with ASCII NUL instead of white space |
df command | cut -d" " -f2- | first column output is removed as device name is too long to be displayed |
xargs command | -t | Used in cleanup-script.sh to show a list of old build folders to be deleted. Used in Jenkinsfile to get traces of deleted data. |
This chapter is necessary to understand how to deal with a trouble Jenkins Enterprise POD (unresponsive and/or with more restarts than the ones triggered by this daily restart).
The below matrix table aims to help with solutions to well known incidents:
Error | Solution or Workaround |
---|---|
Jenkins Enterprise is unresponsive or extremely slow | Force a POD Restart by A) Scaling the POD to 0 and then to 1 from Openshift GUI, or by B) killing the Java Process of Jenkins: kill 7 |
Restart of Jenkins Enterprise and its POD is taking too long > 1 hour | Deploy Jenkins Microservice: A) "oc rollout latest jenkins", or B) clicking on 'Deploy' button in Openshift GUI |
- Exit codes seen when using oc get pods or kubectl get pods are unix shell exit codes (echo $?).
- Google "bash error code 137", "bash error code 143", etc.
- Signal 15 is a SIGTERM (see "kill -l" for a complete list). It's the way most programs are gracefully terminated, and is relatively normal behaviour.This indicates system has delivered a SIGTERM to the processes. This is usually at the request of some other process (via kill()) but could also be sent by your process to itself (using raise()). This signal requests an orderly shutdown of process or system itself. The real question is "Who/what is sending the SIGTERM?"
Exit Code Number | Meaning | Example | Comments |
---|---|---|---|
128 +n | Fatal error signal "n" | kill -n $PPID of script | $? returns 128 + n |
137 (128 + 9) | Fatal error signal "9" (SIGKILL) | kill -9 $PPID of script | $? returns 137 (128 + 9) |
143 (128 + 15) | Fatal error signal "15" (SIGTERM) | kill -15 $PPID of script | $? returns 143 (128 + 15) |
user@jumphost:~> kill -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL
5) SIGTRAP 6) SIGABRT 7) SIGBUS 8) SIGFPE
9) SIGKILL 10) SIGUSR1 11) SIGSEGV 12) SIGUSR2
13) SIGPIPE 14) SIGALRM 15) SIGTERM 16) SIGSTKFLT
17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG 24) SIGXCPU
25) SIGXFSZ 26) SIGVTALRM 27) SIGPROF 28) SIGWINCH
29) SIGIO 30) SIGPWR 31) SIGSYS 34) SIGRTMIN
35) SIGRTMIN+1 36) SIGRTMIN+2 37) SIGRTMIN+3 38) SIGRTMIN+4
39) SIGRTMIN+5 40) SIGRTMIN+6 41) SIGRTMIN+7 42) SIGRTMIN+8
43) SIGRTMIN+9 44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12
47) SIGRTMIN+13 48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14
51) SIGRTMAX-13 52) SIGRTMAX-12 53) SIGRTMAX-11 54) SIGRTMAX-10
55) SIGRTMAX-9 56) SIGRTMAX-8 57) SIGRTMAX-7 58) SIGRTMAX-6
59) SIGRTMAX-5 60) SIGRTMAX-4 61) SIGRTMAX-3 62) SIGRTMAX-2
63) SIGRTMAX-1 64) SIGRTMAX
- No errors are seen when Jenkins Microservice has been deployed again via OpenShift GUI (deploy button) or via "oc rollout latest jenkins_id". Deployment number will be increased by 1 (like deployment #20, or "revision number 20" in 'oc rollout history'). Restart Count is 0:
Container jenkins State: Running since May 10, 2019 11:24:11 AM Ready: true Restart Count: 0
- Container with Out of Memory Kill. OutOfMemory error in Jenkins ("OOM" error code). Container/POD is using too much memory and is killed automatically (new restart triggered)
- Java Process killed manually with a "kill -9 [PID]" (as explained above)
- When scaling down number of pods, the following error is reported in the OpenShift console, even though the pod is stopped correctly:
Root Cause: This issue is related to the exit codes returned from the JVM when it receives signals such as SIGINT, SIGTERM, etc. The JVM default is to return an exit code of 128+signal-id. E.g for SIGTERM one would see an exit code of 143 (128+15). This causes Red Hat OpenShift Container Platform to display a warning message on pod scale down stating that that the container did not stop cleanly (when in actual fact, it did).
The container spring-boot did not stop cleanly when terminated (exit code 143).
- This error also arises when Jenkins is restarted:
- A manual or automated restart of Jenkins also implies a new Container/POD restart.
- Container restartCount is increased
- A Jenkins Restart is triggered automatically by Jenkinsfile-daily-restart found in this repo (Jenkins cli).
- A Jenkins Restart can also be accomplished from the Jenkins GUI: This is not available when Jenkins is overloaded and unresponsive.
- Simple workaround to force a Jenkins Restart: In the following example we force a Container/POD restart by killing Jenkins Java process with a "kill PID" (SIGTERM signal, instead of SIGKILL signal with "kill -9"):
user@jumpserver:~/> oc get pods | grep ^jenkins-2 jenkins-20-1d4vm 1/1 Running 0 45m user@jumpserver:~/> oc rsh jenkins-20-1d4vm sh-4.2$ ps ux USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 1019710+ 1 0.0 0.0 4324 560 ? Ss 11:24 0:00 /bin/tini -vv 1019710+ 7 109 0.5 61154992 5864592 ? Sl 11:24 49:43 java -Djava.aw 1019710+ 938 0.0 0.0 13520 2076 ? Ss+ 11:27 0:00 /bin/sh 1019710+ 3037 0.0 0.0 13520 2000 ? Ss+ 11:31 0:00 /bin/sh 1019710+ 7723 0.0 0.0 13516 2064 ? Ss+ 11:44 0:00 /bin/sh 1019710+ 16464 2.0 0.0 13516 1936 ? Ss 12:09 0:00 /bin/sh 1019710+ 16474 0.0 0.0 49060 1828 ? R+ 12:09 0:00 ps ux sh-4.2$ kill 7
-
oc get pods
-
oc get pods -o wide
-
oc describe pod jenkins-as-a-service-73-07xwx
-
oc describe svc jaas-test
-
oc get svc
-
oc set env pods --all --list
-
ps -eo pid,lstart,cmd | fold -w 130 (checking Jenkins Java process with arguments and start time)
user@jumpserver:~> oc rsh jenkins-19-0n5jm sh-4.2$ ps -eo pid,lstart,cmd | fold -w 130 PID STARTED CMD 1 Sat Apr 28 00:54:42 2019 /usr/bin/dumb-init -- /usr/libexec/s2i/run 7 Sat Apr 28 00:54:42 2019 java -XX:+UseParallelGC -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 -XX:Adap tiveSizePolicyWeight=90 -XX:MaxMetaspaceSize=100m -Duser.home=/var/lib/jenkins -XX:MaxMetaspaceSize=512m -Djavamelody.application- name=JENKINS -Dfile.encoding=UTF8 -jar /usr/lib/jenkins/jenkins.war --prefix=/myjenkins 45744 Sun Apr 29 15:12:43 2018 /bin/sh 47707 Wed May 9 16:51:16 2018 /bin/sh 54481 Wed May 9 18:20:04 2018 /bin/sh 54656 Wed May 9 18:22:41 2018 ps -eo pid,lstart,cmd 54657 Wed May 9 18:22:41 2018 fold -w 130
-
oc get pod jenkins-19-0n5jm
-
POD Status: oc get pod jenkins-19-0n5jm -o yaml
- In the following example, LastState:
- startedAt 2019-05-02T05:40:30Z (05:40:30 UTC -> 07:40:30 CEST)
- finishedAt 2019-05-03T04:38:08Z (04:38:08 UTC -> 06:38:08 CEST)
- CurrentState startedAt: ...
containerStatuses: - containerID: docker://9ad7f3a23d6328405090a380ed01305a9e21fefca91a49021db9b3848f208d09 image: myip:5000/prod/jenkins imageID: docker-pullable://myip:5000/delivery-prod/jenkins@sha256:8f5d31f51cef106df8e68ed432e190f9f6bf0e9307d3f35490584cc4db808c0d lastState: terminated: containerID: docker://de0a32ddab499573fe7cd79745751b6eb2d1e3d84e4c0f529929985ed5c9947a exitCode: 137 finishedAt: 2019-05-03T04:38:08Z reason: Error startedAt: 2019-05-02T05:40:30Z name: jenkins ready: true restartCount: 9 state: ...
- In the following example, LastState:
-
POD Status: oc get pod jenkins-20-1d4vm -o yaml
- In the following example, LastState:
- startedAt: 2019-05-14T03:01:40Z (3:01:40 UTC -> 5:01:40 CEST , daily restart)
- finishedAt: 2019-05-14T12:06:03Z (12:06:03 UTC -> 14:06:03 CEST)
- CurrentState:
- startedAt: 2019-05-14T12:06:09Z (12:06:09 UTC -> 14:06:09 CEST)
containerStatuses: - containerID: docker://ed4e7b3c2e427d0f5e3940bafe21737173a00df3f09e6853f53b85a47e023c30 image: myip:5000/delivery-prod/jenkins imageID: docker-pullable://myip:5000/delivery-prod/jenkins@sha256:8f5d31f51cef106df8e68ed432e190f9f6bf0e9307d3f35490584cc4db808c0d lastState: terminated: containerID: docker://92f7db3fbb442d1609abdad498051355266ef0072e97f226e2f9374932671044 exitCode: 137 finishedAt: 2019-05-14T12:06:03Z reason: Error startedAt: 2019-05-14T03:01:40Z name: jenkins ready: true restartCount: 5 state: running: startedAt: 2019-05-14T12:06:09Z hostIP: myip phase: Running podIP: ip-addr startTime: 2019-05-10T09:24:04Z
- In the following example, LastState:
-
oc get dc
user@jumphost:~/> oc get dc NAME REVISION DESIRED CURRENT TRIGGERED BY cjoc 11 1 1 jenkins 19 1 1 jenkins-1 27 0 0 jenkins-2 20 1 1 jenkins-3 1 0 0 jenkins-4 0 1 0 jenkins-5 9 1 1 jenkins-6 14 1 1 jenkins-7 9 1 1 nexus 7 1 1 sonar6 4 1 1 sonar6-postgresql 1 1 1 config,image(postgresql:9.5)
-
oc deploy --latest jenkins (deprecated, use "oc rollout jenkins" instead)
user@jumphost:~/> oc deploy --latest jenkins Flag --latest has been deprecated, use 'oc rollout latest' instead Started deployment #20 Use 'oc logs -f dc/jenkins' to track its progress. user@jumphost:~/>
-
oc rollout latest jenkins: This is like clicking on "deploy" button on Openshift GUI when we want to deploy a POD. PROCEED WITH THIS IN CASE A RESTART OF JENKINS IS STUCK FOR MORE THAN 1 HOUR. Deployment number will be increased by 1 (like deployment #20, or "revision number 20" in 'oc rollout history').
-
oc logs -f dc/jenkins
-
oc rollout history dc/jenkins
user@jumphost:~/> oc rollout history dc/jenkins deploymentconfigs "jenkins" REVISION STATUS CAUSE 1 Complete manual change 2 Complete manual change 4 Complete manual change 5 Complete manual change 6 Complete manual change 7 Complete manual change 8 Complete manual change 9 Failed The deployment was cancelled by the user 10 Complete manual change 11 Failed manual change 12 Failed The deployment was cancelled by the user 13 Failed The deployment was cancelled by the user 14 Failed The deployment was cancelled by the user 15 Failed manual change 16 Failed The deployment was cancelled by the user 17 Failed manual change 18 Failed manual change 19 Complete manual change 20 Running manual change
-
Force the termination of POD:
$ oc delete pod jenkins-10-6095v --force=true --grace-period=0 warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely. pod "jenkins-10-6095v" deleted
-
oc describe quota
-
oc describe quota quota-48gi -n delivery-prod ("oc describe quota" in default delivery-prod Project)
user@jumphost:~/> oc describe quota Name: quota-48gi Namespace: delivery-prod Resource Used Hard -------- ---- ---- requests.cpu 22750m 30 requests.memory 36467264Ki 48Gi
-
oc status
-
oc status -v (warnings can be seen)
-
oc get all
-
oc logs -f po/jenkins-19-0n5jm | less -S
-
oc scale --replicas=0 dc jenkins (Change the number of pods in a deployment)
-
oc scale --replicas=3 dc jenkins (Change the number of pods in a deployment)
-
oc rollout history po/jenkins-19-0n5jm (viewing a deployment)
-
oc rollout history dc/jenkins --revision=19
-
oc get builds
-
oc edit: Edit a resource on the server
-
oc new-build: Create a new build configuration
-
oc start-build: Start a new build
-
oc cancel-build: Cancel running, pending, or new builds
-
oc import-image: Imports images from a Docker registry
-
oc tag: Tag existing images into image streams
-
oc debug jenkins-52-167c9 (Launch a new instance of a pod for debugging)
-
oc explain: Documentation of resources
-
oc explain deploy (Deployment enables declarative updates for Pods and ReplicaSets)
-
oc exec: Execute a command in a container
-
oc cp: Copy files and directories to and from containers.
- We can see the following output in OpenShift GUI after a jenkins restart (which implies a POD restart, with this POD's "restartCount" increased by 1):
Status: Running Deployment: jenkins, #19 IP: ipaddr.ip Node: nodeaddr.net (ipaddr)Restart Policy: Always Container jenkins State: Running since May 4, 2019 6:37:11 AM Last State Terminated at May 4, 2019 6:37:01 AM with exit code 143 (Error) Ready: false Restart Count: 10 Debug in Terminal
It can be easily automated via a jenkins job that triggers the following shell script:
-
mv /home/cloud/jenkins_slave/workspace /home/cloud/jenkins_slave/workspace.old && rm -Rf /home/cloud/jenkins_slave/workspace.old & mv /global/apps/cdbuilt/build_server/maven/<myproject>/local-repo-maven-3 /global/apps/cdbuilt/build_server/maven/<myproject>/local-repo-maven-3.old && rm -Rf /global/apps/cdbuilt/build_server/maven/<myproject>/local-repo-maven-3.old &
- Alternative: Workspace cleanup plugin: https://plugins.jenkins.io/ws-cleanup
- Jenkins.io: Jenkins CLI Authentication
- Stackoverflow: How to restart jenkins manually
- Cloudbees Support Guideline: Cleanup and disk space management
- GNU.org: Deleting files with find
- Dzone.com: Declarative Pipeline Refcard
- Cloudbees: Declarative Pipeline Quick Reference
- Kubernetes Docs
- Kubernetes.io: Pod Lifecycle
- Openshift: Application memory sizing
- Openshift: Openshift Container Platform 3.9 CLI Reference
- Openshift: Basic CLI Operations - Object types
- Youtube: Introduction to OpenShift Enterprise CLI
- redhat.com Video: OpenShift Enterprise 3.2 - Creating your first application from the CLI
- Openshift: Exit Code 137
- redhat.com: Openshift. Diagnosing an OOM Kill
- redhat.com: Scaling down pods with a running Java process results in a warning message (exit code 143)
- redhat.com: Bash exit codes with special meanings
- redhat.com: What is 'Signal 15' ?
- redhat.com: How to find out who sent the 'signal-15'
- Stackoverflow: How to debug container images using openshift
- Stackoverflow: How to restart pod in OpenShift?
- Reddit.com: jenkinsci