System unavailable: trss.adoptopenjdk.net #1679

Closed
sxa opened this issue Nov 12, 2020 · 21 comments

Comments

@sxa
Member

sxa commented Nov 12, 2020

  • Please describe the issue:
    System is unresponsive. There was an issue with the TRSS server yesterday, but it was fixed on the machine. Today the system was completely unresponsive and no contact could be made with it (initially reported at ~3am GMT).

I have managed to restart the host on the provider (AWS), but after 20 minutes it is not responding on the SSH port (although it is pingable).

@sxa sxa added the systemdown label Nov 12, 2020
@sxa sxa added this to the November 2020 milestone Nov 12, 2020
@sxa sxa self-assigned this Nov 12, 2020
@sxa
Member Author

sxa commented Nov 12, 2020

Server now responding to the ssh port. Unfortunately the backend Node.js process appears to be repeatedly crashing and restarting so the service is not yet responsive.

@sxa
Member Author

sxa commented Nov 12, 2020

From the backend logs -

error: Forever detected script exited with code: 0
error: Script restart attempt #182
12:08:31 PM - warn: Cannot find the config file:  --configFile=/dev/mongodb/credentials/trssConf.json
12:08:32 PM - error: Exception in database query:  message=Cannot read property 'collection' of undefined, stack=TypeError: Cannot read property 'collection' of undefined
    at new TestResultsDB (/home/jenkins/openjdk-test-tools/TestResultSummaryService/Database.js:257:23)
    at EventHandler.processBuild (/home/jenkins/openjdk-test-tools/TestResultSummaryService/EventHandler.js:19:37)
    at Timeout._onTimeout (/home/jenkins/openjdk-test-tools/TestResultSummaryService/backend.js:11:13)
    at listOnTimeout (internal/timers.js:531:17)
    at processTimers (internal/timers.js:475:7)
12:08:32 PM - error: Exception in database query:  message=Cannot read property 'collection' of undefined, stack=TypeError: Cannot read property 'collection' of undefined
    at new BuildListDB (/home/jenkins/openjdk-test-tools/TestResultSummaryService/Database.js:278:23)
    at EventHandler.monitorBuild (/home/jenkins/openjdk-test-tools/TestResultSummaryService/EventHandler.js:57:37)
    at Timeout._onTimeout (/home/jenkins/openjdk-test-tools/TestResultSummaryService/backend.js:12:13)
    at listOnTimeout (internal/timers.js:531:17)
    at processTimers (internal/timers.js:475:7)
(node:3779) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'collection' of undefined
    at new AuditLogsDB (/home/jenkins/openjdk-test-tools/TestResultSummaryService/Database.js:285:23)
    at EventHandler.processBuild (/home/jenkins/openjdk-test-tools/TestResultSummaryService/EventHandler.js:42:23)
    at Timeout._onTimeout (/home/jenkins/openjdk-test-tools/TestResultSummaryService/backend.js:11:13)
    at listOnTimeout (internal/timers.js:531:17)
    at processTimers (internal/timers.js:475:7)
(node:3779) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:3779) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:3779) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'collection' of undefined
    at new AuditLogsDB (/home/jenkins/openjdk-test-tools/TestResultSummaryService/Database.js:285:23)
    at EventHandler.monitorBuild (/home/jenkins/openjdk-test-tools/TestResultSummaryService/EventHandler.js:78:23)
    at Timeout._onTimeout (/home/jenkins/openjdk-test-tools/TestResultSummaryService/backend.js:12:13)
    at listOnTimeout (internal/timers.js:531:17)
    at processTimers (internal/timers.js:475:7)
(node:3779) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)
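The trace appears consistent with the database handle never being set once the config file could not be found, so every later `.collection` call fails with the same TypeError. A minimal sketch of failing fast in that situation follows. The helpers `loadConfig` and `getCollection` are hypothetical names for illustration and are not the actual functions in Database.js.

```javascript
// Hypothetical sketch, not the actual TRSS Database.js.
const fs = require('fs');

// Load a JSON config file, throwing a descriptive error if it is missing
// instead of logging a warning and carrying on without credentials.
function loadConfig(configFile) {
  if (!fs.existsSync(configFile)) {
    throw new Error(`Config file not found: ${configFile}`);
  }
  return JSON.parse(fs.readFileSync(configFile, 'utf8'));
}

// Return a collection from a database handle, reporting a missing handle
// clearly rather than as "Cannot read property 'collection' of undefined".
function getCollection(db, name) {
  if (db === undefined || db === null) {
    throw new Error(`No database connection; cannot open collection '${name}'`);
  }
  return db.collection(name);
}
```

With checks like these, a missing config file would probably surface as one clear error at startup rather than as repeated restarts.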

@sxa
Member Author

sxa commented Nov 12, 2020

Machine became unresponsive at 02:29:28 (based on the kernel messages) with an out-of-memory situation.
I am trying to recover the system, but it looks like there may have been configuration files etc. stored in the dynamic /dev filesystem (as per the snippet in the previous comment, which references --configFile=/dev/mongodb/credentials/trssConf.json). I can see entries in the shell history for things like mkdir -p /dev/mongodb/data, which slightly worries me since it suggests that this is not a dynamically created area that will get regenerated somehow. (After reboot there is no /dev/mongodb on the machine.)

There is no information on restarting the mongodb service on https://github.com/AdoptOpenJDK/openjdk-test-tools/tree/master/TestResultSummaryService

I have tried starting mongo - it first had a problem with /dev/mongodb/log not existing, then not being owned by the correct user. After resolving that, systemctl start mongod appears to have got it working, but I'm still at a loss as to where the /dev/mongodb/credentials is supposed to come from.

Current status: mongodb, TRSSBackend and TRSSFrontend services are showing as active (running) but connections to the server (SSH or the nginx on 443) are not possible. Reason currently unknown (external firewall?) - I was lucky to have been able to get in while connections were allowed.

@sxa
Member Author

sxa commented Nov 12, 2020

@llxia Need your input on the /dev/mongodb directory and also likely some doc updates if mongodb has to be started manually separately from the front and back end services.

@sxa
Member Author

sxa commented Nov 12, 2020

Looks like all the database stuff had been stored on a ramdrive and is therefore lost and will need to be rebuilt.
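A check along the following lines could flag this kind of setup before data is placed on a volatile filesystem. It is only a sketch: it assumes a Linux host where the text of /proc/mounts is available, and the function names and the list of volatile filesystem types are illustrative choices, not part of TRSS.

```javascript
// Filesystem types whose contents do not survive a reboot (illustrative list).
const VOLATILE_TYPES = new Set(['tmpfs', 'devtmpfs', 'ramfs']);

// Parse the text of /proc/mounts into { mountPoint, fsType } entries.
// Each line has the form: device mountPoint fsType options dump pass
function parseMounts(text) {
  return text
    .split('\n')
    .map((line) => line.trim().split(/\s+/))
    .filter((fields) => fields.length >= 3)
    .map((fields) => ({ mountPoint: fields[1], fsType: fields[2] }));
}

// Find the filesystem type of the longest mount point that contains the path.
// Later entries win ties because a later mount shadows an earlier one.
function fsTypeFor(path, mounts) {
  let best = null;
  for (const m of mounts) {
    const prefix = m.mountPoint === '/' ? '/' : m.mountPoint + '/';
    if (path === m.mountPoint || path.startsWith(prefix)) {
      if (best === null || m.mountPoint.length >= best.mountPoint.length) {
        best = m;
      }
    }
  }
  return best ? best.fsType : null;
}

// True if the path lives on a filesystem that is lost on reboot.
function isVolatile(path, mountsText) {
  return VOLATILE_TYPES.has(fsTypeFor(path, parseMounts(mountsText)));
}
```

On the server itself the second argument would be `fs.readFileSync('/proc/mounts', 'utf8')`, and the first would be the MongoDB data directory or the config file location.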

@sxa
Member Author

sxa commented Nov 12, 2020

Machine now has 16GB of swap (equal to the amount of RAM) and a 160GB /data partition that we can use for the results database.

@sxa
Member Author

sxa commented Nov 13, 2020

AWS moved the IP address on the host. After it rebooted there was still a log entry with the old IP address, but it subsequently switched. PR in for inventory change.

https://trss.adoptopenjdk.net address now pointing to the new IP address

MongoDB is now running on a persistent filesystem (The new /data) so we should be back in action ...

@sxa
Member Author

sxa commented Nov 13, 2020

@llxia Can you update the documentation to cover the setup of MongoDB, how to restart it, etc.?

@llxia

llxia commented Nov 13, 2020

We use the standard commands to install MongoDB, and systemctl restart mongod to restart it. The only thing is that if there is a user/password for DB access, then TRSS needs to know it (in trssConf.json). In the above case, MongoDB started correctly, but TRSS could not find trssConf.json, so it could not connect to MongoDB.

I will update the readme.

@karianna
Contributor

Can this be closed now (nice job on the rescue BTW)?

@sxa
Member Author

sxa commented Nov 16, 2020

Can this be closed now (nice job on the rescue BTW)?

I was holding off until we have the documentation updated with details of what goes into the trssConf.json on the production server so we don't hit so many problems next time (keeping this open stops us from forgetting about it...)

@llxia

llxia commented Nov 16, 2020

The format of trssConf.json is documented in https://github.com/AdoptOpenJDK/openjdk-test-tools/tree/master/TestResultSummaryService#configure-file

If we need a backup copy of trssConf.json that is used in the production server, we can store it somewhere else. But I do not think we should put user/password in the readme.

@sxa
Member Author

sxa commented Nov 17, 2020

Absolutely agree passwords shouldn't be in there (although we can store them elsewhere), but things like the data directory that we've set for mongo should be, along with the other specifics of the production server setup such as the location of the config file. The docs just say that you should provide a --configfile option, but for the production server it's fixed to /data/db/trssConf.json in /etc/init.d/TRSSBackend and TRSSFrontend, so we should state that, as you'd never want to set it anywhere else on the production server.

@llxia

llxia commented Nov 17, 2020

The docs just say that you should provide a --configfile option, but for the production server it's fixed to /data/db/trssConf.json

This is because we have the forever services created for TRSS. During service creation, we can specify the --configfile option. The forever-service only needs to be created once, at the beginning of the machine configuration or when the options change. Maybe we should add this into the playbook?

forever-service install TRSSFrontend -e "NODE_ENV=production" -f " --workingDir /home/jenkins/openjdk-test-tools/TestResultSummaryService" --script /home/jenkins/openjdk-test-tools/TestResultSummaryService/frontend.js  -o " --configFile=/data/db/credentials/trssConf.json"
forever-service install TRSSBackend -e "NODE_ENV=production NODE_OPTIONS=--max_old_space_size=4096 " -f " --workingDir /home/jenkins/openjdk-test-tools/TestResultSummaryService " --script /home/jenkins/openjdk-test-tools/TestResultSummaryService/backend.js  -o " --configFile=/data/db/credentials/trssConf.json"
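If these commands end up being scripted, one option is to build both argument lists from a single set of values so that the working directory and config file path cannot drift apart between the two services. The following is a sketch only: the flags and paths are copied from the two commands above, while the helper name `installArgs` and the `services` array are made up for illustration.

```javascript
// Paths taken from the two forever-service commands above.
const SERVICE_DIR = '/home/jenkins/openjdk-test-tools/TestResultSummaryService';
const CONFIG_FILE = '/data/db/credentials/trssConf.json';

// Build the argument list for one "forever-service install" invocation.
// The leading spaces inside the -f and -o values mirror the original commands.
function installArgs(name, script, env) {
  return [
    'install', name,
    '-e', env,
    '-f', ` --workingDir ${SERVICE_DIR}`,
    '--script', `${SERVICE_DIR}/${script}`,
    '-o', ` --configFile=${CONFIG_FILE}`,
  ];
}

// Frontend and backend differ only in name, script and environment.
const services = [
  installArgs('TRSSFrontend', 'frontend.js', 'NODE_ENV=production'),
  installArgs(
    'TRSSBackend',
    'backend.js',
    'NODE_ENV=production NODE_OPTIONS=--max_old_space_size=4096'
  ),
];
```

Each array could then be passed to `child_process.execFileSync('forever-service', args)`, assuming forever-service is installed and on the PATH.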

@sxa
Member Author

sxa commented Nov 18, 2020

Is the information on forever mentioned anywhere else? We should definitely add those two commands into the Deployment Instructions section of the doc.

And yes, I would agree that since we have all of the code to start mongo and nginx in the TRSS playbooks we should add the backend/frontend service setup there too. @Haroon-Khel Can you look at doing this please?

@llxia

llxia commented Nov 18, 2020

Is the information on forever mentioned anywhere else? We should definitely add those two commands into the Deployment Instructions section of the doc.

The information was mentioned in the previous issue and I did a demo/recording a while back with more up to date information.

Just to be clear, the steps should be in the following order:

  1. install and start MongoDB and install forever service
  2. create user/password in MongoDB
  3. create trssConf.json with MongoDB user/password info
  4. create the forever services for TRSS

Correct me if I'm wrong, but I do not think Step 2 can be in the playbook as it contains credentials. If we want to put Step 4 in the playbook (without Step 2), then we should start with an empty trssConf.json file. An admin can then create the user/password in MongoDB and update trssConf.json manually later.
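Starting from an empty trssConf.json could be sketched as below. This assumes that "empty" means an empty JSON object, which the thread does not confirm, and `ensureEmptyConfig` is a hypothetical helper name. The point of the sketch is that the file is created only when it is missing, so a later run does not wipe credentials that an admin has added by hand.

```javascript
const fs = require('fs');
const path = require('path');

// Create an empty JSON config (assumed here to be "{}") only if none exists.
// Returns true if the file was created, false if it was already present.
function ensureEmptyConfig(configFile) {
  fs.mkdirSync(path.dirname(configFile), { recursive: true });
  try {
    // 'wx' fails with EEXIST instead of overwriting an existing file;
    // mode 0o600 keeps the file readable by its owner only.
    fs.writeFileSync(configFile, '{}\n', { flag: 'wx', mode: 0o600 });
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') {
      return false;
    }
    throw err;
  }
}
```

On the production server this might be called as `ensureEmptyConfig('/data/db/credentials/trssConf.json')`, using the path from the forever-service commands earlier in the thread.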

@aahlenst
Contributor

I do not think Step 2 can be in the playbook as it contains credentials

Ansible has various mechanisms to inject credentials into playbooks. For example, there's ansible-vault. And there are variables that can be set when invoking ansible-playbook by using -e.

Before sinking hours into updating the playbooks, please consider the best approach for #1689.

@sxa
Member Author

sxa commented Nov 18, 2020

@aahlenst To be clear my primary goal here is to ensure that what we have in production at the moment is documented along with the other setup instructions before putting time into moving it.

@Haroon-Khel Haroon-Khel modified the milestones: February 2021, March 2021 Mar 2, 2021
@Haroon-Khel Haroon-Khel modified the milestones: March 2021, April 2021 Apr 6, 2021
@Haroon-Khel Haroon-Khel modified the milestones: April 2021, May 2021 May 18, 2021
@Haroon-Khel Haroon-Khel modified the milestones: May 2021, June 2021 Jun 21, 2021
@sxa
Member Author

sxa commented Jul 5, 2021

This needs revisiting to see what state we are currently in so we can recreate the TRSS server easily if required. Keeping this in the July milestone so we can define next steps/plan.

@sxa sxa modified the milestones: June 2021, July 2021 Jul 5, 2021
@Haroon-Khel Haroon-Khel modified the milestones: July 2021, August Aug 4, 2021
@sxa sxa modified the milestones: August 2021, September 2021 Sep 23, 2021
@sxa sxa pinned this issue Sep 23, 2021
@sxa
Member Author

sxa commented Sep 23, 2021

Bumping to next month so we can try and progress this in a timely manner - potentially after discussions at the AQAvit calls

@sxa sxa modified the milestones: September 2021, October 2021 Sep 23, 2021
@sxa sxa modified the milestones: October 2021, December 2021 Dec 1, 2021
@sxa sxa modified the milestones: December 2021, 2022-01 (January) Jan 6, 2022
@sxa sxa unpinned this issue Jan 18, 2022
@sxa sxa added the doc label Mar 3, 2022
@sxa
Member Author

sxa commented Mar 3, 2022

I still feel that we need the documentation on the setup in one place instead of having to find it from multiple sources, but since there seems no interest in doing this, I'm going to close this with a link to the comment that has the attachment with the extra instructions adoptium/aqa-test-tools#9 (comment)

Related #1327 since that has been stalled in places due to lack of clarity on some parts of the setup.

@sxa sxa closed this as completed Mar 3, 2022
6 participants