ARM cluster move #611
Comments
Will we get a new picture of the cluster once it is in its new home? |
release machines are back online for now, armv6 and armv8, others are off, no ETA on proper connection yet |
So with no ETA for the return of the Raspberry Pi cluster: For the jobs that are stalled waiting for the Raspberry Pi farm, will they kick off tomorrow or whenever the farm comes back online? Or probably not and the jobs should just be canceled now? Land stuff without Raspberry Pi test results in CI? Or wait for the Raspberry Pi cluster to come back? /cc @nodejs/ctc |
EDIT: Just disabling the job worked, it is properly skipped by |
Sorry, I thought I removed node-test-commit-arm-fanned. There shouldn't be any queued jobs, if there are then I've messed up! |
Thursday the 9th is the date I've been given for finalising this internet connection. Apparently there are some technical challenges (also I think some administrative incompetence but that's to be expected when dealing with large telcos!). |
Bad news .. I've been notified there are network problems in the area (monopoly government-provided internet infrastructure, yay) and it's been deferred for another week. If it goes through then it should be up on the 16th of this month. |
We have three releases and we might get RCs out soon. Should we hold them till this setup is back up? We cannot release binaries without testing them, right? |
I have two questions:
|
The LTS releases are planned for Feb 21st, so availability on the 16th may not affect those directly. It may affect planned RCs; in that case the question would be whether the changes we wanted validated through the RC are ARM-only or can be adequately covered by use on other platforms. In terms of testing for the Current release, I wonder if the binaries could be tested manually by somebody with access to the release machine logging in and running the tests. That might take a while to run, though, since it would be on a single machine instead of fanned out like it is in the regular jobs. |
I think it's OK to release RCs without testing in the ARM cluster in this situation. Maybe explain/apologize in the release announcement. An actual release (as opposed to an RC) might be different.... |
Same, RCs/Betas should be fine. |
AAAAND we're back up online again on a new stable connection that's quite a bit faster than the old one as a bonus. Working my way through everything but I'm pretty sure I've got most things in place already so it should be working as it used to before the move. Please let me know if you encounter anything that doesn't seem right. Regarding RCs and nightlies, I think that it got screwed up after a reconnect of my temporary connection where a new dynamic IP got assigned which messed up the iptables rules on both Jenkins machines. They were working just not connecting! Ooops! |
Jobs seem to be running well! There are still 3 slaves offline and the DNS for the jump host is not updated, but this is not urgent. However, we have some tests failing:
|
Thanks to @Trott for jumping on test-dgram-address @ nodejs/node#11432, looks like that'll be addressed soon. Full green run @ https://ci.nodejs.org/job/node-test-binary-arm/6241/ I've taken three Pi's offline, suspecting corrupted filesystems or dodgy SD cards, some of the failures were because of that. I'll address them as soon as I can and bring them back online. |
Failures on test-requireio_arm-ubuntu1404-arm64_xgene-2 are interesting, e.g. https://ci.nodejs.org/job/node-test-commit-arm/7806/nodes=armv8-ubuntu1404/ and correlate with disconnection notifications that we keep on getting for just this machine and they date back pretty far (prior to the move). I was tinkering on that box last night trying to understand it but I have no idea what's going on. There's nothing special about it, in fact it's the least special of the 3 XGene machines (one runs the NFS for the Pi's and does release builds, another serves as a jump host for SSH, this one just runs test builds and nothing else!). Something about Jenkins keeps on disconnecting and reconnecting, perhaps it's a Java problem.. Anyone got ideas for debugging this? @joaocgreis, @jbergstroem? |
@rvagg It's strange that it's just that one machine. I have no solution, but perhaps you can try a different ping interval from the slave side. This is used for Windows:
|
https://ci.nodejs.org/computer/test-requireio_arm-ubuntu1404-arm64_xgene-2/builds I tweaked the job slightly after posting the above and you can see that it's mostly green since then. It now downloads slave.jar before starting, each time, under the theory that having an updated slave.jar would be good ... but tbh I don't know if that's been a problem at all. Kernel logs are still full of:
Failures are less frequent now but they still happen. I've implemented the extended ping interval thing just now so let's see if that helps at all. |
@rvagg do the exits correlate with anything interesting in the logs? |
@jbergstroem well, when I look at the actual times, it would correlate with anything that's happening on the machine:
(who knew basically it's constantly happening. Going to have to run this manually and see if I can get anything from it. |
captured a failure, not sure if this is the failure, relevant log portions after connect are here: https://gist.github.com/rvagg/8eeb20b0fe7cf289601593ebff5bb827 There's a problem with child processes not being cleaned up properly which seems to cause Jenkins grief (never seen this before elsewhere) and then when it tries to reconnect it gets the kind of error you get when a node is already connected and it keeps on looping from there, which is similar behaviour to what I'm seeing with it running under upstart. I'm trying out disabling the process tree killer as per https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller to see if that helps, perhaps this is an architecture thing (i.e. this thing is "native code"). |
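For anyone following along, here is a rough sketch of what launching the agent with the process tree killer disabled could look like. It is not the actual upstart job on the machine. The helper name `agent_command` and the secret value are hypothetical, and the `hudson.util.ProcessTree.disable` system property is the one described on the ProcessTreeKiller wiki page as far as I can tell, so check it against that page before relying on it.

```python
from typing import List


def agent_command(jenkins_url: str, node_name: str, secret: str,
                  disable_tree_killer: bool = True) -> List[str]:
    """Build the argv for starting a Jenkins agent JVM (hypothetical helper)."""
    base = jenkins_url.rstrip("/")
    cmd = ["java"]
    if disable_tree_killer:
        # Assumption: property name taken from the ProcessTreeKiller wiki page.
        cmd.append("-Dhudson.util.ProcessTree.disable=true")
    cmd += [
        "-jar", "slave.jar",
        "-jnlpUrl", f"{base}/computer/{node_name}/slave-agent.jnlp",
        "-secret", secret,
    ]
    return cmd


if __name__ == "__main__":
    # "SECRET" is a placeholder, not a real agent secret.
    argv = agent_command(
        "https://ci.nodejs.org",
        "test-requireio_arm-ubuntu1404-arm64_xgene-2",
        "SECRET",
    )
    print(" ".join(argv))
```

Building the command as a list keeps it easy to pass to `subprocess.run` without shell quoting issues, and the flag can be switched off again if it turns out not to help.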
ping @rvagg -- can this be closed? |
Sad news folks, I have to physically move the ARM cluster today and the internet connection where it's moving to isn't properly installed yet! I have a temporary connection ready but I'm paying by the GB for it so I can't hook up normal test runs.
There are some ARMv7 and ARMv8 machines in Jenkins (and ARMv7 in release) that aren't physically in the same place (i.e. they are hosted at Scaleway and miniNodes) so they won't be impacted.
So here's what I'm going to do:
I'll keep this thread updated as I make progress.
@nodejs/collaborators