
x/build: add LUCI solaris-amd64 builder #61666

Closed
rorth opened this issue Jul 31, 2023 · 46 comments
Labels: Builders, NeedsFix, new-builder
@rorth

rorth commented Jul 31, 2023

s11-i386.foss.cebitec.uni-bielefeld.de.csr.txt

Somehow I'm not able to add the new-builder label as required by the installation docs.

@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Jul 31, 2023
@gopherbot gopherbot added this to the Unreleased milestone Jul 31, 2023
@cagedmantis cagedmantis self-assigned this Aug 1, 2023
@cagedmantis cagedmantis changed the title x/build: add solaris-amd64 builder x/build: add LUCI solaris-amd64 builder Aug 1, 2023
@cagedmantis
Contributor

@rorth The instructions have been updated with a method for adding the new-builder label via gopherbot.

@cagedmantis
Contributor

cagedmantis commented Aug 1, 2023

Please generate a new certificate signing request using solaris-amd64 as the hostname. I will clarify the documentation: the hostname is the name the host will appear under in the build environment, not necessarily the machine's actual hostname.
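
For anyone following along, here is a hypothetical Go sketch of producing such a CSR, with the builder name solaris-amd64 as the common name rather than the FQDN. The key type, file names, and tooling are assumptions; the installation docs may prescribe a specific helper or openssl invocation instead.

```go
// Generate an RSA key and a CSR whose CN is the builder name, not the FQDN.
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"os"
)

func main() {
	// The private key must be kept and installed next to the certificate
	// that is eventually issued from this CSR.
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	keyOut, err := os.Create("solaris-amd64.key")
	if err != nil {
		panic(err)
	}
	pem.Encode(keyOut, &pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(key)})
	keyOut.Close()

	// The CN is the name the bot appears under in the build environment.
	tmpl := x509.CertificateRequest{Subject: pkix.Name{CommonName: "solaris-amd64"}}
	der, err := x509.CreateCertificateRequest(rand.Reader, &tmpl, key)
	if err != nil {
		panic(err)
	}
	csrOut, err := os.Create("solaris-amd64.csr")
	if err != nil {
		panic(err)
	}
	pem.Encode(csrOut, &pem.Block{Type: "CERTIFICATE REQUEST", Bytes: der})
	csrOut.Close()
}
```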

@rorth
Author

rorth commented Aug 1, 2023

I see. I'd already wondered whether the FQDN was desired here.
solaris-amd64.csr.txt

@cagedmantis cagedmantis added the NeedsFix The path to resolution is known, but the work has not been done. label Aug 1, 2023
@heschi heschi moved this to In Progress in Go Release Aug 1, 2023
@cagedmantis
Contributor

solaris-amd64-1690923319.cert.txt
I've generated the cert and registered your bot.

@rorth
Author

rorth commented Aug 2, 2023 via email

@cagedmantis
Contributor

cc/ @golang/release

@cagedmantis
Contributor

@rorth We've looked into this error. The swarming bot doesn't seem to support solaris-amd64. Thanks for doing this work; it revealed that this would be an issue. We've added the work to add support to our roadmap. I will comment on this issue once that work has started.

@dmitshur dmitshur moved this from In Progress to Planned in Go Release Aug 15, 2023
@joedian joedian moved this from Planned to In Progress in Go Release Aug 29, 2023
@heschi
Contributor

heschi commented Aug 31, 2023

Hi @rorth, I think we're now in a state where running the latest bootstrapswarm will work. Can you give it a try?

Regarding your earlier questions:

The -hostname flag to bootstrapswarm overrides the bot's hostname calculation; I believe the crash was unrelated.

There isn't any official way to control the builder's parallelism. Feel free to try setting GOMAXPROCS, especially if that's what you were doing before, and we can verify that it's getting propagated into the child processes. But that will only have a partial effect, since there's a lot of parallelism across many processes.

It is currently hardcoded as $HOME/.swarming in bootstrapswarm, yes. If there's a compelling need to change it (or make it overridable) we can look into that.
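
Regarding the GOMAXPROCS point above, here is a small hypothetical check that could be run as a child process on the builder to see whether the setting propagates; the names and output format are illustrative only.

```go
// Print the GOMAXPROCS environment setting, the effective value, and the
// CPU count so that propagation into child processes can be verified.
package main

import (
	"fmt"
	"os"
	"runtime"
)

func main() {
	fmt.Println("GOMAXPROCS (env):      ", os.Getenv("GOMAXPROCS"))
	fmt.Println("GOMAXPROCS (effective):", runtime.GOMAXPROCS(0)) // 0 queries the value without changing it
	fmt.Println("NumCPU:                ", runtime.NumCPU())
}
```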

@rorth
Author

rorth commented Sep 5, 2023 via email

@heschi
Contributor

heschi commented Sep 5, 2023

Just to eliminate any confusion, what --hostname argument did you pass?

@rorth
Author

rorth commented Sep 5, 2023 via email

@heschi
Contributor

heschi commented Sep 5, 2023

Oh, are you passing --token-file-path? I think that's an attractive nuisance at the moment -- bootstrapswarm understands it, but the actual Swarming bot doesn't, so its requests are coming in without a token. In this case I definitely agree that you should be able to override the token path, but it'll take a little while to get that implemented in the bot. In the meantime, can you use the default path of /var/lib/luci_machine_tokend/token.json just to see if there are other surprises in store?

@rorth
Author

rorth commented Sep 6, 2023 via email

@heschi
Contributor

heschi commented Sep 7, 2023

The bot seems to have died, possibly because I sent it work for the first time. Can you take a look? Thanks for your patience.

@rorth
Author

rorth commented Sep 7, 2023 via email

@heschi
Contributor

heschi commented Sep 7, 2023

No, that won't help much.

Thanks; I have a list of things to work on now and I'll let you know when it makes sense for you to try again.

  • allow token path override
  • fix OS and CPU dimensions to be solaris-amd64
  • fix cipd platform too
  • look into not requiring swarming user, but probably not feasible without a lot of work
  • allow override of swarming working dir from .swarming
  • disable reboots

@rorth
Author

rorth commented Sep 7, 2023 via email

@heschi
Contributor

heschi commented Sep 20, 2023

Hi @rorth, most of the stuff that needs doing should be done now. Can you try again?

I've added a couple of environment variables you can use:

  • Set SWARMING_ALLOW_ANY_USER to anything to disable the swarming user requirement. That said, it might be best if you leave it running under swarming for now, just to keep things simple while we're still working on it.
  • Set LUCI_MACHINE_TOKEN to the path of the LUCI token to override the default /var/lib location.
  • Set SWARMING_NEVER_REBOOT to anything to prevent it from trying to reboot the machine.
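
A rough, hypothetical sketch of how "set to anything" switches like these are usually consumed (the real bootstrapswarm/bot code may differ); any non-empty value enables them.

```go
// Illustrative only: read the three switches the way a bot might.
package main

import (
	"fmt"
	"os"
)

func main() {
	allowAnyUser := os.Getenv("SWARMING_ALLOW_ANY_USER") != "" // any non-empty value enables it
	neverReboot := os.Getenv("SWARMING_NEVER_REBOOT") != ""

	tokenPath := os.Getenv("LUCI_MACHINE_TOKEN")
	if tokenPath == "" {
		// Default location mentioned earlier in this thread.
		tokenPath = "/var/lib/luci_machine_tokend/token.json"
	}

	fmt.Println("allow any user:", allowAnyUser)
	fmt.Println("never reboot:  ", neverReboot)
	fmt.Println("token path:    ", tokenPath)
}
```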

@rorth
Author

rorth commented Sep 21, 2023

Hi @rorth, most of the stuff that needs doing should be done now. Can you try again?

Sure, thanks for working on this.

I've added a couple of environment variables you can use:

* Set `SWARMING_ALLOW_ANY_USER` to anything to disable the `swarming` user requirement. That said, it might be best if you leave it running under `swarming` for now, just to keep things simple while we're still working on it.

Will do: the swarming user exists now, anyway.

* Set `LUCI_MACHINE_TOKEN` to the path of the LUCI token to override the default /var/lib location.

* Set `SWARMING_NEVER_REBOOT` to anything to prevent it from trying to reboot the machine.

It seems your changes haven't made it to the repo yet: when I rebuild bootstrapswarm, none of those env variables are in the binary.

@heschi
Contributor

heschi commented Sep 22, 2023

We got our first green build: https://luci-milo.appspot.com/ui/p/golang/builders/ci/x_oauth2-gotip-solaris-amd64/b8769217496445479345/overview. It won't be fully up and running until I put the final touches on, but I think we're in good shape now.

@heschi
Contributor

heschi commented Sep 25, 2023

OK. Please give it one last restart.

@heschi
Contributor

heschi commented Sep 26, 2023

It looks like the builder is doing fine on x/ repos, but hangs indefinitely when testing the main repo (example: http://ci.chromium.org/b/8768937412553978545). Unfortunately result-adapter is swallowing all the test output, so we have nothing to go on from our side. @rorth, can you see anything happening on the machine?

@rorth
Author

rorth commented Sep 26, 2023

I don't find anything obvious in logs/swarming_bot.log. However, bootstrapswarm is currently running in the foreground, and on the command's stdout I see three (so far) Password: prompts. I vaguely remember that something like this also happened with the old buildbot. Such an unanswered prompt could certainly explain the build not completing...

@rorth
Author

rorth commented Sep 27, 2023

I've now converted the swarming bot to a proper Solaris SMF service. Let's see how it fares now...

@heschi
Contributor

heschi commented Sep 27, 2023

On the one hand it's not hanging any more, but on the other it's crashing hard in the same test every time with a strange runtime bug that doesn't happen on the old builders:

https://ci.chromium.org/ui/p/golang/builders/ci-workers/gotip-solaris-amd64-test_only/b8768782473709459681/overview:

=== RUN   TestLookupDotsWithRemoteSource
runtime: port_getn on fd 3 failed (errno=9)
fatal error: runtime: netpoll failed

runtime stack:
runtime.throw({0x682dcc?, 0x3?})
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/runtime/panic.go:1018 +0x5c fp=0x7fffbfffd110 sp=0x7fffbfffd0e0 pc=0x438d9c
runtime.netpoll(0xc00002c000?)
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/runtime/netpoll_solaris.go:257 +0x46f fp=0x7fffbfffddb0 sp=0x7fffbfffd110 pc=0x43538f
runtime.findRunnable()
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/runtime/proc.go:3195 +0x845 fp=0x7fffbfffded8 sp=0x7fffbfffddb0 pc=0x440dc5
runtime.schedule()
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/runtime/proc.go:3589 +0xb1 fp=0x7fffbfffdf10 sp=0x7fffbfffded8 pc=0x442051
runtime.park_m(0xc0005069c0?)
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/runtime/proc.go:3752 +0x11f fp=0x7fffbfffdf58 sp=0x7fffbfffdf10 pc=0x44255f
traceback: unexpected SPWRITE function runtime.mcall
runtime.mcall()
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/runtime/asm_amd64.s:458 +0x57 fp=0x7fffbfffdf70 sp=0x7fffbfffdf58 pc=0x46a697

Any idea what's going on there? I can ask someone from the runtime team to take a look if necessary.

@rorth
Author

rorth commented Sep 27, 2023

I have no idea, unfortunately. I tried running the tests manually as the bot user inside the build tree (w/ir/x/w/goroot/src) with

go test -c net
./net.test -test.run=TestLookupDotsWithRemoteSource

and all tests just PASS this way.

I've got two questions:

  • When/why does a test enter the PAUSE state, as observed for the failing tests?
  • Is there any better way to investigate those failures? I mean to run the exact command from the failing test, environment and all, but this will have to wait for tomorrow.

@rorth
Author

rorth commented Sep 28, 2023

It seems there's something amiss with the JSON output: when I run

go tool dist test -run net.TestLookupGoogleSRV

I get the expected


##### Test execution environment.
# GOARCH: amd64
# CPU: Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
# GOOS: solaris
# OS Version: SunOS 5.11 11.4.63.155.0 i86pc

ALL TESTS PASSED (some were excluded)

while for

go tool dist test -run net.TestLookupGoogleSRV -json

the command returns with no output and exit status 0. Very weird.

I suspect the current buildbot (like running all.bash manually) uses the classic text output, while the swarming bot relies on JSON.

@heschi
Contributor

heschi commented Oct 6, 2023

@mknyszek spotted the more interesting stack trace: it appears to be crashing inside libc. You mentioned configuring it as a Solaris service. Perhaps that's causing a problem somehow? It seems unlikely to be LUCI-related per se.

SIGSEGV: segmentation violation
PC=0x7fffbf161128 m=11 sigcode=1 addr=0x88f000
signal arrived during cgo execution

goroutine 900 [syscall]:
runtime.cgocall(0x401220, 0xc0001f9788)
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/runtime/cgocall.go:157 +0x3e fp=0xc0001f9760 sp=0xc0001f9728 pc=0x405d1e
net._C2func_res_ninit(0x88bc50)
	_cgo_gotypes.go:207 +0x55 fp=0xc0001f9788 sp=0xc0001f9760 pc=0x6023d5
net._C_res_ninit.func1(0xc0001f97d8?)
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/net/cgo_unix_cgo_resn.go:28 +0x34 fp=0xc0001f97c0 sp=0xc0001f9788 pc=0x602c94
net._C_res_ninit(0x228?)
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/net/cgo_unix_cgo_resn.go:28 +0x13 fp=0xc0001f97d8 sp=0xc0001f97c0 pc=0x602c33
net.cgoResSearch({0x67facf, 0xb}, 0xc000133814?, 0x2e?)
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/net/cgo_unix.go:324 +0x108 fp=0xc0001f9ac8 sp=0xc0001f97d8 pc=0x544828
net.resSearch.func1()
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/net/cgo_unix.go:314 +0x25 fp=0xc0001f9af8 sp=0xc0001f9ac8 pc=0x544705
net.doBlockingWithCtx[...]({0x6db2d0, 0x881440}, 0xc000213380)
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/net/cgo_unix.go:45 +0x25f fp=0xc0001f9bc8 sp=0xc0001f9af8 pc=0x60ca5f
net.resSearch({0x6db2d0, 0x881440}, {0x67facf, 0xb}, 0x5, 0x1)
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/net/cgo_unix.go:313 +0x9c fp=0xc0001f9bf8 sp=0xc0001f9bc8 pc=0x54467c
net.cgoLookupCNAME({0x6db2d0?, 0x881440?}, {0x67facf?, 0x6db2d0?})
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/net/cgo_unix.go:299 +0x45 fp=0xc0001f9d38 sp=0xc0001f9bf8 pc=0x544485
net.(*Resolver).lookupCNAME(0xc0001f9dd0?, {0x6db2d0, 0x881440}, {0x67facf, 0xb})
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/net/lookup_unix.go:92 +0xa5 fp=0xc0001f9d98 sp=0xc0001f9d38 pc=0x565d25
net.(*Resolver).LookupCNAME(0xc000133830?, {0x6db2d0?, 0x881440?}, {0x67facf, 0xb})
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/net/lookup.go:484 +0x2b fp=0xc0001f9de0 sp=0xc0001f9d98 pc=0x562a8b
net.LookupCNAME(...)
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/net/lookup.go:467
net.testDots(0xc000507520, {0x67d6f8, 0x3})
	/opt/golang/swarm/.swarming/w/ir/x/w/goroot/src/net/lookup_test.go:674 +0x12e fp=0xc0001f9ef0 sp=0xc0001f9de0 pc=0x5bc44e
net.TestLookupDotsWithRemoteSource(0xc000507520)

@rorth
Author

rorth commented Oct 9, 2023

Thanks, that led me way further: the failure can be reproduced with

$ cd src/net && go test -c .
$ ./net.test
SIGSEGV: segmentation violation
PC=0x7fffbf161128 m=11 sigcode=1 addr=0x887000
signal arrived during cgo execution

goroutine 1004 [syscall]:
runtime.cgocall(0x401220, 0xc0000eb788)
        /opt/golang/swarm/goroot/src/runtime/cgocall.go:157 +0x3e fp=0xc0000eb760 sp=0xc0000eb728 pc=0x405d1e
net._C2func_res_ninit(0x8841b0)
        _cgo_gotypes.go:206 +0x55 fp=0xc0000eb788 sp=0xc0000eb760 pc=0x5ff955
net._C_res_ninit.func1(0xc0000eb7d8?)
        /opt/golang/swarm/goroot/src/net/cgo_unix_cgo_resn.go:28 +0x34 fp=0xc0000eb7c0 sp=0xc0000eb788 pc=0x600214
net._C_res_ninit(0x228?)
        /opt/golang/swarm/goroot/src/net/cgo_unix_cgo_resn.go:28 +0x13 fp=0xc0000eb7d8 sp=0xc0000eb7c0 pc=0x6001b3
net.cgoResSearch({0x67ca24, 0xb}, 0xc000014460?, 0x2e?)

Running the test under truss shows really weird exit handling: at the end of the run (before the SEGV) there's

/17:    read(4, "FE ]8180\001\001\004\007".., 1232)     = 265
/17:    port_dissociate(5, 4, 0x00000004)               Err#2 ENOENT
/17:    close(4)                                        = 0
/17:    zone_lookup(NULL)                               = 0
/17:    zone_getattr(0, ZONE_ATTR_BRAND, 0x7FFF733FF940, 256) = 8
/17:    door_info(3, 0x7FFF733FF910)                    = 0
/17:            target=540 proc=0x7FD53AC471D0 data=0xDEADBEED
/17:            attributes=DOOR_UNREF|DOOR_NO_CANCEL|DOOR_ON_TPD
/17:            uniquifier=1108       
/17:    door_call(3, 0x7FFF733FF990)                    = 0
/17:            data_ptr=0x7FFFBEEC0000 data_size=211
/17:            desc_ptr=0x0 desc_num=0
/17:            rbuf=0x7FFFBEEC0000 rsize=25600
/17:    close(1948263178)                               Err#9 EBADF
/17:    close(1635020399)                               Err#9 EBADF
[...]

many more with absurd fds, then a large number of

/17:    close(0)                                        Err#9 EBADF

ultimately

/17:        Incurred fault #6, FLTBOUNDS  %pc = 0x7FFFBF161128
/17:          siginfo: SIGSEGV SEGV_MAPERR addr=0x00887000
/17:        Received signal #11, SIGSEGV [caught]       
/17:          siginfo: SIGSEGV SEGV_MAPERR addr=0x00887000

gdb shows

Thread 18 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 8 (LWP 8)]
0x00007fffbf161128 in res_nclose () from /lib/64/libresolv.so.2
3: x/i $pc
=> 0x7fffbf161128 <res_nclose+56>:  mov    0x0(%r13),%eax
(gdb) bt
#0  0x00007fffbf161128 in res_nclose () from /lib/64/libresolv.so.2
#1  0x00007fffbf1611a5 in res_ndestroy () from /lib/64/libresolv.so.2
#2  0x00007fffbf15f5c6 in __res_vinit () from /lib/64/libresolv.so.2
#3  0x00007fffbf15f57b in res_ninit () from /lib/64/libresolv.so.2
#4  0x000000000040124c in net(.text) ()
#5  0x000000000046c351 in runtime.asmcgocall ()
    at /opt/golang/swarm/goroot/src/runtime/asm_amd64.s:872
#6  0x00007fff75fffe58 in ?? ()
#7  0x00000000004439b2 in runtime.exitsyscallfast.func1 ()
    at /opt/golang/swarm/goroot/src/runtime/proc.go:4271
#8  0x000000000046a713 in runtime.systemstack ()
    at /opt/golang/swarm/goroot/src/runtime/asm_amd64.s:509
#9  0x0000000000200000 in ?? ()
#10 0x000000c00030e1a0 in ?? ()
#11 0x00007fff75fffee0 in ?? ()
#12 0x000000000046a605 in runtime.mstart ()
    at /opt/golang/swarm/goroot/src/runtime/asm_amd64.s:394
#13 0x00000000004019f4 in runtime/cgo(.text) ()
#14 0x00007fffbf5e3240 in ?? ()
#15 0x0000000000000000 in ?? ()

which makes me suspect that Go gets res_state / struct __res_state * wrong somehow.

@rorth
Author

rorth commented Oct 11, 2023

I've confirmed that now: struct __res_state is never initialized after state is allocated in net/cgo_unix.go:cgoResSearch. However, this is a clearly documented requirement in Solaris resolv(3RESOLV):

       State information is kept in statp and is used to control the  behavior
       of  these  functions.  Set statp to all zeros prior to making the first
       call to any of these functions.

As a result, depending on the vagaries of memory allocation, various fields in the struct end up holding random values, e.g. _vcsock, _u._ext.nscount, and _u._ext.nssocks[]. res_nclose then tries to close a potentially large number of bogus file descriptors, which all fail, and that ultimately leads to the SEGV observed.

The attached patch seems to fix this (it's a bit hard to tell, since the success or failure of the resolver tests depends very much on memory layout). net/cgo_unix_syscall.go will need a definition of _C_memset, too.

res.patch.txt
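
To make the failure mode concrete, here is a standalone cgo sketch of the initialization that resolv(3RESOLV) requires. It is separate from the attached patch and only an illustration; the link flags and the assumption that res_ninit/res_nclose are plain callable functions in <resolv.h> on Solaris (as the stack traces above suggest) are mine.

```go
// Allocate struct __res_state on the C heap the way net.cgoResSearch does,
// zero it as the man page requires, then initialize and close it.
package main

/*
#cgo LDFLAGS: -lresolv -lsocket -lnsl
#include <sys/types.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>
#include <stdlib.h>
#include <string.h>
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func main() {
	state := (*C.struct___res_state)(C.malloc(C.sizeof_struct___res_state))
	defer C.free(unsafe.Pointer(state))

	// resolv(3RESOLV): the state must be all zeros before the first call.
	// Without this memset, fields such as _vcsock and _u._ext.nssocks[]
	// hold garbage, and res_nclose later close()s random fd values,
	// which is the failure mode seen in the truss/gdb output above.
	C.memset(unsafe.Pointer(state), 0, C.sizeof_struct___res_state)

	if C.res_ninit(state) != 0 {
		fmt.Println("res_ninit failed")
		return
	}
	C.res_nclose(state)
	fmt.Println("resolver state initialized and closed cleanly")
}
```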

@heschi
Contributor

heschi commented Oct 11, 2023

Thanks for investigating. That makes sense, but this is slightly out of my area of expertise, and in any case we can't accept patches on GitHub issues. Would you mind sending a CL or PR? See https://go.dev/doc/contribute. If not, I can dig into it deeper or find someone else to take a look.

@ianlancetaylor
Contributor

I'll send a different patch. @rorth thanks for finding the problem.

@rorth
Author

rorth commented Oct 11, 2023

Great, thanks a lot for your help.

@gopherbot
Contributor

Change https://go.dev/cl/534516 mentions this issue: net: clear malloc'ed memory in cgoResSearch

gopherbot pushed a commit that referenced this issue Oct 11, 2023
For #61666

Change-Id: I7a0a849fba0abebe28804bdd6d364b154456e399
Reviewed-on: https://go-review.googlesource.com/c/go/+/534516
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
@ianlancetaylor
Contributor

The memory clearing patch is committed.

@heschi
Contributor

heschi commented Oct 12, 2023

We got a green build! Thanks all.

@heschi heschi closed this as completed Oct 12, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Go Release Oct 12, 2023
yunginnanet pushed a commit to yunginnanet/go that referenced this issue Oct 20, 2023
For golang#61666

Change-Id: I7a0a849fba0abebe28804bdd6d364b154456e399
Reviewed-on: https://go-review.googlesource.com/c/go/+/534516
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
@rorth
Author

rorth commented Nov 29, 2023

The solaris-amd64 builder has been failing with an infra failure for a day now. I have no idea what that's supposed to mean, and neither the logs on the gotip-solaris-amd64 build page nor the local logs gave me any clue what might be going on. Any suggestions?

@mknyszek
Contributor

mknyszek commented Nov 30, 2023

@rorth It initially failed due to a bad rollout on the LUCI side, but then the machine got quarantined due to too many consecutive failures (that were not related to a build). I'll figure out how to get that resolved ASAP. Thanks for flagging this!

@mknyszek
Contributor

mknyszek commented Nov 30, 2023

I'm told the machine just has to be rebooted. :( Sorry for the inconvenience. Whenever you get the chance, can you do that please? Thanks.

EDIT: Er, sorry, I misunderstood. Not the machine, just the swarming bot.

@rorth
Author

rorth commented Nov 30, 2023

Anyway: I've just rebooted the zone.

@mknyszek
Contributor

Thanks, it appears to be back online.

@rorth
Author

rorth commented Dec 1, 2023

Right, thanks for your help.
