Skip to content

cmd/compile: performance of go wasm is very poor #65440

Open
@liutao-liu

Description

@liutao-liu

Go version

go version go1.21.6 linux/arm64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='linux'
GOINSECURE=''
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPRIVATE=''
GOSUMDB='off'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOVCS=''
GOVERSION='go1.21.6'
GCCGO='gccgo'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3484327849=/tmp/go-build -gno-record-gcc-switches'

What did you do?

As shown in the following example,test.go is compiled into wasm:

test.go :

//go:noinline
func testFor() int {
	sum := 0
	for i := 0; i < 200; i++ {
		for j := 0; j < 10000000; j++ {
			sum += j
		}
	}
	return sum
}

func main() {
	startTime := time.Now()

	res := testFor()
	elapsed := time.Since(startTime)
	fmt.Println("Done. Cost: ", elapsed, res)
}

go to wasm compile command:
GOOS=wasip1 GOARCH=wasm go build -o test.wasm test.go

What did you see happen?

As you can see in the wat code, the for loop is expressed using the br_table operation.

wat code:

  (func $main.testFor (type 0) (param i32) (result i32)
    (local i32 i64 i64 i64 i64)
    global.get 0
    local.set 1
    loop  ;; label = @1
      block  ;; label = @2
        block  ;; label = @3
          block  ;; label = @4
            block  ;; label = @5
              block  ;; label = @6
                block  ;; label = @7
                  block  ;; label = @8
                    block  ;; label = @9
                      local.get 0
                      br_table 0 (;@9;) 1 (;@8;) 2 (;@7;) 3 (;@6;) 4 (;@5;) 5 (;@4;) 6 (;@3;) 7 (;@2;)
                    end
                    i64.const 0
                    local.set 2
                    i64.const 0
                    local.set 3
                    i32.const 2
                    local.set 0
                    br 7 (;@1;)
                  end
                  local.get 2
                  i64.const 1
                  i64.add
                  local.set 2
                end
                local.get 2
                i64.const 200
                i64.lt_s
                i32.eqz
                if  ;; label = @7
                  i32.const 4
                  local.set 0
                  br 6 (;@1;)
                end
              end
              i64.const 0
              local.set 4
              i32.const 6
              local.set 0
              br 4 (;@1;)
            end
            local.get 1
            i64.extend_i32_u
            i64.const 8
            i64.add
            i32.wrap_i64
            local.get 3
            i64.store
            local.get 1
            i32.const 8
            i32.add
            local.tee 1
            global.set 0
            i32.const 0
            return
          end
          local.get 4
          i64.const 1
          i64.add
          local.set 5
          local.get 3
          local.get 4
          i64.add
          local.set 3
          local.get 5
          local.set 4
        end
        local.get 4
        i64.const 10000000
        i64.lt_s
        if  ;; label = @3
          i32.const 5
          local.set 0
          br 2 (;@1;)
        end
        i32.const 1
        local.set 0
        br 1 (;@1;)
      end
    end
    unreachable)

What did you expect to see?

When the aot compiler of wasm runtime performs backend optimization, it is difficult to identify the br_table as a for loop. So, during the backend optimization, this for loop was not optimized.

I tested several of the most popular wasm runtimes, such as wasmtime, wamr, and wasmer, and I found that the performance of go wasm after aot compilation was very poor, and the runtime performance was only 20% of go native in the best case.
Why use so many br_table operation instead of loop operation? Will the performance of go wasm be optimized in the future?

Also, I found that the wat code of the go runtime functions uses br_table a lot,the craziest function has 417 hops in the br_table.

image

Metadata

Metadata

Assignees

No one assigned

    Labels

    NeedsInvestigationSomeone must examine and confirm this is a valid issue and not a duplicate of an existing one.Performancearch-wasmWebAssembly issuescompiler/runtimeIssues related to the Go compiler and/or runtime.

    Type

    No type

    Projects

    Status

    Todo

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions