[arm64] seqdec llTable access is extremely slow when profiled #466

@lizthegrey

For some reason, the array access to llTable at seqdec.go:283 profiles as significantly slower on arm64 than the mlTable and ofTable accesses, taking 5-10x longer than any other similar access. Disassembly is attached below. Note that the time is misattributed to :285, but after reordering the code I was able to get the problem to show up on the llTable access itself. This suggests that the processor reports the llTable load as completed in the program counter, but then stalls before it can execute the :285 instructions.

     260ms      320ms     540cd0: NOOP                                    ;github.com/klauspost/compress/zstd.(*sequenceDecs).decode seqdec.go:281
         .          .     540cd4: MOVD 344(RSP), R8                       ;bitreader.go:60
         .          .     540cd8: MOVD 32(R8), R9
     110ms      110ms     540cdc: MOVBU 40(R8), R10                       ;github.com/klauspost/compress/zstd.(*bitReader).getBitsFast bitreader.go:61
      10ms       10ms     540ce0: ADD R10, R5, R11
         .          .     540ce4: MOVB R11, 40(R8)                        ;bitreader.go:61
      10ms       10ms     540ce8: ADD R6, R4, R11                         ;github.com/klauspost/compress/zstd.(*sequenceDecs).decode seqdec.go:282
         .          .     540cec: AND $31, R11, R11                       ;seqdec.go:282
      20ms       20ms     540cf0: AND $63, R10, R10                       ;github.com/klauspost/compress/zstd.(*sequenceDecs).decode bitreader.go:60
         .          .     540cf4: LSL R10, R9, R9                         ;bitreader.go:60
         .          .     540cf8: ORR $64, ZR, R10
      30ms       30ms     540cfc: SUB R5, R10, R5                         ;github.com/klauspost/compress/zstd.(*bitReader).getBitsFast bitreader.go:60
         .          .     540d00: AND $63, R5, R5                         ;bitreader.go:60
         .          .     540d04: LSR R5, R9, R5
         .          .     540d08: MOVW R5, R5                             ;bitreader.go:62
         .          .     540d0c: ASR R11, R5, R9                         ;seqdec.go:282
      60ms       60ms     540d10: ADD R3>>16, R9, R9                      ;github.com/klauspost/compress/zstd.(*sequenceDecs).decode seqdec.go:283
         .          .     540d14: UBFIZ $3, R9, $9, R9                    ;seqdec.go:283
         .          .     540d18: MOVD 256(RSP), R11
      20ms       20ms     540d1c: MOVD (R11)(R9), R3                      ;github.com/klauspost/compress/zstd.(*sequenceDecs).decode seqdec.go:283
     1.10s      1.10s     540d20: AND $31, R6, R9                         ;github.com/klauspost/compress/zstd.(*sequenceDecs).decode seqdec.go:285
         .          .     540d24: ASR R9, R5, R9                          ;seqdec.go:285
         .          .     540d28: UBFIZ $1, R4, $4, R12                   ;seqdec.go:286
