Description
While the implementation of, e.g., `...ByteArray.evalPrimOp` is rather direct and elegant at the moment
```haskell
type Name = BS8.ByteString

evalPrimOp :: PrimOpEval -> Name -> [Atom] -> Type -> Maybe TyCon -> M [Atom]
evalPrimOp fallback op args t tc = case (op, args) of

  -- newByteArray# :: Int# -> State# s -> (# State# s, MutableByteArray# s #)
  ( "newByteArray#", [IntV size, _s]) -> do
    baIdx <- newByteArray size 1 False
    pure [MutableByteArray baIdx]

  -- newPinnedByteArray# :: Int# -> State# s -> (# State# s, MutableByteArray# s #)
  ( "newPinnedByteArray#", [IntV size, _s]) -> do
    baIdx <- newByteArray size 1 True
    pure [MutableByteArray baIdx]
```
It yields rather slow code. That is because GHC doesn't do the "obvious" trie-based match optimisation on string patterns, which means that in Core we end up with a linear chain of comparisons:

```haskell
if op == "newByteArray#" then ...
else if op == "newPinnedByteArray#" then ...
...
```
I took a profile (on nofib's `bernoulli`, if that matters) and `...ByteArray.evalPrimOp` takes about 10% of time and allocation. Here is an excerpt of the profile:
```
COST CENTRE             MODULE                            SRC                                                      %time %alloc

evalPrimOp              Stg.Interpreter.PrimOp.ByteArray  lib/Stg/Interpreter/PrimOp/ByteArray.hs:(71,1)-(884,28)    9.4   10.9
evalPrimOp              Stg.Interpreter.PrimOp.Addr       lib/Stg/Interpreter/PrimOp/Addr.hs:(23,1)-(357,28)         6.5    8.2
evalStackContinuation.\ Stg.Interpreter                   lib/Stg/Interpreter.hs:(355,74)-(394,35)                   4.4    5.8
lookupEnvSO             Stg.Interpreter.Base              lib/Stg/Interpreter/Base.hs:(630,1)-(648,21)               4.2    2.2
builtinStgEval          Stg.Interpreter                   lib/Stg/Interpreter.hs:(154,1)-(201,103)                   3.2    2.9
evalPrimOp              Stg.Interpreter.PrimOp.Float      lib/Stg/Interpreter/PrimOp/Float.hs:(17,1)-(120,28)        2.8    2.9
evalExpr.\              Stg.Interpreter                   lib/Stg/Interpreter.hs:(497,45)-(502,23)                   2.8    3.4
evalPrimOp              Stg.Interpreter.PrimOp.Double     lib/Stg/Interpreter/PrimOp/Double.hs:(17,1)-(127,28)       2.7    3.0
evalExpr                Stg.Interpreter                   lib/Stg/Interpreter.hs:(423,1)-(533,93)                    2.5    0.6
compare                 Stg.Syntax                        lib/Stg/Syntax.hs:(30,3)-(32,12)                           2.4    0.0
evalExpr.\              Stg.Interpreter                   lib/Stg/Interpreter.hs:(504,37)-(510,27)                   2.4    1.2
setInsert               Stg.Interpreter.Base              lib/Stg/Interpreter/Base.hs:(776,1)-(778,36)               2.1    0.0
evalPrimOp              Stg.Interpreter.PrimOp.Int        lib/Stg/Interpreter/PrimOp/Int.hs:(25,1)-(149,28)          2.0    2.1
evalPrimOp              Stg.Interpreter.PrimOp.SmallArray lib/Stg/Interpreter/PrimOp/SmallArray.hs:(27,1)-(157,28)   1.9    1.9
```
I think a bit of focus on optimising `evalPrimOp` may well speed up the interpreter by 50%. One way to do so would perhaps be to use a `HashMap` or trie to do the lookup.
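For illustration, a map-based dispatch could look roughly like this. This is a minimal sketch, not the interpreter's actual code: `Handler`, `primOpTable`, and the simplified `[Int]`-argument signature are hypothetical stand-ins (the real handlers run in the interpreter's `M` monad), and `Data.Map.Strict` from `containers` stands in for a `HashMap` or trie:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Char8 as BS8
import qualified Data.Map.Strict as Map

type Name = BS8.ByteString

-- Hypothetical handler type; the real interpreter works on [Atom] in M.
type Handler = [Int] -> Maybe Int

-- The table is a top-level CAF, so it is built only once; each dispatch
-- then costs O(log n) ByteString comparisons instead of a linear chain.
primOpTable :: Map.Map Name Handler
primOpTable = Map.fromList
  [ ("newByteArray#",       \args -> case args of { [size] -> Just size; _ -> Nothing })
  , ("newPinnedByteArray#", \args -> case args of { [size] -> Just size; _ -> Nothing })
  ]

evalPrimOp :: Name -> [Int] -> Maybe Int
evalPrimOp op args = Map.lookup op primOpTable >>= \h -> h args

main :: IO ()
main = do
  print (evalPrimOp "newByteArray#" [42])  -- Just 42
  print (evalPrimOp "unknownOp#" [])       -- Nothing
```

A trie (e.g. from `bytestring-trie`) would give lookups proportional to the key length rather than the number of primops, which may matter given how many primop names share long common prefixes.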
Why optimise anyway? Because at the moment a single run of NoFib's `bernoulli` benchmark takes about half an hour, while the compiled program takes just 0.1s. That's quite a deal breaker for an exhaustive benchmark run of all 11* benchmarks.