Description
While the implementation of, e.g., `...ByteArray.evalPrimOp` is rather direct and elegant at the moment
```haskell
type Name = BS8.ByteString

evalPrimOp :: PrimOpEval -> Name -> [Atom] -> Type -> Maybe TyCon -> M [Atom]
evalPrimOp fallback op args t tc = case (op, args) of

  -- newByteArray# :: Int# -> State# s -> (# State# s, MutableByteArray# s #)
  ( "newByteArray#", [IntV size, _s]) -> do
    baIdx <- newByteArray size 1 False
    pure [MutableByteArray baIdx]

  -- newPinnedByteArray# :: Int# -> State# s -> (# State# s, MutableByteArray# s #)
  ( "newPinnedByteArray#", [IntV size, _s]) -> do
    baIdx <- newByteArray size 1 True
    pure [MutableByteArray baIdx]
```
It yields rather slow code. That is because GHC doesn't do the "obvious" trie-based match optimisation on string patterns, which means that in Core we end up with a linear chain of comparisons:

```haskell
if op == "newByteArray#" then ...
else if op == "newPinnedByteArray#" then ...
...
```
I took a profile (on nofib's `bernoulli`, if that matters) and `...ByteArray.evalPrimOp` takes about 10% of time and allocation. Here is an excerpt of the profile:
```
COST CENTRE             MODULE                            SRC                                                      %time %alloc

evalPrimOp              Stg.Interpreter.PrimOp.ByteArray  lib/Stg/Interpreter/PrimOp/ByteArray.hs:(71,1)-(884,28)    9.4   10.9
evalPrimOp              Stg.Interpreter.PrimOp.Addr       lib/Stg/Interpreter/PrimOp/Addr.hs:(23,1)-(357,28)         6.5    8.2
evalStackContinuation.\ Stg.Interpreter                   lib/Stg/Interpreter.hs:(355,74)-(394,35)                   4.4    5.8
lookupEnvSO             Stg.Interpreter.Base              lib/Stg/Interpreter/Base.hs:(630,1)-(648,21)               4.2    2.2
builtinStgEval          Stg.Interpreter                   lib/Stg/Interpreter.hs:(154,1)-(201,103)                   3.2    2.9
evalPrimOp              Stg.Interpreter.PrimOp.Float      lib/Stg/Interpreter/PrimOp/Float.hs:(17,1)-(120,28)        2.8    2.9
evalExpr.\              Stg.Interpreter                   lib/Stg/Interpreter.hs:(497,45)-(502,23)                   2.8    3.4
evalPrimOp              Stg.Interpreter.PrimOp.Double     lib/Stg/Interpreter/PrimOp/Double.hs:(17,1)-(127,28)       2.7    3.0
evalExpr                Stg.Interpreter                   lib/Stg/Interpreter.hs:(423,1)-(533,93)                    2.5    0.6
compare                 Stg.Syntax                        lib/Stg/Syntax.hs:(30,3)-(32,12)                           2.4    0.0
evalExpr.\              Stg.Interpreter                   lib/Stg/Interpreter.hs:(504,37)-(510,27)                   2.4    1.2
setInsert               Stg.Interpreter.Base              lib/Stg/Interpreter/Base.hs:(776,1)-(778,36)               2.1    0.0
evalPrimOp              Stg.Interpreter.PrimOp.Int        lib/Stg/Interpreter/PrimOp/Int.hs:(25,1)-(149,28)          2.0    2.1
evalPrimOp              Stg.Interpreter.PrimOp.SmallArray lib/Stg/Interpreter/PrimOp/SmallArray.hs:(27,1)-(157,28)   1.9    1.9
```
I think a bit of focus on optimising `evalPrimOp` may well speed up the interpreter by 50%. One way to do so would perhaps be to use a `HashMap` or trie to do the lookup.
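For illustration, a map-based dispatch could look roughly like this. This is a minimal sketch, not the interpreter's actual code: `Handler`, `primOpTable`, and the simplified `[Int]`-argument signature are hypothetical stand-ins (the real handlers run in the interpreter's `M` monad), and `Data.Map.Strict` from `containers` stands in for a `HashMap` or trie:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Char8 as BS8
import qualified Data.Map.Strict as Map

type Name = BS8.ByteString

-- Hypothetical handler type; the real interpreter works on [Atom] in M.
type Handler = [Int] -> Maybe Int

-- The table is a top-level CAF, so it is built only once; each dispatch
-- then costs O(log n) ByteString comparisons instead of a linear chain.
primOpTable :: Map.Map Name Handler
primOpTable = Map.fromList
  [ ("newByteArray#",       \args -> case args of { [size] -> Just size; _ -> Nothing })
  , ("newPinnedByteArray#", \args -> case args of { [size] -> Just size; _ -> Nothing })
  ]

evalPrimOp :: Name -> [Int] -> Maybe Int
evalPrimOp op args = Map.lookup op primOpTable >>= \h -> h args

main :: IO ()
main = do
  print (evalPrimOp "newByteArray#" [42])  -- Just 42
  print (evalPrimOp "unknownOp#" [])       -- Nothing
```

A trie (e.g. from `bytestring-trie`) would give lookups proportional to the key length rather than the number of primops, which may matter given how many primop names share long common prefixes.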
Why optimise anyway? Because at the moment a single run of NoFib's `bernoulli` benchmark takes about half an hour, while the compiled program takes just 0.1s. That's quite a deal breaker for an exhaustive benchmark run of all 11* benchmarks.