Add flatten function #112
base: main
Conversation
Excellent. Make sure to benchmark it with Rust as well and compare against SymEngine. What I recommend doing is to construct a series of representative benchmarks. I don't know if evaluating to a single number is actually useful; I think these large expressions are used to generate code (either C or via LLVM) that then evaluates the expression at runtime for many different "x". So then you want it as small as possible. |
Exactly, but it is the graph size that matters, not the tree size, because the graph size is the size of the LLVM code. You can do this already with protosym using llvmlite:

In [1]: from protosym.simplecas import x, sin, lambdify
In [2]: e = sin(sin(sin(sin(sin(sin(x))))))
In [3]: %time ed = e.diff(x, 10, flatten=False)
CPU times: user 68.9 ms, sys: 3.11 ms, total: 72 ms
Wall time: 70.6 ms
In [4]: %time f = lambdify([x], e) # uses llvmlite
CPU times: user 38.8 ms, sys: 13.9 ms, total: 52.8 ms
Wall time: 51 ms
In [5]: f(1)
Out[5]: 0.5540163907556296
In [6]: %timeit f(1.0)
420 ns ± 0.893 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Here is the same with SymEngine and its LLVM backend:

In [1]: from symengine import sin, symbols, lambdify
In [2]: x = symbols('x')
In [3]: e = sin(sin(sin(sin(sin(sin(x))))))
In [5]: %time ed = e.diff(x, 10)
CPU times: user 233 ms, sys: 12.8 ms, total: 246 ms
Wall time: 245 ms
In [7]: %time f = lambdify([x], [e])
CPU times: user 7.27 ms, sys: 3.07 ms, total: 10.3 ms
Wall time: 10.3 ms
In [8]: f(1)
Out[8]: array(0.55401639)
In [9]: %timeit f(1.0)
7.35 µs ± 57.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

So the protosym LLVM code times at 0.5 microseconds vs the SymEngine one at 7 microseconds. In this case the protosym generated expression is about 15x faster. |
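As a sanity check, the value both backends return for f(1) is just sin applied six times to 1.0, which can be reproduced with the standard library alone:

```python
import math

# Apply sin six times to 1.0, matching sin(sin(sin(sin(sin(sin(x)))))) at x = 1
x = 1.0
for _ in range(6):
    x = math.sin(x)

print(x)  # ≈ 0.5540163907556296, agreeing with Out[5] above
```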
That's a good benchmark. It runs too fast though; I don't know if SymEngine has some overhead from passing in NumPy arrays, but this kind of benchmark is very good, we probably just need to make the expression longer. What is a graph and what is a tree? I don't understand the difference. |
I don't think that is the issue. It is just that the LLVM code is different. I don't know how to see the code generated by SymEngine but you can see the protosym generated code easily:

In [1]: from protosym.simplecas import x, sin, cos
In [2]: e = sin(sin(x)).diff(x, 2)
In [3]: e
Out[3]: ((((-1*sin(sin(x)))*cos(x))*cos(x)) + (cos(sin(x))*(-1*sin(x))))
In [4]: print(e.to_llvm_ir([x]))
; ModuleID = "mod1"
target triple = "unknown-unknown-unknown"
target datalayout = ""
declare double @llvm.pow.f64(double %Val1, double %Val2)
declare double @llvm.sin.f64(double %Val)
declare double @llvm.cos.f64(double %Val)
define double @"jit_func1"(double %"x")
{
%".0" = call double @llvm.sin.f64(double %"x")
%".1" = call double @llvm.sin.f64(double %".0")
%".2" = fmul double 0xbff0000000000000, %".1"
%".3" = call double @llvm.cos.f64(double %"x")
%".4" = fmul double %".2", %".3"
%".5" = fmul double %".4", %".3"
%".6" = call double @llvm.cos.f64(double %".0")
%".7" = fmul double 0xbff0000000000000, %".0"
%".8" = fmul double %".6", %".7"
%".9" = fadd double %".5", %".8"
ret double %".9"
}

Notice that although the expression tree repeats subexpressions like sin(x) and cos(x), the generated LLVM code computes each subexpression only once. The code is generated from protosym's forward graph representation:

In [5]: from protosym.core.tree import forward_graph
In [6]: g = forward_graph(e.rep)
In [7]: for a in g.atoms: print(a)
-1
x
In [8]: for func, indices in g.operations: print(func, indices)
sin [1]
sin [2]
Mul [0, 3]
cos [1]
Mul [4, 5]
Mul [6, 5]
cos [2]
Mul [0, 2]
Mul [8, 9]
Add [7, 10]

You can see that this is the same as the LLVM code (which is generated directly from this forward graph). Almost all operations in protosym are based on this forward graph representation. The main thing that the Rust backend does to speed things up is just to compute this graph more quickly. With the graph even numerical evaluation in Python can be faster. We can query the size of the graph representation vs the tree representation:

In [9]: e.count_ops_graph()
Out[9]: 12
In [10]: e.count_ops_tree()
Out[10]: 20

The difference between these two numbers is the CSE saving. Evaluating the expression recursively from the top would require 20 operations (counting each atom as an operation as well). Evaluating the forward graph requires 12 operations including the atoms. Protosym computes symbolic derivatives using this forward graph representation and the forward accumulation algorithm that is sometimes used for computing numeric derivatives in automatic differentiation.

The graph is always strictly smaller than the corresponding tree and it can be a lot smaller for large expressions that have many repeating subexpressions. We should always want to use this graph representation for numerical evaluation, differentiation, substitution etc. In protosym the only situation where the tree representation is used is when building expressions.

The automatic simplifications that SymPy and SymEngine perform often look like they make the expression simpler if you focus only on the tree form, but what we should also be focusing on is what they do to the graph form, because that is what we should be using for many computationally intensive operations. |
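Forward accumulation can be illustrated numerically with dual numbers. This is a generic sketch of the idea, not the protosym implementation: every value carries its derivative along with it, so derivatives flow through the graph in one forward pass.

```python
import math

class Dual:
    """A value together with its derivative (forward-mode AD)."""
    def __init__(self, val, eps=0.0):
        self.val = val   # function value
        self.eps = eps   # derivative value

def dsin(d):
    # sin rule: (sin u)' = cos(u) * u'
    return Dual(math.sin(d.val), math.cos(d.val) * d.eps)

# Derivative of sin(sin(x)) at x = 1.0, seeded with dx/dx = 1
r = dsin(dsin(Dual(1.0, 1.0)))

expected = math.cos(1.0) * math.cos(math.sin(1.0))  # chain rule by hand
print(r.eps, expected)
```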
I see, so a graph is a tree after CSE (and possibly some other simplifications) is performed on it? SymEngine also has CSE, so you can try it and compare. We should be benchmarking it on some expression from PyDy. I think nested trig functions like sin(sin(x)) are usually quite rare, aren't they? |
They are, but you can think of these functions as placeholders for any operation. The only functions in protosym are sin and cos so all demonstrations use those! I haven't added more functions because you can easily add your own functions, and this is a key point of the design:

In [24]: from protosym.simplecas import Function, diff, a, x
In [25]: tan = Function('tan')
In [26]: diff[tan(a), a] = 1 + tan(a)**2
In [27]: tan(x**2).diff(x)
Out[27]: ((1 + tan(x**2)**2)*(2*x**(2 + -1)))
In [28]: %time ok = tan(tan(tan(x))).diff(x, 10)
CPU times: user 95.6 ms, sys: 2.58 ms, total: 98.2 ms
Wall time: 95.5 ms

Adding your own functions is important because the intention is that protosym would not define any functions but would rather just be a framework that can be built on top of. E.g. SymPy could depend on protosym but could then define all the functions that it wants in the sympy codebase.

Here is an example that is just polynomial expressions, which is probably closer to the PyDy examples. In this case the polynomials are too large to be expanded the way that Flint/polys would (timings without rust_protosym):

In [2]: from protosym.simplecas import Function, diff, a, x
In [3]: def make_expression(x, n):
...: e = x**2 + x
...: for _ in range(n):
...: e = e**2 + e
...: return e
...:
In [4]: make_expression(x, 1)
Out[4]: ((x**2 + x)**2 + (x**2 + x))
In [5]: make_expression(x, 2)
Out[5]: (((x**2 + x)**2 + (x**2 + x))**2 + ((x**2 + x)**2 + (x**2 + x)))
In [6]: make_expression(x, 3)
Out[6]: ((((x**2 + x)**2 + (x**2 + x))**2 + ((x**2 + x)**2 + (x**2 + x)))**2 + (((x**2 + x)**2 + (x**2 + x))**2 + ((x**2 + x)**2 + (x**2 + x))))
In [7]: e = make_expression(x, 100)
In [8]: e.count_ops_graph()
Out[8]: 204
In [9]: e.count_ops_tree()
Out[9]: 10141204801825835211973625643005
In [10]: %time ed = e.diff(x)
CPU times: user 10.2 ms, sys: 7.94 ms, total: 18.1 ms
Wall time: 18.1 ms
In [11]: %time ed.eval_f64({x:.001})
CPU times: user 3.31 ms, sys: 0 ns, total: 3.31 ms
Wall time: 3.32 ms
Out[11]: 1.2368849025211697
In [11]: from symengine import symbols
In [12]: x = symbols('x')
In [13]: e = make_expression(x, 100)
In [14]: %time ed = e.diff(x)
CPU times: user 14.2 ms, sys: 0 ns, total: 14.2 ms
Wall time: 14.7 ms
In [16]: %time ed.subs({x:.001})
CPU times: user 14.2 ms, sys: 3.36 ms, total: 17.6 ms
Wall time: 18 ms
Out[16]: 1.23688490252117

This is a contrived extreme benchmark that favours CSE-style simplification, but it is also not completely unrealistic for the situations that arise in PyDy. They are not situations where any simplification really helps: basically a large expression is built with symbolic arithmetic and it has products and powers that cannot be expanded. The expression always has large repeating subexpressions. They want to differentiate those large expressions and then evaluate numerically or do codegen.

Of course here I used a moderate number of iterations; the construction scales much further:

In [12]: %time e = make_expression(x, 1000)
CPU times: user 91.6 ms, sys: 0 ns, total: 91.6 ms
Wall time: 89 ms
In [13]: %time e = make_expression(x, 10000)
CPU times: user 463 ms, sys: 3.3 ms, total: 466 ms
Wall time: 464 ms
In [14]: from symengine import symbols
In [15]: x = symbols('x')
In [16]: %time e = make_expression(x, 1000)
CPU times: user 110 ms, sys: 12.1 ms, total: 122 ms
Wall time: 120 ms
In [17]: %time e = make_expression(x, 10000)
^C---------------------------------------------------------------------------
KeyboardInterrupt

I interrupted that last step because it was seeming slow and I didn't want my computer to crash (as has happened when previously testing examples like this). |
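The graph and tree counts quoted above (204 vs 10141204801825835211973625643005, which is 2**103 - 3) can be reproduced with a toy hash-consed expression type. The Node/node names here are illustrative, not protosym's API; the point is that interning makes repeated subexpressions shared, so the graph grows linearly while the tree grows exponentially.

```python
class Node:
    """Toy expression node; the interning below ensures equal expressions are shared."""
    __slots__ = ("op", "args")
    def __init__(self, op, args):
        self.op = op
        self.args = args

_interned = {}

def node(op, *args):
    # Hash-consing: identical (op, args) combinations return the same object.
    key = (op,) + args
    if key not in _interned:
        _interned[key] = Node(op, args)
    return _interned[key]

def make_expression(x, n):
    e = node("Add", node("Pow", x, node("Integer", 2)), x)
    for _ in range(n):
        e = node("Add", node("Pow", e, node("Integer", 2)), e)
    return e

def count_ops_graph(e):
    # Count each distinct node once (atoms included).
    seen = set()
    def walk(a):
        if isinstance(a, Node) and id(a) not in seen:
            seen.add(id(a))
            for child in a.args:
                walk(child)
    walk(e)
    return len(seen)

def count_ops_tree(e):
    # Tree size: shared nodes counted once per occurrence; memoised over the DAG.
    memo = {}
    def walk(a):
        if not isinstance(a, Node):
            return 0
        if id(a) not in memo:
            memo[id(a)] = 1 + sum(walk(child) for child in a.args)
        return memo[id(a)]
    return walk(e)

x = node("Symbol", "x")
e = make_expression(x, 100)
print(count_ops_graph(e))  # 204
print(count_ops_tree(e))   # 10141204801825835211973625643005
```

The recurrence for the tree size is t(n) = 2*t(n-1) + 3 with t(0) = 5, giving 2**(n+3) - 3, while the graph only gains an Add and a Pow node per iteration: 4 + 2n nodes.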
Yes, it's a good benchmark for custom functions like that. Note that in LPython I think we'll eventually support all functions, kind of like Mathematica does (eventually thousands). For the PyDy use case, isn't it rare to need to create your own function with custom rules? When I started SymPy this was the first thing I did, to make it easy, but I think it's very rarely used. I never used it myself. |
PyDy does not need to create its own functions but SymPy does. If SymPy depends on something else then it is problematic if all of the functions and the rules for those functions are defined somewhere else without SymPy having any control over them.
You might not, but SymPy's use as a library within the scientific Python ecosystem is significantly limited by not being easily extensible like this. There would be many more libraries building on top of SymPy if it was easier to do this and if those new expression types were fast: if SymPy does not make your custom symbolic expressions faster than writing your own implementation of Basic then you can just write your own. Plenty of libraries do that rather than building on top of SymPy or SymEngine because they are just not useful in that situation (SymPy is too slow and SymEngine is not extensible). This is why the Julia people decided that neither SymPy nor SymEngine was suitable for what they needed. |
Actually this isn't true. PyDy does need to create its own functions for codegen etc. The functions would not be like sin or cos but rather things like |
SymPy is extensible, you can create your own class. You probably mean custom simplification rules? Regarding Julia, they wanted more flexibility, but when they stopped using SymEngine, they gave up performance, as my benchmarks show. You can definitely trade performance for flexibility. But the goal of SymEngine is maximum performance. |
Firstly you can't control what the existing types like Add and Mul do automatically. In the examples above the graph size increases because of exactly this kind of automatic rewriting.

Secondly the way to extend SymPy is far too difficult. Few libraries attempt it and in fact few SymPy developers can get it right either. The guide in the SymPy documentation is good, but what it shows is that it is just much more difficult than it should be.
They gave up performance of some things in favour of flexibility but also to focus on performance in other things. They are now building many more useful things on top of their symbolics than are built on top of SymEngine and SymPy.

The question is: in what situations does it actually make sense to do the particular things that SymEngine would make faster? If the goal is numerical evaluation then we want something like LLVM. If the goal is solving linear equations symbolically then we want something like Flint. If the goal is solving linear equations numerically then we want BLAS. Fast differentiation is important and is where SymEngine can speed up a calculation that might otherwise be slow with SymPy.

In the context of the mechanics module the situation is basically that we want to solve a system of linear equations, differentiate and then eventually evaluate numerically (actually it is more complicated but these are the basic steps). The large expressions to be differentiated are generated by solving the system of linear equations symbolically. The differentiation is slow just because solving the system symbolically generates absurdly large expressions whose derivatives become even more absurdly large expressions.

Actually though when the end goal is numerical evaluation it would be better not to solve the equations symbolically: evaluate the matrices numerically and then use BLAS to solve the linear equations numerically. Doing it this way is both faster and more accurate. The question is how you get the derivatives, and you get them from implicit differentiation. We have equations of the form M*x = b and we want the derivative x' of the solution.

So we solve and then differentiate: first x = M**-1*b and then differentiate that explicit solution.

This is slow with SymPy and SymEngine makes it a lot faster but it is still quite slow, and both SymPy and SymEngine generate absurdly large expressions when doing this. Computing the explicit symbolic solution is the expensive step. Actually though this is not the right way to do the calculation. Instead we should use implicit differentiation.

Now the combined equations become M*x' = b' - M'*x.

The expressions in this equation are all much smaller than in the explicit solution. What the mechanics module needs is a way to represent the linear solve part implicitly, something like x = LinearSolve(M, b).

We need this to be implicit rather than computing the solution in explicit symbolic form because when we codegen this the LinearSolve should become a call to BLAS. It turns out that doing the calculation this way means that the expressions are all small, and what you really want is to have unevaluated things like an unevaluated matrix inverse, but SymPy's matrix expressions end up evaluating the inverse symbolically.

This is exactly what you didn't want: the inverse was supposed to be computed numerically later, not symbolically now. Note that this is happening with the matrix expressions module that was designed to have more restricted automatic evaluation than the rest of SymPy, but it still ultimately has the same problem. In the calculation above we never want the explicit symbolic inverse.

On the other hand sometimes it does make sense to solve the linear equations symbolically because your end goal is to do symbolic things rather than numeric things. In that case though you need to compute the sort of symbolic inverse that Flint/DomainMatrix computes because you need the expressions to be structured rather than unstructured:

In [26]: from symengine import symbols, Matrix
In [27]: x, y, z, t = symbols('x, y, z, t')
In [28]: M = Matrix([[x, y], [z, t]])
In [29]: M.inv()
Out[29]:
[(1 + y*z/(x*(t - y*z/x)))/x, -y/(x*(t - y*z/x))]
[-z/(x*(t - y*z/x)), (t - y*z/x)**(-1)]

With DomainMatrix that is:

In [30]: from sympy import symbols, Matrix
In [31]: x, y, z, t = symbols('x, y, z, t')
In [32]: M = Matrix([[x, y], [z, t]])
In [33]: M.to_DM()
Out[33]: DomainMatrix({0: {0: x, 1: y}, 1: {0: z, 1: t}}, (2, 2), ZZ[x,y,z,t])
In [34]: Mnum, den = M.to_DM().inv_den()
In [35]: Mnum.to_Matrix()
Out[35]:
⎡t -y⎤
⎢ ⎥
⎣-z x ⎦
In [36]: den.as_expr()
Out[36]: t⋅x - y⋅z

Note here that this structured inverse will be in a much nicer form for differentiation: every expression is a ratio of two expanded polynomials. Obtaining the inverse in this structured form is more useful for subsequent symbolic calculations, and it is also guaranteed to avoid division by zero, detects whether the matrix is invertible, etc. To compute this sort of structured symbolic inverse fast what is needed is DomainMatrix/Flint rather than expression trees.

So if all of these things are being done in the best possible way (LLVM, BLAS, Flint, unevaluated expressions, ...) then the things that SymEngine accelerates become less relevant in the overall workflow. I think that is basically what the Julia folks have realised and why they are not trying to match SymEngine's speed for algebra and differentiation and are instead focusing on other things. |
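The implicit-differentiation idea mentioned above can be checked numerically on a toy 2x2 system, with Cramer's rule standing in for a BLAS/LAPACK solve (the system M(t), b(t) here is an arbitrary made-up example): differentiating M(t)*x(t) = b(t) gives M*x' = b' - M'*x, so x' comes from a second linear solve with the same matrix.

```python
import math

def solve2(M, v):
    # Cramer's rule for a 2x2 system; this would be a BLAS/LAPACK call in real code
    (a, b), (c, d) = M
    det = a * d - b * c
    return [(v[0] * d - b * v[1]) / det,
            (a * v[1] - v[0] * c) / det]

# An arbitrary parametrised system M(t) x(t) = b(t) and its derivatives
def M(t):  return [[2.0 + math.sin(t), 1.0], [0.5, 3.0 + t * t]]
def b(t):  return [math.cos(t), t]
def dM(t): return [[math.cos(t), 0.0], [0.0, 2.0 * t]]
def db(t): return [-math.sin(t), 1.0]

t = 0.7
x = solve2(M(t), b(t))

# Implicit differentiation: M x' = b' - M' x
rhs = [db(t)[i] - sum(dM(t)[i][j] * x[j] for j in range(2)) for i in range(2)]
xprime = solve2(M(t), rhs)

# Cross-check against a central finite difference of the solved x(t)
h = 1e-6
xp = solve2(M(t + h), b(t + h))
xm = solve2(M(t - h), b(t - h))
fd = [(xp[i] - xm[i]) / (2 * h) for i in range(2)]
print(xprime, fd)  # the two estimates should agree closely
```

No symbolic inverse is ever formed: both x and x' come from numerical solves against the same matrix.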
Yes, there is nothing wrong with giving up performance for flexibility. As long as we are clear on this and communicate this clearly. There is also nothing wrong with optimizing the library for a different set of benchmarks than SymEngine is. Again, not a problem, as long as we are completely clear about that. So why don't you create a set of relevant benchmarks that you think we should optimize for? |
But we can also still use SymEngine for the things where it is faster, so we don't need to make that tradeoff. What we really need is for it to be fast to convert from a SymPy expression to SymEngine and back, because then it would be easy to speed up a particular calculation. For example in the mechanics module it would make sense to convert to SymEngine to compute a large derivative, but you would need to be able to quickly convert from SymPy to SymEngine and back for it to be worthwhile. I considered the same thing with protosym, e.g. (without rust_protosym):

In [9]: from sympy import symbols, cos, sin
In [10]: %time e = sin(sin(sin(sin(x))))
CPU times: user 4.21 ms, sys: 9 µs, total: 4.22 ms
Wall time: 4.2 ms
In [11]: %time ok = e.diff(x, 4)
CPU times: user 340 ms, sys: 73 µs, total: 340 ms
Wall time: 338 ms
In [12]: from protosym.simplecas import Expr
In [13]: %time e_proto = Expr.from_sympy(e)
CPU times: user 485 µs, sys: 18 µs, total: 503 µs
Wall time: 518 µs
In [14]: %time x_proto = Expr.from_sympy(x)
CPU times: user 202 µs, sys: 0 ns, total: 202 µs
Wall time: 216 µs
In [15]: %time ed_proto = e_proto.diff(x_proto, 4)
CPU times: user 6.31 ms, sys: 0 ns, total: 6.31 ms
Wall time: 6.26 ms
In [16]: %time ed_sympy = ed_proto.to_sympy()
CPU times: user 125 ms, sys: 2 µs, total: 125 ms
Wall time: 123 ms

So in this example SymPy takes 338 milliseconds to compute a derivative. Converting to protosym takes effectively no time in comparison to this and differentiating in protosym takes 6 milliseconds (and would be faster if rust_protosym was used). The problem is that converting back to a SymPy expression takes 120 milliseconds. If that last step was fast then it would be possible to make all sorts of things fast using SymEngine or anything else in particular places. As long as converting back to a SymPy expression is slow, that will always be the speed-limiting factor.

So why is it slow to convert back to a SymPy expression? It is because just creating SymPy expressions is slow, due to SymPy's automatic evaluation code being slow. It gets more noticeable in cases like this where a calculation is done outside of SymPy, because when you convert back you are creating lots of expressions that aren't in SymPy's cache (otherwise this slowness would be spread out over time and might seem like it was part of diff: actually it is just the cost of constructing the expressions).

If SymPy had a non-evaluating expression type then that conversion step could be made much faster. Many things could be made much faster if there was just some way to create SymPy expressions quickly when evaluation is not needed. The expression in the example above is not that complicated:

In [21]: ed_proto.count_ops_graph()
Out[21]: 106

We only need to create 106 expression nodes, and even in Python it should be possible to create a node much more quickly than 1 millisecond (i.e. 1 million CPU instructions). |
I think we need to be able to construct SymPy expressions with automatic evaluation disabled --- obviously we are then responsible for ensuring they are in the correct form (so we can't create expressions that would normally evaluate to a different canonical form). |
If SymPy had an unevaluated expression type then it would be possible to convert to that without worrying about this. |
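What such a non-evaluating type could look like can be sketched in a few lines. This is a hypothetical illustration, not an actual SymPy class: construction just stores the head and arguments with no flattening, sorting, caching or canonicalisation, so building a node is about as cheap as creating any Python object.

```python
class Unevaluated:
    """Hypothetical non-evaluating expression node: construction does no simplification."""
    __slots__ = ("head", "args")

    def __init__(self, head, *args):
        self.head = head
        self.args = args

    def __repr__(self):
        if not self.args:
            return str(self.head)
        return f"{self.head}({', '.join(map(repr, self.args))})"

x = Unevaluated("x")
# Add(x, x) stays exactly as written; nothing rewrites it to Mul(2, x)
e = Unevaluated("Add", x, x)
print(e)  # Add(x, x)
```

The cost is that the caller is responsible for correctness: nothing stops you from building a form that evaluated SymPy would never produce.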
CC @certik
Adds a flatten function that can be used for common simplifications involving Add, Mul, Pow and Integer.

While I want to have a flatten function I don't want it to be used by default. While some of these simplifications are unambiguously good, others are not, which is why this sort of thing needs to be controlled.
The code implementing flatten is not great. I just put it together quickly while I think of a better way of structuring it.

The flatten function is used by default inside diff but can also be disabled. I don't think I want this function to be used by default in diff. A much more limited version of this would be good because the current behaviour on main gives incorrect derivatives after a few rounds of differentiation. All that is needed to fix that though is flattening an Add of only integers (constant folding) and applying Pow(x, 1) -> x (which is an unambiguously good simplification). Probably just a specialised derivative rule for diff(x**2, x) -> Mul(2, x) could do that. I think that I will try a more limited version of this PR that only has those two rules: constant folding and handling x**2 better in differentiation.

This PR would fix that but also does too much. This makes this particular benchmark faster, e.g. on main we have (not using rust_protosym):

However with this PR we have (not using rust_protosym):
In this example flattening is helpful because all derivatives of this expression can be reduced down. However this is not actually a common case in practice, because it is a high-order derivative of a simple expression that happens to simplify. The best algorithm for a case like this would be something specialised to high-order derivatives. In practice what we want from an efficient implementation of diff is mostly to be able to efficiently compute 1st and 2nd order derivatives of large expressions, and in most cases the sort of reduction seen for high derivatives of x**10*sin(x) will not be seen.

The reason that I use benchmarks like sin(sin(sin(sin(sin(sin(x)))))).diff(x, 10) is that I know that the derivatives cannot simplify by much, and so after a few iterations we will necessarily find ourselves differentiating large expressions. What I want to benchmark is how performance scales for those large expressions.

The problem with the approach in this PR is that some of the rules applied by flatten increase the size of the expression graph:
You can see here that the expression tree representation is made smaller by flattening but the expression graph representation is made larger. We generally want to work with the graph though because it is always smaller than the tree. These differences can all become more extreme:
There are various improvements that can be made to the flattening here. SymEngine's flattening actually computes a smaller derivative expression. The difference is most likely the handling of things like a*b + b*a -> 2*a*b, which I haven't implemented here yet. These are the numbers for SymEngine's derivative of the above expression (after converting to protosym via SymPy, which might have altered the expression):

We can see here that while SymEngine's derivative has a smaller tree representation it still has a larger graph representation than the derivative computed by protosym without flattening. What I want to do here is to shrink that graph representation by identifying precisely the right simplification rules and applying only those rules.
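The two "unambiguously good" rules proposed above, constant folding of integer-only Adds and Pow(x, 1) -> x, can be sketched over a toy tuple representation. The ("Add", ...) encoding and the flatten_node name are illustrative only, not protosym's internals:

```python
def flatten_node(e):
    """Apply two safe rules bottom-up: fold integer terms in Add, and Pow(x, 1) -> x."""
    if not isinstance(e, tuple):
        return e  # atom: a symbol name or an int
    head, args = e[0], [flatten_node(a) for a in e[1:]]
    if head == "Add":
        # Constant folding: collect all integer arguments into one
        ints = [a for a in args if isinstance(a, int)]
        rest = [a for a in args if not isinstance(a, int)]
        folded = sum(ints)
        if not rest:
            return folded
        if folded != 0:
            rest.append(folded)
        return ("Add",) + tuple(rest) if len(rest) > 1 else rest[0]
    if head == "Pow" and args[1] == 1:
        return args[0]  # Pow(x, 1) -> x
    return (head,) + tuple(args)

print(flatten_node(("Add", 1, 2, 3)))              # 6
print(flatten_node(("Pow", "x", 1)))               # x
print(flatten_node(("Add", ("Pow", "x", 1), 2, 3)))  # ('Add', 'x', 5)
```

Because both rules strictly remove nodes, they shrink the tree and the graph at the same time, unlike distributive rewrites which can shrink the tree while growing the graph.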