Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hadamard product? #1083

Open
personalityson opened this issue Feb 4, 2017 · 27 comments
Open

Hadamard product? #1083

personalityson opened this issue Feb 4, 2017 · 27 comments

Comments

@personalityson
Copy link

Elementwise vector product (like dot, but without the reduction to one number)
Does it exist?

In MKL its vdMul, in cuBLAS its DHAD (so, not part of the standard BLAS function list)...

Hadamard product is completely central in machine learning, and is the only reason which prevents me from using OpenBLAS...

@martin-frbg
Copy link
Collaborator

Does not exist currently but sounds interesting. Would you happen to have some rough number for the speedup from using MKL compared to just letting a modern compiler optimize a naive implementation ?

@brada4
Copy link
Contributor

brada4 commented Feb 4, 2017

It used to be in BLAS.... Still in lapack 3.1.1
it will be 3/2 execution time of DOT (3 memory accesses in place of 2 for dot per multiplication), and _dot is only C
Probably C macro will perform much better than a function call...

@martin-frbg
Copy link
Collaborator

@brada4 the only trace of this function appears to be a spurious entry in the intro_blas1 manpage distributed with lapack 3.1.1

@personalityson
Copy link
Author

Just learned there is a walkaround by using sbmv
http://stackoverflow.com/questions/7621520/element-wise-vector-vector-multiplication-in-blas

Elementwise matrix*matrix product would still be nice...

@martin-frbg
Copy link
Collaborator

Not sure abusing sbmv for this would be more efficient than a simple loop over all elements ? BTW the only references for *HAD functions in cuBLAS appear to be people asking if there are any...

@brada4
Copy link
Contributor

brada4 commented Feb 5, 2017

Now imagine your sbmv multiplying 2 million-element vectors ...
There is less wasteful way with _gemv or _gemm (i.e free dimension(s) ==1) (by magnitude slower than a loop)
dot is simple loop. I say macro because for small samples library call would be enormous time waste.

@personalityson
Copy link
Author

less wasteful way with _gemv or _gemm (i.e free dimension(s) ==1)

Please explain... Do you extract the diagonal from the resulting matrix somehow? How is it less wasteful?

@brada4
Copy link
Contributor

brada4 commented Feb 6, 2017

Diagonal? It is not square.

@martin-frbg
Copy link
Collaborator

martin-frbg commented Feb 6, 2017

Not sure I get the gemm/gemv idea either, but it could be instructive to see a quick benchmark comparison of that sbmv suggestion from stackoverflow against a simple loop that the compiler can unroll as needed.
As far as I could find out, the *HAD functions were a brainchild of Cray/SGI in the mid-1990s (proposed for future inclusion in LAPACK at some netlib developer meeting in 1995) and part of their SCSF library but never universally supported.

@personalityson
Copy link
Author

The only way I see with gemv/gemm is to multiply two vectors v1T * v2 (of the same size), which returns a square matrix, and the diagonal is actually the hadamard product.
I spent the whole weekend looking if there were some tricks with lda/incx, but I'm too dumb to find anything. Maybe there is? :)

sbmv allows the input matrix to be stored in banded matrix form. With number of super-diagonals = 0 you only need to supply the diagonal, which in our case is one of the vectors, so it's not so bad...

I should explain the context to my problem:
I'm on a work computer where I do not have admin rights, and cannot install or compile anything.
I use Excel VBA to declare function calls to libopenblas.dll in order to do some matrix crunching.
(When I write that hadamard is "the only reason which prevents me from using OpenBLAS" -- I'm bluffing. mkl_rt.dll does not allow stdcall declarations from VBA, or maybe there is another reason, in any case I can only use OpenBLAS.)
So, my two options are sbmv or a C Macro VBA macro for hadamard, and ideally I would want to have it outside of VBA...

@brada4
Copy link
Contributor

brada4 commented Feb 6, 2017

swap dimensions of v2 and get HAD
swap dimensions of v1 and get DOT

@grisuthedragon
Copy link
Contributor

The Hadamard product is obviously a memory bounded operation, so the only thing one had to take care of is that the matrices are not traversed orthogonal to their storage scheme. If the matrices are accesses in the right way one has 3 loads, 1 multiplication and one store. The only point I see where one can optimize something is if we assume $ C := \alpha A\circle B + \beta C $ with $\beta=0$ and using non-temporal stores for C. In this case we only have 2 loads and one store per multiplication but requires aligned access to C and C must be larger than the L3 cache of current Intel-CPUs to give good results.

@brada4
Copy link
Contributor

brada4 commented Feb 7, 2017

in common case one can treat matrices as 1:(M*N) vectors and apply marginal case of gemm / gemv

@personalityson
Copy link
Author

I still don't see it. Please give us a concrete example
Assume size(v1) = size(v2)

@brada4
Copy link
Contributor

brada4 commented Feb 7, 2017

FUNCTION DHAD2(N,A,B,C)
DGEMV('N',N,1,1.0,...

@personalityson
Copy link
Author

dgemv "N", n, 1, 1.0, v1, n, v2, 1, 1.0, res, 1
This returns v1 multiplied by the first element in v2...

@hiro4bbh
Copy link

hiro4bbh commented Feb 9, 2017

I think we can do hadamard products using xtbmv with diagonal matrix A as the following:

cblas_dtbmv(CblasColMajor, CblasUpper, CblasNoTrans, CblasNonUnit, n, 0, p, 1, q, 1);

This code does hadamard product of vectors p and q whose length are n,which writes result to q.

@hiro4bbh
Copy link

hiro4bbh commented Feb 9, 2017

Sorry, I recommended to use xtbmv for Hadamard product.
However, according to tbmv_thread.c, using dtbmv for Hadamard
product is inefficient (calling xaxpy for each elements?), so I recommend
to use your own Hadamard product routine in C.
Furthermore, in optimized codes on some platforms, there are mysterious
bugs, and I have no idea to solve its problem, because it happens sometimes...

@martin-frbg
Copy link
Collaborator

@personalityson certainly an unexpected but interesting use case.
I do now understand the merit of having the *had functions in OpenBLAS even if just for "convenience" without any particular speed benefit. Coding them as simple multiplication loops is very likely to be more efficient than "abusing" a special case of some more general function. As regards the mysterious bugs in optimized code, I trust the common goal is to get rid of them :-)

@personalityson
Copy link
Author

Btw, the stdcall requirement normally needed for dlls to work in VBA only applies to 32bit dlls
I saw an old feature request from 2014 also related to VBA:
#423
In any case 64bit bins work fine in 64bit version of Office

@martin-frbg
Copy link
Collaborator

martin-frbg commented Feb 9, 2017

Looking at the declaration of xHAD which has alpha and beta multipliers, incx, incy and even incz for a sparse result vector, sbmv/tbmv actually could be a good choice. The MKL-style vxMul on the other hand appears to support only the simple multiplication of N elements of vectors x and y to yield vector z.

@loadwiki
Copy link

Hello everybody, I am looking forward this function too! My program does a video tracking task like this OpenCV tracking algorithm. The algorithm does a lot of point wise complex product in Fourier domain. The implementation in OpenCV is cv::mulSpectrum() while it is the bottle neck. My program will run on multi platform such as a x86 and ARM. OpenBLAS is my first choice from consideration of cross-platform and high performance. However, it is a pity to lack point wise vector product. Eigen and MKL do have this function.

@brada4
Copy link
Contributor

brada4 commented Apr 27, 2018

You can accelerate parts of Eigen using BLAS (OpenBLAS, MKL, Accelerate Framework)
https://eigen.tuxfamily.org/dox-devel/TopicUsingBlasLapack.html
The "this" function in MKL is not part of BLAS functions, but of other group, that is not provided in OpenBLAS.

@MaxBarraclough
Copy link

Hi folks, I don't have anything to add other than to second @loadwiki , this would be a useful addition.

@brada4
Copy link
Contributor

brada4 commented Mar 24, 2021

It is in all viable macro libraries, memory bound, there wont be any improvement in speed adding it here over generic BLAS. I'd bet on modern compilers actually getting totally naive loop reasonably vectorised.

@martin-frbg
Copy link
Collaborator

there wont be any improvement in speed

err, not sure I'd want to underwrite that...

@brada4
Copy link
Contributor

brada4 commented Mar 25, 2021

Non-destructive AXPY that is 3 memory cells per MUL in place of original 2 ....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants