
proposal: Go 2: Add implicit single program multiple data #58610

Closed
Bluebugs opened this issue Feb 20, 2023 · 39 comments
Labels
FrozenDueToAge LanguageChange Suggested changes to the Go language Proposal Proposal-FinalCommentPeriod v2 An incompatible library change
Milestone

Comments

@Bluebugs

Author background

I have been doing Go development for a few years now, and before that I had long experience with C and C++ development, as well as writing OpenGL shaders.

Proposal

This is a proposal to add the capability to do implicit Single Program Multiple Data (SPMD) computation à la ISPC. It does not intend to cover intrinsics, but it does in a way cover portable SIMD. For the rest of the proposal, I will refer to implicit Single Program Multiple Data as iSPMD.

This proposal is a Go 2 language change proposal that will require changes to the Go language and large changes to the compiler and Go tools. I understand that the bar for such a change is high, but I think the benefits of bringing iSPMD into Go might outweigh those difficulties, and that people will consider the potential improvement this would bring to the Go ecosystem: readable, maintainable, safe, reliable, portable and fast vectorized code. This proposal should make iSPMD code usable and acceptable in all of Go's stdlib and accessible to the entire ecosystem.

There have already been a few proposals trying to bring some form of vectorized code to Go; none looked at it from the iSPMD angle:
#35307
#53171

Due to the high burden of maintaining assembly code and the safety risks it poses, it is very hard to get SIMD optimizations into the stdlib even when there is demand, as the following issues show:
#53178
#24499
#20206
#19636

iSPMD could make it more manageable for projects like gonum to support more architectures without having to hand-write as much assembly for each one. Rasterx and go-text would likely benefit from it too. Any compute-intensive workload would likely be better off with iSPMD than handwritten assembly. Vector databases like Weaviate have expressed interest in seeing the situation around SIMD improved.

An additional challenge of this proposal is that, to my knowledge, no language exists with both an iSPMD and a non-iSPMD context natively. Nevertheless, I believe it is doable, and that is the aim of this proposal.

TL;DR;

Add the notion of an iSPMD context, introduced by the keyword ispmd, in which all operations are executed in parallel. Add a new keyword uniform to specify that a piece of data represents a single value shared by all parallel streams of computation in an iSPMD context. Structures and data declared outside an iSPMD context are all, by definition, implicitly uniform. Structures and data declared inside an iSPMD context are all, by definition, varying, with a length matching the SIMD register selected by the compiler. Add an each keyword for iteration to specify that each element of a gang comes from a different element of a range, simply loading one SIMD register per step of the for loop instead of loading the same value into each member of the gang.

The proposed syntax for an iSPMD context could look like the short example below (a longer one is at the end):

func Sum(a []int32) int32 {
    ispmd {
        var partialSum int32
        for _, v := each range a {
            partialSum += v
        }
        return ReduceAdd(partialSum)
    }
}


//go:ispmd AVX for ReduceAdd
func asmReduceAddAVX(a int32) uniform int32
//go:ispmd SSE4 for ReduceAdd
func asmReduceAddSSE4(a int32) uniform int32
//go:ispmd None for ReduceAdd
func noneReduceAdd(a int32) uniform int32 { return a }

For a developer trying to optimize a workload, they just have to pick which unit can be processed in parallel and switch to an iSPMD context for it: number, character, word, string, pixel, line of pixels, image, glyph, and so on.

Context

For discussion, I highly recommend reading the whole history of ISPC here, as it is full of very useful information for understanding the context and benefits of iSPMD: https://pharr.org/matt/blog/2018/04/30/ispc-all with a lot of insight on bringing this to more conventional workloads.

ISPC is not the only iSPMD solution in use today. Obviously, anyone who has written a GPU shader in OpenGL, Vulkan or DirectX has written an iSPMD program. The same applies to people writing OpenCL or CUDA compute programs.

Meanwhile, the claims from Matt Pharr regarding the unreliability of compiler auto-vectorization are still true today, with no solution in sight. Auto-vectorization is still not relied upon for generating efficient code. Intrinsics are used a lot, but the number of developers writing them is limited, and there is significant risk in writing intrinsics or assembly directly.

ISPC was inspired by C, but it introduced the keyword launch, which does pretty much the same thing as the keyword go in Go. As an example, https://github.com/ispc/ispc/blob/main/examples/cpu/mandelbrot_tasks/mandelbrot_tasks.ispc will not feel very far from what a Go iSPMD program could look like. ISPC was focused on writing just the small part of your program that needs speed, but it enables more complex pieces of optimized software, like this BC7 encoder: https://github.com/richgel999/bc7enc_rdo/blob/master/bc7e.ispc, which would be very hard to do with any other approach.

An interesting experiment from Saksham Jain and Sivaprasad Sudhir used the Go parser to generate ISPC code from Go code for some specialized functions. This was presented in their thesis final report, https://www.andrew.cmu.edu/user/sakshamj/15618/final.pdf, and the result of their work can be seen here: https://github.com/sakjain92/govec . Please look at their benchmark results, which show the potential for iSPMD in Go: they indicate several-fold speedups on many compute tasks. This also proves that not much change is necessarily needed to mix iSPMD code with normal Go code and get significant performance out of it. As a readable example of what that looks like, and one that can be compared to the previous ISPC file, here is a Mandelbrot example: https://github.com/sakjain92/govec/blob/master/blas/go/mandelbrot.go . The main drawbacks of this prototype are that it relies on cgo, has some weird syntax requirements to get things working, and has limitations in the code it generates. This means we couldn't use it to improve any of the standard library.

There have also been attempts to make iSPMD work within C++ with projects like https://github.com/richgel999/CppSPMD_Fast/ or https://github.com/google/highway. It works and the syntax doesn't feel bad; again with a Mandelbrot example, as it seems to be the benchmark of choice: https://github.com/richgel999/CppSPMD_Fast/blob/master/mandelbrot_imp.h . This demonstrates mixing iSPMD and non-iSPMD contexts, but not really with the help of the language.

Additionally, for reference, you can read https://www.codeproject.com/Articles/1223361/Benchmarking-NET-Core-SIMD-performance-vs-Intel-IS and https://api-depositonce.tu-berlin.de/server/api/core/bitstreams/7489a854-6557-496e-8b3e-30b47e67b0bf/content to get an idea of the performance of alternative solutions and how well ISPC fares. It seems that the only contender (short of a full intrinsics/assembly approach) that managed to generate better code is Rust's packed_simd + rayon approach: https://github.com/rust-lang/packed_simd/tree/master/examples/mandelbrot . There is definitely a speedup from their approach at the moment compared to ISPC output, but the code complexity seems quite a bit higher and the code much less readable.

Proposal

The goal of this proposal is to enable iSPMD code in portions of a Go program that the compiler will be able to translate directly into a vectorized set of operations (see https://pharr.org/matt/blog/2018/04/21/ispc-volta-c-and-spmd for an explanation of how the compiler would do so). There should not be any dependency on cgo. It should not break existing programs, as they could not be using the new keywords in a manner that would confuse the compiler. It should be easy for all Go compilers to properly handle an iSPMD section even if they do not support SIMD. It should give developers clear expectations of what happens in an iSPMD section.

An iSPMD context in Go will execute the code in parallel for each member of a gang, using SIMD instructions and an execution mask to follow the control flow of the code. This means that race conditions are possible when writing iSPMD code, and tools will need to help detect them. A classic example:

func Sum(a []int32) int32 {
    var sum int32
    ispmd for _, v := each range a {
        sum += v // We have a race condition here
    }
    return sum
}

I would expect the above example to actually not compile due to a type mismatch (uniform int32 vs varying int32), but it nevertheless shows the kind of race condition to expect.
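To make the gang model concrete, here is a plain-Go sketch of what a compiler could lower the earlier iSPMD Sum example to. Everything here is an assumption for illustration: the 4-lane `lanes` constant, the `reduceAdd` helper standing in for the proposed ReduceAdd, and the remainder handling are all hypothetical, not part of the proposal's specified semantics.

```go
package main

// A hypothetical 4-lane gang; a real compiler would pick the lane
// count from the target's SIMD register width.
const lanes = 4

// Sum sketches a serial lowering of the proposed iSPMD version:
// each lane accumulates its own partial sum (a "varying" int32),
// and a final reduction produces the "uniform" result.
func Sum(a []int32) int32 {
	var partial [lanes]int32 // varying int32: one value per lane
	i := 0
	for ; i+lanes <= len(a); i += lanes {
		for l := 0; l < lanes; l++ { // one SIMD add in real codegen
			partial[l] += a[i+l]
		}
	}
	// Remainder elements would be handled with a partial mask.
	for ; i < len(a); i++ {
		partial[i%lanes] += a[i]
	}
	return reduceAdd(partial)
}

// reduceAdd plays the role of the proposed ReduceAdd built-in:
// it collapses a varying value into a uniform one.
func reduceAdd(v [lanes]int32) int32 {
	var s int32
	for _, x := range v {
		s += x
	}
	return s
}

func main() {
	println(Sum([]int32{1, 2, 3, 4, 5, 6, 7})) // 28
}
```

The inner `for l` loops are the part a real compiler would replace with single SIMD instructions; the per-lane `partial` array is why the race in the example above disappears in the proposed formulation.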

We will need to be able to tag pieces of code to indicate that they are iSPMD code. For that purpose, a new keyword ispmd could be used to switch into an iSPMD context (I considered using igo, but I am afraid that would be confusing, as this does not create any goroutine).

There is also a need for the for iteration to pick up each element of the slice independently, essentially loading a chunk of the slice into an entire SIMD register at a time. Putting all of that together, it should look like this:

func Sum(a []int32) int32 {
    ispmd {
        var partialSum int32
        for _, v := each range a {
            partialSum += v
        }
        return ReduceAdd(partialSum)
    }
}

The problem now is that any function call from this context would force us out of the iSPMD context. A solution is to reuse the ispmd keyword to indicate to the compiler that a function can continue executing as part of the iSPMD context. This would look like:

ispmd func doSomething(a int, b int) int { return a + b }

When an iSPMD context calls a tagged function, the compiler should generate a version of it that matches the vector length, propagate the mask, and not break out of the iSPMD context. If the function is not tagged, the compiler should break out, iterate over the vector, and return to the iSPMD context afterwards. (This means that when calling a non-iSPMD function from an iSPMD context, since all code paths in the iSPMD context are explored in parallel for a gang, the calls appear to happen in a random order, a bit as if they were made from multiple goroutines; there is potential for race conditions that the developer needs to be aware of.) If a tagged function is called from a non-iSPMD context, the compiler could for now report an error and disallow it, even though we should always be able to generate a serial version as a fallback.
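The "break out, iterate over the vector" lowering for untagged functions can be sketched in plain Go. This is a hypothetical emulation, assuming a 4-lane gang and an illustrative scalar function `scale`; the names `callPerLane` and `lanes` are mine, not part of the proposal.

```go
package main

const lanes = 4 // hypothetical gang width

// scale is an ordinary (non-iSPMD) function.
func scale(x int32) int32 { return x * 10 }

// callPerLane sketches how a compiler could lower a call to an
// untagged function from an iSPMD context: leave the vector world,
// call the scalar function once per active lane, then resume.
// Lanes whose mask bit is false keep their old value.
func callPerLane(mask [lanes]bool, args [lanes]int32) [lanes]int32 {
	out := args
	for l := 0; l < lanes; l++ {
		if mask[l] {
			out[l] = scale(args[l]) // serial call, one lane at a time
		}
	}
	return out
}

func main() {
	mask := [lanes]bool{true, false, true, true}
	got := callPerLane(mask, [lanes]int32{1, 2, 3, 4})
	println(got[0], got[1], got[2], got[3]) // 10 2 30 40
}
```

The per-lane loop is also where the "called in random order" caveat comes from: nothing in the iSPMD model promises which lane's call happens first.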

It should be allowed to have iSPMD functions defined in interfaces.

It should be possible to provide multiple variants of the same function, each with a different iSPMD architecture specified, to be able to call specialized assembly functions. I could not think of a different way than the following:

//go:ispmd AVX for ReduceAdd
func asmReduceAddAVX(a int) uniform int
//go:ispmd SSE4 for ReduceAdd
func asmReduceAddSSE4(a int) uniform int
//go:ispmd None for ReduceAdd
func noneReduceAdd(a int) uniform int { return a }

As shown in this example, we might need to be able to call assembly code for this proposal to express its full potential. This means there should be a calling convention for calling assembly functions from an iSPMD context. This is mostly an implementation detail and might vary between compilers. The main issue here is that it introduces a concept of overloading, which does not exist in the Go ecosystem at this point.

Also, as stated earlier, how to optimize the transition from iSPMD to assembly is left out of this proposal, and this proposal does not address, nor should it block, support for intrinsics-style solutions, which are orthogonal to it.

In terms of code generation, a version of the iSPMD context is expected to be generated for each SIMD extension supported by the architecture targeted by the compiler, along with a serial implementation to cover CPUs that do not have the extension when necessary (the same logic applies to iSPMD-tagged functions). If the compiler knows the SIMD extension and tries to generate code for it, it should never fail to generate the code. The compiler should generate the right code to select the best SIMD extension for the architecture it is targeting and provide a meaningful fallback depending on the host CPU's capability.
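The runtime selection between generated variants could look like today's hand-rolled SIMD dispatch. Below is a minimal sketch under loud assumptions: `hasAVX2` is a stand-in for real CPU feature detection (not a real API), and only the serial variant is actually implemented here.

```go
package main

// reduceAdd is selected once at startup from the best variant the
// host CPU supports; the variant names and detection are hypothetical.
var reduceAdd func(v []int32) int32 = reduceAddSerial

// hasAVX2 stands in for real CPU feature detection (e.g. CPUID).
// Assumption for this sketch: pretend the feature is absent.
func hasAVX2() bool { return false }

// reduceAddSerial is the fallback the compiler would always emit.
func reduceAddSerial(v []int32) int32 {
	var s int32
	for _, x := range v {
		s += x
	}
	return s
}

func init() {
	if hasAVX2() {
		// reduceAdd = reduceAddAVX2 // vectorized variant would go here
	}
}

func main() {
	println(reduceAdd([]int32{1, 2, 3})) // 6
}
```

Under this proposal the compiler would emit this dispatch (and the variants) automatically rather than the developer writing it by hand.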

It is also logical that you cannot use an ispmd construct inside an iSPMD context (be it in a function or inside a previous ispmd block), as it does not make sense to switch again to an iSPMD context when you are already in one.

We should trust the compiler to generate code that leads to the exact same computational result for each SIMD extension, but for testability purposes we should have support in tests and benchmarks to force the SIMD extension choice when running them, as we might not want to test all combinations all the time. And we might want to use PGO after running through a benchmark. It would also be necessary to upgrade staticcheck to help detect data races inside an iSPMD context.

Additionally, while reading https://pharr.org/matt/blog/2018/04/26/ispc-volta-more-on-performance and https://ispc.github.io/perfguide.html , it seems important to address the following keywords that exist in ISPC.

Specifying instruction set and mask size at compile time: with ISPC and CppSPMD, it is necessary to specify the targeted instruction set and the size of the mask. This is not simple for developers and requires work for every newly supported architecture. It would be best to avoid having to specify it in most cases. It seems possible that the compiler could have a pass that explores each iSPMD context and figures out the smallest varying data type processed in it (for functions it might infer this directly from the parameter types). With that information, the compiler should be able to infer the size of the mask (length of SIMD register / size of type) and spare the developer from specifying it. This should make code more portable and easier to read.

uniform: a very common keyword in many GPU languages; it indicates that a variable has the same value for all members of the gang in the current iSPMD context. In this proposal, any variable created outside an ispmd block is automatically uniform and any variable created inside is varying. A difficulty arises when calling a function: you can now pass a variable that was declared outside the iSPMD context (and is therefore uniform) to an iSPMD function that expects only varying values. It is also sometimes necessary to declare a variable before an assignment, which means there will be cases where a uniform variable needs to be declared before the assignment that would allow type information to be propagated. This information can apply at different levels of a variable's type: if, in an iSPMD context, you are building a slice filled with the content of a variable that comes from a non-iSPMD context, you need to specify that the element type of the slice is also uniform. This means that, in an iSPMD context, we need the ability to mark a type as uniform. As an example:

ispmd func mandel(c_re float32, c_im float32, count uniform int) int32 {
    var test uniform []uniform int
}

I do not see how the type system could be made to work without the uniform keyword, which is why the entire proposal is designed around the idea of a deep change to the language.
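One way to picture the uniform/varying distinction in today's Go, as a mental model only: a varying T is morally a `[lanes]T` (one value per lane) while a uniform T is a plain T shared by the gang. The `lanes` constant, the type aliases and the `broadcast` helper below are all illustrative assumptions, not proposed API.

```go
package main

const lanes = 4 // hypothetical gang width

// A varying int32 holds one value per lane; a uniform int32 is a
// single value shared by the whole gang.
type varyingInt32 = [lanes]int32
type uniformInt32 = int32

// broadcast models the implicit uniform-to-varying conversion the
// proposal would allow. The reverse direction needs an explicit
// reduction and would be a type error, which is the race the
// earlier Sum example runs into.
func broadcast(u uniformInt32) varyingInt32 {
	var v varyingInt32
	for l := range v {
		v[l] = u
	}
	return v
}

func main() {
	v := broadcast(7)
	println(v[0], v[3]) // 7 7
}
```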

It would be logical to ask whether we also need a varying keyword to do the opposite of uniform. To define a variable or type coming from an iSPMD context that can be accessed or carried around by non-iSPMD code, or simply to define a useful structure type for our iSPMD context, we need a way to define structures with iSPMD data types (ideally not just anonymous structures). As we have been using the keyword ispmd so far, we can likely continue to use it to define types that embed varying data. Something like this would do:

ispmd type something struct {
    x  int
    y  int
    id uniform int
}

In this case, if you pass the structure around in a non-iSPMD context, only id, which is uniform, would be accessible; none of the other fields would be accessible to the non-iSPMD context. For simplicity, the structure can likely only be allocated inside an iSPMD context, and the non-uniform (varying) fields can only be used inside an iSPMD context.

cif: having to explore both branches of an if can be quite costly, especially if one branch is rarely taken (error handling, for example). cif tells ISPC to generate additional tests to avoid walking a branch that is usually not worth it (checking the content of the entire mask and jumping depending on it). The equivalent here would be calling if AnyTrue(test) { if test {} } with ispmd func AnyTrue(test bool) uniform bool. Not the prettiest syntax, but as this is possible within the existing proposal without adding any keyword, I do not think we need an equivalent to cif at this point.
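The AnyTrue helper mentioned above can be sketched as a mask reduction in plain Go. This is an illustrative emulation, assuming a 4-lane gang and modeling a varying bool as a mask array; on real hardware this would be a single mask test instruction.

```go
package main

const lanes = 4 // hypothetical gang width

// AnyTrue sketches the proposed helper: it reduces a varying bool
// (modeled here as a mask array) to a single uniform bool, letting
// the generated code skip a branch that no lane needs to take.
func AnyTrue(mask [lanes]bool) bool {
	for _, b := range mask {
		if b { // a single SIMD mask test in real codegen
			return true
		}
	}
	return false
}

func main() {
	println(AnyTrue([lanes]bool{false, false, true, false})) // true
	println(AnyTrue([lanes]bool{}))                          // false
}
```

Wrapping a rare branch in `if AnyTrue(test)` reproduces what ISPC's cif does: the expensive masked branch body is only entered when at least one lane actually needs it.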

foreach: this has two uses. First, it iterates on a different step for each member of the gang. Second, it enables generating a more optimized loop. The use of each range should map directly to ISPC's foreach construct, so I do not think we need a foreach keyword or its features at this point.

The question of how to iterate over plain numbers when you do not have a slice to start from is open; I do not have a good answer at this point, nor do I know whether it is really necessary in practice. A possibility would be to extend the use of each as follows:

ispmd {
    var partialSum int32
    for i := each 1..10 {
        partialSum += i
    }
}
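A serial sketch of what `each` over a numeric range could lower to: each step loads a register with consecutive values and masks off the lanes past the end of the range. The 4-lane width and the `sumRange` name are illustrative assumptions.

```go
package main

const lanes = 4 // hypothetical gang width

// sumRange sketches `for i := each lo..hi`: each loop step covers
// lanes consecutive values, with a mask disabling the lanes that
// fall past the end of the range.
func sumRange(lo, hi int32) int32 {
	var partial [lanes]int32 // varying partial sums
	for base := lo; base <= hi; base += lanes {
		for l := int32(0); l < lanes; l++ {
			i := base + l
			if i <= hi { // mask off out-of-range lanes
				partial[l] += i
			}
		}
	}
	// Final uniform reduction, as ReduceAdd would do.
	var s int32
	for _, x := range partial {
		s += x
	}
	return s
}

func main() {
	println(sumRange(1, 10)) // 55
}
```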

The second foreach construct is foreach tiled. I think we can skip it for now and look later at how to implement it, as it might be related to things like multidimensional arrays/slices.

Summary

This proposal changes the language by adding 3 keywords: ispmd, uniform and each; the ability to switch to an iSPMD context when entering a block of code marked with ispmd and to stay in an iSPMD context when calling a function; and the ability to call assembly directly from within an iSPMD context. This change will impact the entire ecosystem and all tooling. It should yield significant performance gains for compute-heavy code, and I would expect performance close to what ISPC has proven is doable.

The Go compiler might slow down a little at first, as it will need an additional pass and will generate multiple variants of the same code, but since it parses and encodes data itself, it should be able to use iSPMD contexts too, and so it should get faster over time as this proposal gets used.

This tries to remain a "minimal" proposal, but I would expect that over time new functions like the ReduceAdd or AnyTrue examples above would be proposed for the stdlib, depending on the Go ecosystem's needs.

Fully understanding all the potential of the language will be harder, but at the same time it would make Go one of the best teaching tools for parallelism, as the alternatives are more complex to approach or harder to get access to. This could mean universities potentially switching to Go as part of the curriculum.

The benefit of having iSPMD directly in the language shouldn't escape anyone: parsers, encoders, decoders, compression, comparison, search and sort would all be good targets to improve once the compiler has iSPMD capability. I can see a lot of functions inside the standard library that could use this, but as this is not a magical solution, the code will have to express this parallelism to actually benefit. That means code changes, not just sprinkling iSPMD flags around (this is logical; otherwise auto-vectorization would have succeeded by now).
Typical benchmark example

ispmd func mandel(c_re float32, c_im float32, count uniform int32) int32 {
    var z_re float32
    var z_im float32

    z_re = c_re
    z_im = c_im

    var i int32
    for i = 0; i < count; i++ {
        var new_re, new_im float32

        if z_re*z_re+z_im*z_im > 4.0 {
            break
        }

        new_re = z_re*z_re - z_im*z_im
        new_im = 2.0 * z_re * z_im

        z_re = c_re + new_re
        z_im = c_im + new_im
    }
    return i
}


func SerialMandelbrot(
    x0 float32, y0 float32, x1 float32, y1 float32,
    width int, height int,
    startRow int, totalRows int,
    maxIterations int32,
    output []int32) {

    dx := (x1 - x0) / float32(width)
    dy := (y1 - y0) / float32(height)

    endRow := startRow + totalRows

    ispmd for i := each 0..width-1 {
        for j := 0; j < height; j++ {
            x := x0 + float32(i)*dx
            y := y0 + float32(j)*dy

            index := j*width + i

            output[index] = mandel(x, y, maxIterations)
        }
    }
}
@gopherbot gopherbot added this to the Proposal milestone Feb 20, 2023
@DeedleFake

I am not an expert in the topics covered by this proposal but I can tell you right now this seems very unlikely to get accepted to me. Adding keywords to the language is a violation of the Go 1 compatibility guarantee in 99.9% of circumstances because it causes any existing code where those words are used as identifiers to break, and uniform and each in particular are almost definitely used in that way in a bunch of code bases.

@Bluebugs
Author

Indeed, this is a valid concern, but I do not think you can actually use those keywords in this context in your code today. Sure, you can create a type or a variable with those names, but you wouldn't be able to have them in the same place in the language grammar. It will likely require a bit of gymnastics in the Go parser to not break existing code, but I do think it is doable to implement this proposal without breaking existing programs.

@DeedleFake

DeedleFake commented Feb 20, 2023

What if I write?

func Example(uniform int32) {
  ...
}

Is that an unnamed argument of type uniform int32 or an int32 argument named uniform? Would it be disambiguated based on the usage of ispmd before func?

@Bluebugs
Author

What if I write?

func Example(uniform int32) {
  ...
}

Is that an unnamed argument of type uniform int32 or an int32 argument named uniform? Would it be disambiguated based on the usage of ispmd before func?

In my proposal, you cannot have a uniform outside of an ispmd context (as all variables outside an ispmd context are by definition uniform). You would have to prefix the function with ispmd to have this problem, and then it isn't one: since ispmd contexts didn't exist before, existing code cannot run into it.

@ianlancetaylor ianlancetaylor added v2 An incompatible library change LanguageChange Suggested changes to the Go language labels Feb 21, 2023
@ianlancetaylor
Contributor

I'm sorry, I don't understand this proposal. Can you give a short, clear explanation of what the ispmd mandel function at the end of the proposal is doing?

Go is intended to be a simple language that is easy to understand. I'm sure I am missing something, but right now this proposal seems neither simple nor easy to understand.

@Bluebugs
Author

Bluebugs commented Feb 21, 2023

I'm sorry, I don't understand this proposal. Can you give a short, clear explanation of what the ispmd mandel function at the end of the proposal is doing?

It is a direct implementation of the Mandelbrot set. As pointed out in the Context section, the Mandelbrot set is often used as a benchmark. The main reason is that without a proper expression of parallelism in the control flow of the program, you cannot use SIMD instructions for it (the parallelism is not per iteration inside a pixel, but between pixels). In this example, I used ispmd to say that the iteration on x allows data parallelism (processing a bunch of pixels of each line in parallel). The ispmd before func mandel indicates that this function can be executed in parallel on multiple pixels at once, as there is no data race when doing so.

@ianlancetaylor
Contributor

Thanks. What is the exact meaning of ispmd in ispmd func mandel(c_re float32, c_im float32, count uniform int32) int32 { ? Your comment above says that it means that the function can be executed in parallel on multiple pixels. As far as I can see, that is already true, as the function is pure: it only depends on its inputs, not on global state. Therefore, two parallel calls to the function can't possibly interfere with each other. Or so it seems to me. So what is the effect of the explicit ispmd? Thanks.

@DeedleFake

Is the usage of each in SerialMandelbrot() a mistake?

@Bluebugs
Author

Thanks. What is the exact meaning of ispmd in ispmd func mandel(c_re float32, c_im float32, count uniform int32) int32 { ? Your comment above says that it means that the function can be executed in parallel on multiple pixels. As far as I can see, that is already true, as the function is pure: it only depends on its inputs, not on global state. Therefore, two parallel calls to the function can't possibly interfere with each other. Or so it seems to me. So what is the effect of the explicit ispmd? Thanks.

There are two reasons to specify ispmd. The first is to indicate that the function can process a bunch of things in parallel (here pixels, but it could be anything). Not all ispmd functions are pure in that respect. It does open the question of whether it is necessary to specify ispmd for pure functions; maybe not.

The second reason is to be able to specify that some of the parameters or the returned value are uniform. A uniform parameter indicates that the passed value is the same for the entire bunch of things being processed. A uniform return value means that the function concluded something for that bunch of things and summarized it as just one uniform value (not used in this example).

In this example, uniform is used to specify that the maximal number of iterations is the same for everything being processed by the mandel function. To avoid the grammar problem pointed out by @DeedleFake, I could not think of another approach here (for example, introducing varying as a type without ispmd leads to exactly the kind of problem pointed out above).

@Bluebugs
Author

Is the usage of each in SerialMandelbrot() a mistake?

No, it isn't a mistake. It is actually what makes things work. It means that each number will be processed once, simply loading a SIMD register with unique values starting from 0 and ending at width-1.

With that in mind, it iterates over a bunch of pixels from each line at a time, doing the whole vertical run first (via the for j := 0; j < height; j++ loop), then moving along the horizontal axis.

@ianlancetaylor
Contributor

In this example, uniform is used to specify that the maximal number of iterations is the same for everything being processed by the mandel function.

I don't understand how the Go compiler could use this information. I don't understand how it could affect the code generated for this function.

@Bluebugs
Author

In this example, uniform is used to specify that the maximal number of iterations is the same for everything being processed by the mandel function.

I don't understand how the Go compiler could use this information. I don't understand how it could affect the code generated for this function.

From the compiler's point of view, each line directly manipulates SIMD registers. Let's start inside SerialMandelbrot with the line ispmd for i := each 0..width-1 {.

This first line switches to an ispmd context and allocates a mask with one bit for each element that a SIMD register can hold. The for ... each should result in i being a SIMD variable loaded with values going from 0 up to the number of elements this SIMD instruction set can hold. The next line, for j := 0; j < height; j++ {, results in the "allocation" of a second SIMD variable, with 0 in every element of that varying. The compiler should be able to compute x, y and index directly as SIMD operations.

Now comes the call to mandel. As mandel is an ispmd function manipulating varying float32, the compiler can directly pass the mask, along with x and y as SIMD variables (maxIterations, being uniform, is passed as a single value). When entering the function, the mask is still true for all elements.

With for i = 0; i < count; i++ {, it starts by copying the mask and creating a new SIMD variable initialized to 0 in all its members. It calculates z_re * z_re + z_im * z_im using SIMD operations and compares the result, again using SIMD, against 4. Every lane where the test is true breaks, which here means that the corresponding bit of the current mask gets set to zero and won't be restored until the for loop is done iterating. Since i is a SIMD variable, the members for which the mask is zero stop getting updated after break. new_re, new_im, z_re and z_im are updated using SIMD operations, as nothing should prevent that; again, the mask should be taken into account so that only the still-enabled elements of the SIMD variables are updated. The loop is done iterating when the mask is permanently all zero (either because i < count made it so or because the bits have been cleared by break). Once the loop is done, i is returned as a SIMD variable/register.

Back in SerialMandelbrot, the returned value can be saved into output. index being a SIMD variable of the same size as the returned value, it provides the location at which each returned element gets saved in output.

Whatever the SIMD instruction set, the compiler will thus always be able to generate SIMD code for an ispmd section. My explanation is maybe not the clearest; I think Matt does a better job than me explaining how the compiler works here: https://pharr.org/matt/blog/2018/04/21/ispc-volta-c-and-spmd and in a more general way too.
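To make the masked execution concrete, here is a plain-Go emulation of the mandel loop for one gang. Everything here is a sketch under assumptions: a hypothetical 4-lane gang, the name mandelGang, and explicit per-lane loops standing in for single SIMD instructions.

```go
package main

const lanes = 4 // hypothetical gang width

// mandelGang emulates the masked execution described above: all four
// lanes run the escape-time loop together, and a lane that takes the
// break simply has its mask bit cleared while the others continue.
func mandelGang(cRe, cIm [lanes]float32, count int32) [lanes]int32 {
	zRe, zIm := cRe, cIm
	var iter [lanes]int32
	mask := [lanes]bool{true, true, true, true}
	for i := int32(0); i < count; i++ {
		anyActive := false
		for l := 0; l < lanes; l++ { // SIMD ops in real codegen
			if !mask[l] {
				continue // this lane already broke out
			}
			if zRe[l]*zRe[l]+zIm[l]*zIm[l] > 4.0 {
				mask[l] = false // the break clears this lane's mask bit
				continue
			}
			newRe := zRe[l]*zRe[l] - zIm[l]*zIm[l]
			newIm := 2.0 * zRe[l] * zIm[l]
			zRe[l] = cRe[l] + newRe
			zIm[l] = cIm[l] + newIm
			iter[l] = i + 1 // i keeps advancing only for active lanes
			anyActive = true
		}
		if !anyActive {
			break // mask is all zero: the whole gang is done
		}
	}
	return iter
}

func main() {
	re := [lanes]float32{0, 2, -1, 0.3}
	im := [lanes]float32{0, 2, 0, 0}
	it := mandelGang(re, im, 20)
	println(it[0], it[1], it[2], it[3])
}
```

In the sample run, lanes 0 and 2 never escape (they reach the iteration cap), lane 1 escapes immediately, and lane 3 escapes partway through, which is exactly the divergent control flow the mask exists to handle.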

@ianlancetaylor
Contributor

I still don't really understand, but let me try asking a different question. Does the ispmd keyword have any semantic meaning in the program? That is, is the result of the function affected in any way by the presence of the ispmd keyword?

@Bluebugs
Author

I still don't really understand, but let me try asking a different question. Does the ispmd keyword have any semantic meaning in the program? That is, is the result of the function affected in any way by the presence of the ispmd keyword?

The ispmd keyword indicates that each call to the function will process a bunch of pixels at a time (Intel uses gang or lane in their documentation to describe that grouping; here, in mandel's case, the unit is a pixel). From a practical point of view, mandel would on most Intel chips select AVX2 and process 8 pixels per function call (also returning the result for those 8 pixels per call). On the latest AMD chips with AVX-512 it would be 16 pixels per call, and on a Mac M1 with Neon, 4 pixels per call.

@Bluebugs
Author

Maybe a different angle to answer your question about semantics is to take a step back on what ispmd, implicit Single Program Multiple Data, means when applied to a block of code. It means that an ispmd block of code is implicitly able to process multiple data for every single execution of that block of code.

@ianlancetaylor
Contributor

I understand that ispmd will somehow cause the function to be compiled differently. That's not what I'm asking. I'm asking whether ispmd will cause the function to return different results. To put it another way, if we completely remove ispmd, will the program produce exactly the same output? It may run slower, but will the output be the same?

@Bluebugs
Author

Totally. In the same way a goroutine doesn't mandate multiple cores or threads, ispmd does not mandate generating vectorized code. A scalar is, after all, a vector of size 1. I think that is a necessary property, as it always allows for a fallback.

@ianlancetaylor
Contributor

Thanks. Since this doesn't affect semantics, that suggests that this should be a compiler directive, like //go:noinline, rather than a keyword. (The existing compiler directives are documented at https://pkg.go.dev/cmd/compile).

@Bluebugs
Author

Wouldn't that make it less obvious, when reading the code, that there is a potential for race conditions?

@Bluebugs
Author

How do you imagine adding the uniform type information with a compiler directive?

@Bluebugs
Author

Actually, thinking more about this: if the question was "can you add ispmd to any function and expect the same result", the answer would be no. Does that change your recommendation?

@DeedleFake

When you say that you cannot expect the same result, do you mean that you could write a function that will return something different for the same arguments based purely on the presence of the ispmd annotation, or that some functions will fail to compile one way or the other? In other words, are you talking about a run-time result or a compile-time result? If it's run-time, can you give an example of such a function?

@Bluebugs
Author

The second code example shows a potential race condition that should ideally result in a compile-time failure. As for a run-time failure, a silly example like the following would give an unpredictable result:

ispmd func lastSeenOddEven(seen uniform [2] uniform int, value int) {
   seen[value % 2] = value
}

@ianlancetaylor
Contributor

This proposal is focused on code generation rather than language, and the details of how it changes the language are difficult to understand. Go is intended to be a simple language that is easy to understand. Introducing complex semantics for performance reasons is not a direction that the language is going to take.

Therefore, this is a likely decline. Leaving open for four weeks for final comments.

@gedw99

gedw99 commented Mar 3, 2023

+1 for this proposal, but I don't have my hopes up for it getting in.

It seems to be useful for so many use cases:

  • gpu
  • parsing data for ML

@DeedleFake

From what I can tell, some possible usages of SIMD can be done automatically as a compiler optimization. I don't know if the Go compiler does such optimizations already or not, but, if it doesn't, would it be acceptable to you to change this proposal to add them, @Bluebugs? You could probably revisit the idea of more explicit usage later once the benefits can be shown in practice in the cases covered by the automatic optimizations, though I doubt that language changes to that end will ever be accepted so it would have to be via a new standard library package or something.

@Bluebugs
Author

Bluebugs commented Mar 3, 2023

@DeedleFake I have pointed in my proposal to explanations of why auto-vectorization is not an approach that has yielded anything so far. To paraphrase: the code you write does not express enough information to allow for reliable vectorization.

As SIMD is something people here are less familiar with, let's take the example of threads. You wouldn't argue that the compiler can automatically add threads to a code base that was not written with them in mind. We have the go keyword to tell the language that it can spawn that function in a thread and that it will be OK for the program.

The problem with SIMD is the same: the compiler can only be very restrictive about what it vectorizes, because it cannot detect all race conditions and has to fall back to the safest possible solution. The developer has to write code with the constraints of SIMD parallelism in mind, in the same way a developer writes a thread-safe function. That is what this proposal is about.

The difference between the goroutine discussion and the spmd/simd discussion is that most people were taught in school what threads are, and there is a broader culture and understanding of them. SIMD has mostly been an assembly art, with a bit of takeoff through GPU programming, but it is still a subculture of computer science that most people do not understand or know about. This means it requires additional learning for anyone to pick up. I do think this proposal would lead to the most readable and accessible syntax for writing code that generates SIMD instructions, but the community doesn't seem ready for something like it.

I consider that adding auto-vectorization to Go would be a waste of everyone's time and effort. I would prefer this proposal to be closed rather than lead the ecosystem into something I think is wrong. If someone wants to push in that direction, they should feel free to open a proposal for that purpose.

@gedw99

gedw99 commented Mar 3, 2023

I am fascinated with SIMD, and if this makes it easier to play with and educate myself about SIMD, that's a bonus.

It would allow more Go devs to optimise hotspots and grow the knowledge around using SIMD too.

@ghost

ghost commented Mar 3, 2023

The problem with SIMD is the same: the compiler can only be very restrictive about what it vectorizes, because it cannot detect all race conditions and has to fall back to the safest possible solution. The developer has to write code with the constraints of SIMD parallelism in mind, in the same way a developer writes a thread-safe function. That is what this proposal is about.

I disagree with the above viewpoint (highlighted in bold). A compiler can detect all SIMD race conditions fairly easily. A major problem is that usually the set of run-time race conditions is only a very small subset of the compile-time race conditions, in which case part of the question reduces to whether the compiler is a JIT compiler.

@Bluebugs
Author

Bluebugs commented Mar 3, 2023

@atomsymbol oh, indeed, that's a good point. An assumption of this proposal is also that a JIT is a no-go.

I would think that even with a JIT, you would still need to somehow express cross-lane operations to take full advantage?

@gedw99

gedw99 commented Mar 6, 2023

Hey all,

this is related...

https://github.com/SnellerInc/sneller

using AVX-512 to search over JSON. Lots of SIMD acceleration to make it fast.

@bdandy

bdandy commented Mar 9, 2023

I'm not sure if I understand the topic correctly, but what about using os/cpu (to be implemented), so it may look like:

import (
    "os/cpu" // hypothetical package, to be implemented
    "unsafe"
)

func Sum(a []int32) int32 {
    switch {
    case cpu.HasAVX():
        return asmReduceAddAVX(&a[0], int32(len(a)))
    case cpu.HasSSE4():
        return asmReduceAddSSE4(&a[0], int32(len(a)))
    default:
        return noneReduceAdd(&a[0], int32(len(a)))
    }
}

func asmReduceAddAVX(a *int32, size int32) int32
func asmReduceAddSSE4(a *int32, size int32) int32

func noneReduceAdd(a *int32, size int32) int32 {
    var sum int32
    for _, b := range unsafe.Slice(a, size) {
        sum += b
    }
    return sum
}

@Bluebugs
Author

Bluebugs commented Mar 9, 2023

@bdandy the goal is that most of the code should not be written in assembly, but directly in Go; the compiler would generate the code most of the time. I would expect that in general developers won't use this feature directly and will instead rely on modules that provide the few cross-lane primitives that are useful (ispc has a good list of reduce and shuffle operations to start from).

The exception being the cross-lane operations in such a module. Those operations, I think, will be implemented in assembly, but the caller will already have done the architecture dispatching (as you need to specialize the entire ispmd context for one architecture). Since the caller already did the dispatching, the selection of which function to call seems to me to be more of a compiler directive.

Still, your solution could likely work too. It would require exposing and also passing the lane mask to the assembly function. The benefit is that we wouldn't need to add a compiler directive. And I think os/cpu could already be implemented today without any change to anything.

I also think that ispmd makes it possible to completely ignore what the underlying SIMD instruction set is. I think that is an important feature, as it gets people to focus on the algorithm and the code instead of the assembly. In a way, this proposal would do for SIMD code what moving away from assembly in favour of higher-level languages did for scalar code a few decades ago. We have never programmed GPUs in assembly, yet we find it normal to write SIMD code in assembly even today.

@gedw99

gedw99 commented Mar 9, 2023

Ah, so an assembler generator written in Go?

@Bluebugs
Author

Bluebugs commented Mar 9, 2023

@gedw99 the Go code written in an ispmd section can be directly converted to SIMD code. It is as much of an assembler generator as the Go compiler is. The ispmd section gives enough guarantees and hints to the compiler to safely generate SIMD code. For a full understanding of how the compiler can do that, I really recommend reading Matt's post on the topic here: https://pharr.org/matt/blog/2018/04/21/ispc-volta-c-and-spmd

@gedw99

gedw99 commented Mar 9, 2023

Thanks!

OK, read that, and the story of how the compiler came to be open sourced... good story.

So basically, from Go, call this C code? https://github.com/ispc/ispc

@gedw99

gedw99 commented Mar 9, 2023

This looks like a similar effort in Go?

https://www.andrew.cmu.edu/user/sakshamj/15618/final.pdf

@Bluebugs
Author

Bluebugs commented Mar 9, 2023

@gedw99 yes, this proposal is totally influenced by this thesis and by discussions with Matt. There are a few limitations to using ispc (mostly its weird syntax and the requirement for cgo). But there is no reason the Go compiler couldn't actually do the same things ispc does, without calling ispc. That enables better syntax, full integration with Go, and no cgo.

@ianlancetaylor
Contributor

No change in consensus.

@ianlancetaylor closed this as not planned on Apr 12, 2023
@golang golang locked and limited conversation to collaborators Apr 11, 2024