Create performance tips docs section #615
# Performance Tips

All the usual [Julia performance tips](https://docs.julialang.org/en/v1/manual/performance-tips/) apply.
As always, [profiling your code](https://docs.julialang.org/en/v1/manual/profile/#Profiling-1) is generally a useful way of finding bottlenecks.
Below follow some Flux-specific tips and reminders.

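A minimal profiling sketch, for reference (the `model` and `loss` below are hypothetical placeholders, assuming a recent Flux version; `Profile` is Julia's built-in profiling standard library):

```julia
using Flux, Profile

model = Dense(10 => 2)                  # toy model, just for illustration
x = rand(Float32, 10, 1_000)
loss(x) = sum(abs2, model(x))           # toy loss

loss(x)                                 # run once so compilation is not profiled
@profile for _ in 1:100; loss(x); end   # collect samples over repeated calls
Profile.print()                         # see where the time goes
```
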
## Don't use more precision than you need

Flux works great with all kinds of number types.
But often you do not need to be working with, say, `Float64` (let alone `BigFloat`).
Switching to `Float32` can give you a significant speed up:
the memory usage is halved, so allocations are cheaper and you use less memory overall,
and on most hardware the operations themselves also tend to be faster,
since more `Float32` values fit in a SIMD lane and GPUs are typically much faster at `Float32` than `Float64`.

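As a minimal sketch (assuming a recent version of Flux, where layers are initialised with `Float32` parameters by default), the main thing to do is convert your data once, up front:

```julia
using Flux

x64 = rand(Float64, 10, 100)    # e.g. data that arrived as Float64
x32 = Float32.(x64)             # convert once, outside the training loop

model = Dense(10 => 5, relu)    # parameters are Float32 by default
eltype(model(x32))              # Float32 all the way through
```

Recent versions of Flux also provide `f32(model)` to convert an existing model's parameters.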

## Make sure your custom activation functions preserve the type of their inputs

Not only should your activation functions be [type-stable](https://docs.julialang.org/en/v1/manual/performance-tips/#Write-%22type-stable%22-functions-1),
they should also preserve the type of their inputs.

A very artificial example, using an activation function like

```julia
my_tanh(x) = Float64(tanh(x))
```

will result in performance on `Float32` input that is orders of magnitude slower than with the normal `tanh`,
because it forces slow mixed-type multiplication in the dense layers.

This means that if you change your data, say, from `Float64` to `Float32` (which should give a speedup: see above),
you will see a large slow-down.

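To see the promotion happen, here is a small sketch (assuming a recent Flux version; the two-layer model is made up for illustration):

```julia
using Flux

my_tanh(x) = Float64(tanh(x))   # always returns Float64, even for Float32 input

m_bad  = Chain(Dense(100 => 100, my_tanh), Dense(100 => 10))
m_good = Chain(Dense(100 => 100, tanh),    Dense(100 => 10))

x = rand(Float32, 100, 64)      # a batch of 64 Float32 feature vectors

eltype(m_bad(x))    # Float64: the second Dense multiplies Float32 weights by a Float64 matrix
eltype(m_good(x))   # Float32: everything stays in Float32
```
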
This can occur sneakily, because type promotion can be triggered just by interacting with numeric literals.
E.g. the following runs into the same problem as above:
| 
         There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 
        Suggested change
       
    
  | 
||||||
| 
     | 
||||||
| ``` | ||||||
| leaky_tanh(x) = 0.01x + tanh(x) | ||||||
| ``` | ||||||
| 
     | 
||||||
While one could change the activation function (e.g. to use `0.01f0*x`) to avoid this whenever the input type changes,
the idiomatic (and safe) way is to use `oftype`:

```julia
leaky_tanh(x) = oftype(x/1, 0.01)*x + tanh(x)
```

Here `x/1` ensures the reference value is a floating-point number even when `x` is an integer, and `oftype` then converts the literal `0.01` to that same type.
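As a quick sanity check (just a sketch), the `oftype` version preserves the input type:

```julia
leaky_tanh(x) = oftype(x/1, 0.01)*x + tanh(x)

leaky_tanh(0.5f0) isa Float32   # true: the 0.01 literal is converted to Float32
leaky_tanh(0.5)   isa Float64   # true
leaky_tanh(1)     isa Float64   # true: x/1 promotes integer input to floating point
```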

## Evaluate batches as Matrices of features, rather than sequences of Vector features

It can sometimes be tempting to process your observations (feature vectors) one at a time, e.g.

```julia
function loss_total(xs::AbstractVector{<:Vector}, ys::AbstractVector{<:Vector})
    sum(zip(xs, ys)) do (x, y_target)
        y_pred = model(x)  # evaluate the model on a single observation
        return loss(y_pred, y_target)
    end
end
```

However, it is much faster to concatenate the observations into a matrix,
as this will hit BLAS matrix-matrix multiplication, which is much faster than the equivalent sequence of matrix-vector multiplications,
even though it means allocating new memory to store them contiguously.

```julia
x_batch = reduce(hcat, xs)
y_batch = reduce(hcat, ys)
...
function loss_total(x_batch::Matrix, y_batch::Matrix)
    y_preds = model(x_batch)
    sum(loss.(y_preds, y_batch))
end
```

When doing this kind of concatenation, use `reduce(hcat, xs)` rather than `hcat(xs...)`.
This will avoid the splatting penalty and will hit the optimised `reduce` method.

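For a rough illustration (just a sketch, with made-up sizes), you can compare the two forms yourself:

```julia
xs = [rand(Float32, 100) for _ in 1:10_000]   # 10_000 feature vectors of length 100

# Both build the same 100×10_000 matrix, but splatting turns this into a
# call with 10_000 arguments, which is costly to compile and dispatch.
@time reduce(hcat, xs);
@time hcat(xs...);
```
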
---

**Review comment:** Seems relevant to mention `Float32` on GPU here. Also, operations do tend to be faster, since you can fit more numbers in a SIMD lane at a given size.

**Reply:** Feel free to use the GitHub suggestion feature, or to PR after it is merged. I know little about GPUs, so someone else is better placed to write that.

**Review comment:** You could just remove the "not because the operations are faster, but because the memory usage is halved." part?