Merge #1150
1150: generalize and homogenize losses r=CarloLucibello a=CarloLucibello

In order to enforce some consistency in the loss interface, this PR does the following:

- adds to every loss an `agg` keyword, which defaults to the function `mean`. This selects the aggregation applied to the elementwise losses (typically `mean` or `sum`); use `identity` for no aggregation.
- adds a `dims` keyword where meaningful.
- fixes other small inconsistencies among the losses.

For instance, the crossentropy definition becomes
```julia
function crossentropy(ŷ, y; dims=1, agg=mean, ϵ=eps(eltype(ŷ)))
    agg(.-sum(y .* log.(ŷ .+ ϵ); dims=dims))
end
```
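
With this signature the reduction is chosen at the call site. A minimal usage sketch (shapes and values are illustrative only; `softmax` and `onehotbatch` are just used to build a well-formed input):

```julia
using Flux
using Flux: crossentropy, onehotbatch

ŷ = softmax(rand(Float32, 10, 3))           # 10 classes along dims=1 (the default), batch of 3
y = Float32.(onehotbatch([1, 4, 7], 1:10))  # one-hot targets as a plain Float32 matrix

crossentropy(ŷ, y)                # scalar: mean over the batch (default `agg=mean`)
crossentropy(ŷ, y, agg=sum)       # scalar: summed instead of averaged
crossentropy(ŷ, y, agg=identity)  # 1×3 row of per-sample losses (no aggregation)
```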


Co-authored-by: CarloLucibello <carlo.lucibello@gmail.com>
bors[bot] and CarloLucibello authored Jul 1, 2020
2 parents 5d93bc7 + b81552a commit 822f13c
Showing 15 changed files with 231 additions and 234 deletions.
23 changes: 15 additions & 8 deletions NEWS.md
@@ -1,12 +1,12 @@
# v0.11
* Add [kaiming initialization](https://arxiv.org/abs/1502.01852) methods: `kaiming_uniform` and `kaiming_normal` [https://github.com/FluxML/Flux.jl/pull/1243]
* Change to `DataLoader`'s constructor [https://github.com/FluxML/Flux.jl/pull/1152]
* Use `DataLoader` with `NamedTuple`s, so that tensors can be accessed by name [https://github.com/FluxML/Flux.jl/pull/1221].
* Error if Dense layers weights and biases are not arrays [https://github.com/FluxML/Flux.jl/pull/1218].
* Add `Adaptive Pooling` in Flux layers [https://github.com/FluxML/Flux.jl/pull/1239].
* Optimistic ADAM (OADAM) optimizer for adversarial training [https://github.com/FluxML/Flux.jl/pull/1246].

# v0.10.5
* Add [kaiming initialization](https://arxiv.org/abs/1502.01852) methods: [kaiming_uniform and kaiming_normal](https://github.com/FluxML/Flux.jl/pull/1243)
* Use `DataLoader` with `NamedTuple`s, so that tensors can be accessed [by name](https://github.com/FluxML/Flux.jl/pull/1221).
* Error if Dense layers weights and biases are [not arrays](https://github.com/FluxML/Flux.jl/pull/1218).
* Add [Adaptive Pooling](https://github.com/FluxML/Flux.jl/pull/1239) in Flux layers.
* Change to `DataLoader`'s [constructor](https://github.com/FluxML/Flux.jl/pull/1152)
* Uniform loss [interface](https://github.com/FluxML/Flux.jl/pull/1150)
* Optimistic ADAM (OADAM) optimizer for [adversarial training](https://github.com/FluxML/Flux.jl/pull/1246).
* Add option for [same padding](https://github.com/FluxML/Flux.jl/pull/901) to conv and pooling layers by setting `pad=SamePad()`.
* Added option to set `bias` to [Flux.Zeros](https://github.com/FluxML/Flux.jl/pull/873) to exclude `bias` from being trained.
* Added `GlobalMaxPool` and `GlobalMeanPool` [layers](https://github.com/FluxML/Flux.jl/pull/950) for performing global pooling operations.
@@ -16,21 +16,28 @@
* Testing suite improvements now test for gradients of all layers along with GPU support.
* Functors have now moved to [Functors.jl](https://github.com/FluxML/Flux.jl/pull/1174) to allow for their use outside of Flux.
* Added [helper functions](https://github.com/FluxML/Flux.jl/pull/873) `Flux.convfilter` and `Flux.depthwiseconvfilter` to construct weight arrays for convolutions outside of layer constructors so as to not have to depend on the default layers for custom implementations.
* and many more fixes and additions...

# v0.10.1 - v0.10.4

See GitHub's releases.

# v0.10.0

* The default AD engine has switched from [Tracker to Zygote.jl](https://github.com/FluxML/Flux.jl/pull/669)
- The dependency on Tracker.jl has been removed.
- This means Flux now does not depend on using a specialised `TrackedArray` type, and can be used with normal Array implementations directly.
- Tracker compatibility is maintained in most common cases, but Zygote will be the preferred AD backend for Flux from now on.
* The CUDNN wrappers have been [moved from Flux into CuArrays](https://github.com/FluxML/Flux.jl/pull/874), to allow for better supporting the CUDA backend, and improve user experience, not to mention making Flux lean.
* `*crossentropy` functions now [work as expected with CuArrays](https://github.com/FluxML/Flux.jl/pull/926). [PR for binarycrossentropy](https://github.com/FluxML/Flux.jl/pull/940).
* `*crossentropy` functions now [work as expected with CuArrays](https://github.com/FluxML/Flux.jl/pull/926). [PR for bce_loss](https://github.com/FluxML/Flux.jl/pull/940).
* Added [clearer docs](https://github.com/FluxML/Flux.jl/pull/904) around training and the Optimiser interface.
* [Layer initialisations](https://github.com/FluxML/Flux.jl/pull/937) have been improved with a clearer API on how to extend it for other purposes.
* [Better messaging around CUDA availability](https://github.com/FluxML/Flux.jl/pull/924), with hooks to initialize the GPU as default where possible.
* `@treelike` has been formalised as a [functor](https://github.com/FluxML/Flux.jl/pull/865), with an effective deprecation.
* `testmode!` is deprecated in favour of [istraining](https://github.com/FluxML/Flux.jl/pull/669)

# v0.9.0

* [Depthwise convolutional layer API changes](https://github.com/FluxML/Flux.jl/pull/756) from `in => mult` channel specification to `in => out` channel specification, and deprecates implicit `out` constructor.
* New [SkipConnection](https://github.com/FluxML/Flux.jl/pull/446), which can be used to train residual neural network architectures.
* New [RADAM](https://github.com/FluxML/Flux.jl/pull/842) optimiser.
3 changes: 2 additions & 1 deletion docs/make.jl
@@ -8,8 +8,9 @@ makedocs(modules=[Flux, NNlib],
"Building Models" =>
["Basics" => "models/basics.md",
"Recurrence" => "models/recurrence.md",
"Regularisation" => "models/regularisation.md",
"Model Reference" => "models/layers.md",
"Loss Functions" => "models/losses.md",
"Regularisation" => "models/regularisation.md",
"Advanced Model Building" => "models/advanced.md",
"NNlib" => "models/nnlib.md"],
"Handling Data" =>
20 changes: 1 addition & 19 deletions docs/src/models/layers.md
@@ -73,22 +73,4 @@ Many normalisation layers behave differently under training and inference (testing)
```@docs
Flux.testmode!
trainmode!
```

## Cost Functions
```@docs
Flux.mae
Flux.mse
Flux.msle
Flux.huber_loss
Flux.crossentropy
Flux.logitcrossentropy
Flux.binarycrossentropy
Flux.logitbinarycrossentropy
Flux.kldivergence
Flux.poisson
Flux.hinge
Flux.squared_hinge
Flux.dice_coeff_loss
Flux.tversky_loss
```
```
40 changes: 40 additions & 0 deletions docs/src/models/losses.md
@@ -0,0 +1,40 @@
## Loss Functions

Flux provides a large number of common loss functions used for training machine learning models.

Loss functions for supervised learning typically expect as inputs a target `y` and a prediction `ŷ`.
In Flux's convention, the order of the arguments is the following:

```julia
loss(ŷ, y)
```

Most loss functions in Flux have an optional argument `agg`, denoting the type of aggregation performed over the
batch:

```julia
loss(ŷ, y) # defaults to `mean`
loss(ŷ, y, agg=sum) # use `sum` for reduction
loss(ŷ, y, agg=x->sum(x, dims=2)) # partial reduction
loss(ŷ, y, agg=x->mean(w .* x)) # weighted mean
loss(ŷ, y, agg=identity) # no aggregation.
```
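
As a concrete sketch (values are random and purely illustrative), the same pattern with `mse`:

```julia
using Statistics: mean
using Flux: mse

ŷ, y = rand(Float32, 5, 4), rand(Float32, 5, 4)  # 4 samples with 5 outputs each

mse(ŷ, y)                            # scalar: mean over all elements (default)
mse(ŷ, y, agg=sum)                   # scalar: total squared error
mse(ŷ, y, agg=x -> mean(x, dims=1))  # 1×4 row: one loss per sample
```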

### Losses Reference

```@docs
Flux.mae
Flux.mse
Flux.msle
Flux.huber_loss
Flux.crossentropy
Flux.logitcrossentropy
Flux.bce_loss
Flux.logitbce_loss
Flux.kldivergence
Flux.poisson_loss
Flux.hinge_loss
Flux.squared_hinge_loss
Flux.dice_coeff_loss
Flux.tversky_loss
```
17 changes: 9 additions & 8 deletions docs/src/models/regularisation.md
@@ -7,9 +7,10 @@ add the result to the overall loss.
For example, say we have a simple regression.

```julia
using Flux: crossentropy
using Flux
using Flux: logitcrossentropy
m = Dense(10, 5)
loss(x, y) = crossentropy(softmax(m(x)), y)
loss(x, y) = logitcrossentropy(m(x), y)
```

We can regularise this by taking the (L2) norm of the parameters, `m.W` and `m.b`.
@@ -18,19 +19,19 @@ We can regularise this by taking the (L2) norm of the parameters, `m.W` and `m.b`.
using LinearAlgebra

penalty() = norm(m.W) + norm(m.b)
loss(x, y) = crossentropy(softmax(m(x)), y) + penalty()
loss(x, y) = logitcrossentropy(m(x), y) + penalty()
```

When working with layers, Flux provides the `params` function to grab all
parameters at once. We can easily penalise everything with `sum(norm, params)`.
parameters at once. We can easily penalise everything with `sum`:

```julia
julia> params(m)
julia> Flux.params(m)
2-element Array{Any,1}:
param([0.355408 0.533092; 0.430459 0.171498])
param([0.0, 0.0, 0.0, 0.0, 0.0])

julia> sum(norm, params(m))
julia> sum(norm, Flux.params(m))
26.01749952921026
```

@@ -40,9 +41,9 @@ Here's a larger example with a multi-layer perceptron.
m = Chain(
Dense(28^2, 128, relu),
Dense(128, 32, relu),
Dense(32, 10), softmax)
Dense(32, 10))

loss(x, y) = crossentropy(m(x), y) + sum(norm, params(m))
loss(x, y) = logitcrossentropy(m(x), y) + sum(norm, Flux.params(m))

loss(rand(28^2), rand(10))
```
1 change: 1 addition & 0 deletions src/Flux.jl
@@ -35,6 +35,7 @@ include("onehot.jl")
include("functor.jl")

include("layers/stateless.jl")
include("layers/losses.jl")
include("layers/basic.jl")
include("layers/conv.jl")
include("layers/recurrent.jl")
9 changes: 7 additions & 2 deletions src/deprecations.jl
@@ -1,2 +1,7 @@
@deprecate param(x) x
@deprecate data(x) x
# v0.11 deprecations
@deprecate poisson poisson_loss
@deprecate hinge hinge_loss
@deprecate squared_hinge squared_hinge_loss
@deprecate binarycrossentropy(ŷ, y) bce_loss(ŷ, y, agg=identity)
@deprecate logitbinarycrossentropy(ŷ, y) logitbce_loss(ŷ, y, agg=identity)
@deprecate normalise(x) normalise(x, dims=1)
2 changes: 1 addition & 1 deletion src/layers/basic.jl
@@ -100,7 +100,7 @@ Dense(5, 2)
julia> d(rand(5))
2-element Array{Float32,1}:
-0.16210233
0.12311903
0.123119034
```
"""
struct Dense{F,S<:AbstractArray,T<:AbstractArray}
Empty file added src/layers/losses.jl
Empty file.
2 changes: 1 addition & 1 deletion src/layers/normalise.jl
@@ -113,7 +113,7 @@ LayerNorm(h::Integer) =

@functor LayerNorm

(a::LayerNorm)(x) = a.diag(normalise(x))
(a::LayerNorm)(x) = a.diag(normalise(x, dims=1))

function Base.show(io::IO, l::LayerNorm)
print(io, "LayerNorm(", length(l.diag.α), ")")