diff --git a/docs/src/index.md b/docs/src/index.md
index 0408c994f3..76e29bc2a7 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -12,7 +12,7 @@ Download [Julia 1.6](https://julialang.org/downloads/) or later, preferably the
 This will automatically install several other packages, including [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) which supports Nvidia GPUs. To directly access some of its functionality, you may want to add `] add CUDA` too. The page on [GPU support](gpu.md) has more details.

-Other closely associated packages include [Zygote.jl](https://github.com/FluxML/Zygote.jl), [Optimisers.jl](https://github.com/FluxML/Optimisers.jl), [NNlib.jl](https://github.com/FluxML/NNlib.jl), [Functors.jl](https://github.com/FluxML/Functors.jl) and [MLUtils.jl](https://github.com/JuliaML/MLUtils.jl).
+Other closely associated packages, which are also installed, are [Zygote.jl](https://github.com/FluxML/Zygote.jl), [Optimisers.jl](https://github.com/FluxML/Optimisers.jl), [NNlib.jl](https://github.com/FluxML/NNlib.jl), [Functors.jl](https://github.com/FluxML/Functors.jl) and [MLUtils.jl](https://github.com/JuliaML/MLUtils.jl).

 ## Learning Flux

@@ -26,4 +26,4 @@ If you just want to get started writing models, the [model zoo](https://github.c

 Everyone is welcome to join our community on the [Julia discourse forum](https://discourse.julialang.org/), or the [slack chat](https://discourse.julialang.org/t/announcing-a-julia-slack/4866) (channel #machine-learning). If you have questions or issues we'll try to help you out.

-If you're interested in hacking on Flux, the [source code](https://github.com/FluxML/Flux.jl) is open and easy to understand -- it's all just the same Julia code you work with normally. You might be interested in our [intro issues](https://github.com/FluxML/Flux.jl/labels/good%20first%20issue) to get started or our [contributing guide](https://github.com/FluxML/Flux.jl/blob/master/CONTRIBUTING.md).
+If you're interested in hacking on Flux, the [source code](https://github.com/FluxML/Flux.jl) is open and easy to understand -- it's all just the same Julia code you work with normally. You might be interested in our [intro issues](https://github.com/FluxML/Flux.jl/labels/good%20first%20issue) to get started, or our [contributing guide](https://github.com/FluxML/Flux.jl/blob/master/CONTRIBUTING.md).
diff --git a/docs/src/models/basics.md b/docs/src/models/basics.md
index 02385a4e0b..f884598c8e 100644
--- a/docs/src/models/basics.md
+++ b/docs/src/models/basics.md
@@ -211,7 +211,7 @@ m = Chain(x -> x^2, x -> x+1)
 m(5) # => 26
 ```

-## Layer helpers
+## Layer Helpers

 Flux provides a set of helpers for custom layers, which you can enable by calling

diff --git a/docs/src/models/overview.md b/docs/src/models/overview.md
index 9de9a463c8..942ff244c1 100644
--- a/docs/src/models/overview.md
+++ b/docs/src/models/overview.md
@@ -11,7 +11,7 @@ Under the hood, Flux uses a technique called automatic differentiation to take g

 Here's how you'd use Flux to build and train the most basic of models, step by step.

-## Make a Trivial Prediction
+### A Trivial Prediction

 This example will predict the output of the function `4x + 2`. Making such predictions is called "linear regression", and is really too simple to *need* a neural network. But it's a nice toy example.

@@ -26,7 +26,7 @@ actual (generic function with 1 method)
 ```

 This example will build a model to approximate the `actual` function.

-## Provide Training and Test Data
+## 1. Provide Training and Test Data

 Use the `actual` function to build sets of data for training and verification:

@@ -40,7 +40,7 @@ julia> y_train, y_test = actual.(x_train), actual.(x_test)

 Normally, your training and test data come from real world observations, but here we simulate them.

-## Build a Model to Make Predictions
+## 2. Build a Model to Make Predictions

 Now, build a model to make predictions with `1` input and `1` output:

@@ -85,7 +85,7 @@ julia> loss(x_train, y_train)

 More accurate predictions will yield a lower loss. You can write your own loss functions or rely on those already provided by Flux. This loss function is called [mean squared error](https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/mean-squared-error/). Flux works by iteratively reducing the loss through *training*.

-## Improve the Prediction
+## 3. Improve the Prediction

 Under the hood, the Flux [`Flux.train!`](@ref) function uses *a loss function* and *training data* to improve the *parameters* of your model based on a pluggable [`optimiser`](../training/optimisers.md):

@@ -150,7 +150,7 @@ Params([Float32[7.5777884], Float32[1.9466728]])

 The parameters have changed. This single step is the essence of machine learning.

-## Iteratively Train the Model
+## 3+. Iteratively Train the Model

 In the previous section, we made a single call to `train!` which iterates over the data we passed in just once. An *epoch* refers to one pass over the dataset. Typically, we will run the training for multiple epochs to drive the loss down even further. Let's run it a few more times:

@@ -168,7 +168,7 @@ Params([Float32[4.0178537], Float32[2.0050256]])

 After 200 training steps, the loss went down, and the parameters are getting close to those in the function the model is built to predict.

-## Verify the Results
+## 4. Verify the Results

 Now, let's verify the predictions:
diff --git a/docs/src/models/quickstart.md b/docs/src/models/quickstart.md
index eeae21b040..d784713bcb 100644
--- a/docs/src/models/quickstart.md
+++ b/docs/src/models/quickstart.md
@@ -26,8 +26,8 @@ mat = Flux.onehotbatch(truth, [true, false]) # 2×1000 OneH
 data = Flux.DataLoader((noisy, mat), batchsize=64, shuffle=true);
 first(data) .|> summary # ("2×64 Matrix{Float32}", "2×64 Matrix{Bool}")

-pars = Flux.params(model)
-opt = Flux.Adam(0.01)
+pars = Flux.params(model) # contains references to arrays in model
+opt = Flux.Adam(0.01) # will store optimiser momentum etc.

 # Training loop, using whole data set 1000 times:
 for epoch in 1:1_000
@@ -39,6 +39,7 @@
 end

 pars # has changed!
+opt
 out2 = model(noisy)
 mean((out2[1,:] .> 0.5) .== truth)  # accuracy 94% so far!
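
Note on the quickstart hunks above: the body of the `for epoch in 1:1_000` loop sits outside the diff context, so it is not shown here. The sketch below is only an illustration of how `pars`, `opt` and `data` typically fit together in an implicit-parameters Flux training loop; it assumes the `model`, `data`, `pars` and `opt` from quickstart.md and uses `Flux.crossentropy` as a stand-in loss. It is not the exact contents of that file.

```julia
using Flux

# Illustrative only -- `model`, `data`, `pars` and `opt` are assumed from quickstart.md.
for epoch in 1:1_000
    for (x, y) in data                      # one minibatch from the DataLoader
        grads = Flux.gradient(pars) do      # gradients w.r.t. the arrays referenced by `pars`
            Flux.crossentropy(model(x), y)  # stand-in loss; the actual page may use another
        end
        Flux.update!(opt, pars, grads)      # one Adam step; also updates the state held in `opt`
    end
end
```

This is why the hunk's new comments make sense: `pars` holds references to the model's arrays, so `update!` mutates the model in place, and `opt` accumulates the optimiser's momentum-style state across steps, which is why printing `opt` after training shows that it, too, has changed.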