This repository was archived by the owner on Jul 1, 2023. It is now read-only.

[WIP]: Idea exploring easier layer init. #883

Closed · wants to merge 1 commit

Conversation

@saeta (Contributor) commented Apr 24, 2020

The problem: when constructing a model, you often need to do some calculations
"out-of-band" / "by hand" to initialize the layers in the model correctly.
For example, in building a simple convolutional network with a dense layer on
top, you need to keep track of how the image shape changes through the
convolutions to ensure you set the input size of the dense layer correctly. If
it is set incorrectly, you get an immediate shape mismatch.

The proposed solution: layers are given an extra "shape-based initializer" that
allows the layer's initializer to propagate shape information forward.

Alternatives considered: a number of libraries (e.g. Keras, Haiku) don't
initialize the parameters of layers until the first run through. This is
cumbersome in Swift, as it would require `func callAsFunction` to be marked
as `mutating`, or for layers to become `class`es.

Problems of the current design: I just put this together quickly and
specialized everything to `Scalar == Float`. This is obviously suboptimal.

Extensions:
 - *No-op device*: Right now, we perform computations on zeros, just to avoid
   writing duplicated shape-propagation rules. Instead, if we had a no-op
   device, then we could trivially reuse the layer implementations. (Or
   alternatively, if we somehow made `func callAsFunction` generic over the
   Tensor type.)
 - *VMap*: Proper vmap support would simplify these shape-heavy computations.
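The no-op-device idea can be illustrated outside Swift: a tensor stand-in that carries only a shape, so a layer's existing forward code doubles as its shape-transfer function. A toy Python sketch (all names here are hypothetical, not part of any library):

```python
# Toy "shape-only tensor": ops compute result shapes but no values,
# so forward code can be reused for shape propagation with no real work.
class ShapeTensor:
    def __init__(self, shape):
        self.shape = tuple(shape)

    def matmul(self, other):
        assert self.shape[-1] == other.shape[-2], "inner dimensions must agree"
        return ShapeTensor(self.shape[:-1] + other.shape[-1:])

    def __add__(self, other):  # assume simple bias-style broadcasting
        assert self.shape[-1] == other.shape[-1]
        return ShapeTensor(self.shape)

def dense_forward(x, weight, bias):
    # The same code path a real Dense layer would run on real tensors.
    return x.matmul(weight) + bias

out = dense_forward(ShapeTensor((32, 400)), ShapeTensor((400, 120)), ShapeTensor((120,)))
print(out.shape)  # (32, 120)
```

Running the layer on such values propagates shapes "for free", which is what a true no-op device would give without even this wrapper type.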
@saeta marked this pull request as draft April 24, 2020 20:36
@dabrahams (Contributor)

Initializing parameters on first run can be done without making layers classes or making them appear to be non-pure functions (i.e., marking `callAsFunction` as `mutating`), but you would have to put the mutation somewhere other than in the directly stored properties of the layer, such as an attached class instance. I'm not sure why that would be the wrong answer for us.

I am a little weak conceptually; why should one have to "set the input size" on the Dense layer, as opposed to simply having it come from the size of the input that is actually passed to it?

@saeta (Contributor, Author) commented Apr 24, 2020

> I am a little weak conceptually; why should one have to "set the input size" on the Dense layer, as opposed to simply having it come from the size of the input that is actually passed to it?

Here's what we do today:

```swift
var classifier = Sequential {
    Conv2D<Float>(filterShape: (5, 5, 1, 6), padding: .same, activation: relu)
    AvgPool2D<Float>(poolSize: (2, 2), strides: (2, 2))
    Conv2D<Float>(filterShape: (5, 5, 6, 16), activation: relu)
    AvgPool2D<Float>(poolSize: (2, 2), strides: (2, 2))
    Flatten<Float>()
    Dense<Float>(inputSize: 400, outputSize: 120, activation: relu)
    Dense<Float>(inputSize: 120, outputSize: 84, activation: relu)
    Dense<Float>(inputSize: 84, outputSize: 10)
}
```

In particular, we know a priori that, given our known (fixed) image input size, after the first five layers (the Conv2Ds, AvgPool2Ds, and Flatten) we will have tensors of shape `[batchSize, 400]`. Further, while `outputSize` is an orthogonal hyperparameter, each subsequent `inputSize` (e.g. 120 and 84) is just "copied" from the previous layer's `outputSize`. Does that help clarify?
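The 400 can be derived mechanically. A minimal sketch of that by-hand calculation, assuming the usual 28×28 single-channel (MNIST-style) input for this model (the helper functions are illustrative, not library code):

```python
# Propagate an (H, W, C) shape through the first five layers by hand.
def conv2d(shape, kernel, out_channels, same_padding=False):
    h, w, _ = shape
    if same_padding:
        return (h, w, out_channels)
    return (h - kernel + 1, w - kernel + 1, out_channels)  # 'valid' padding

def avg_pool(shape, stride):
    h, w, c = shape
    return (h // stride, w // stride, c)

shape = (28, 28, 1)                            # assumed input image
shape = conv2d(shape, 5, 6, same_padding=True) # (28, 28, 6)
shape = avg_pool(shape, 2)                     # (14, 14, 6)
shape = conv2d(shape, 5, 16)                   # (10, 10, 16)
shape = avg_pool(shape, 2)                     # (5, 5, 16)
flattened = shape[0] * shape[1] * shape[2]
print(flattened)  # 400
```

This is exactly the bookkeeping the proposed shape-based initializers would do for you.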

@dan-zheng (Member) left a comment


To support Keras-style Sequential shape inference, I think a layer protocol requirement for shape propagation is also necessary?

```swift
protocol Layer: Differentiable {
  associatedtype Input: Differentiable
  associatedtype Output: Differentiable
  func callAsFunction(_ input: Input) -> Output

  /// Returns the output shape of this layer given an input shape.
  ///
  /// This implements shape propagation.
  func outputShape(for inputShape: TensorShape) -> TensorShape
}
```

I'm not sure it's easy for layers to implement shape propagation unless shape propagation is also defined for primitive tensor operations (like matmul and conv2d).

@saeta (Contributor, Author) commented Apr 24, 2020

> I'm not sure it's easy for layers to implement shape propagation unless shape propagation is also defined for primitive tensor operations (like matmul and conv2d).

Right, I think in the design you proposed, it's not so easy. My goal with exploring this direction is to re-use the implicit shape transfer functions inherent in the tensor operations themselves. This approach also conveniently avoids re-writing the logic from `callAsFunction` inside `outputShape(for:)`. Does that make sense?

@dan-zheng (Member) commented Apr 26, 2020

> My goal with exploring this direction is to re-use the implicit shape transfer functions inherent in the tensor operations themselves. This approach also conveniently avoids needing to re-write the logic inside callAsFunction inside outputShape(for:) as well. Does that make sense?

I thought about this a bit. I think this PR (adding layer initialization based on hyperparameters and input shapes) is orthogonal to shape propagation. But shape propagation still seems necessary to implement Keras-style shape-inferring Sequential.


I wanted to explore shape propagation a bit further, so I ended up implementing a shape-inferring Sequential! Here's a Gist:

```swift
let input = Tensor<Float>(randomNormal: [10000, 784])
let model = Sequential(inputShape: input.shape) {
  Dense<Float>.make(.init(outputSize: 784))
  Dense<Float>.make(.init(outputSize: 400, useBias: true))
  Dense<Float>.make(.init(outputSize: 100))
  Dense<Float>.make(.init(outputSize: 10, activation: relu))
}
print(model(input).shape) // [10000, 10]
```

The Gist adds layer initialization based on hyperparameters and input shapes too. But layer initializers are curried, unlike the approach in this PR:

```swift
// Curried:
(Layer.Hyperparameters) -> (Layer.Input.Shape) -> Layer
// Uncurried:
(Layer.Hyperparameters, Layer.Input.Shape) -> Layer
```

Curried initializers seem necessary to easily support a `ShapedLayerBuilder` function builder for `Sequential`: each value in the `Sequential` trailing closure effectively has type `(Layer.Input.Shape) -> Layer`.

I wonder what others think about the Gist's approach? I could turn it into a separate issue or PR for discussion. The use case of "shape-inferring Sequential" influences the design of layer initialization.
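The curried scheme can be sketched in a few lines of Python, which may make the mechanics clearer: each entry supplies only hyperparameters and returns a `(input_shape) -> layer` function, and `Sequential` threads the shape through. The names below are illustrative, not the Gist's actual API:

```python
# Toy shape-inferring Sequential built from curried layer makers.
class Dense:
    def __init__(self, input_size, output_size):
        self.input_size, self.output_size = input_size, output_size
    def output_shape(self, input_shape):
        return input_shape[:-1] + (self.output_size,)

def make_dense(output_size):        # (hyperparameters) ->
    def from_shape(input_shape):    #   (input shape) -> layer
        return Dense(input_shape[-1], output_size)
    return from_shape

def sequential(input_shape, *makers):
    layers, shape = [], input_shape
    for make in makers:
        layer = make(shape)             # inputSize inferred from `shape`
        shape = layer.output_shape(shape)
        layers.append(layer)
    return layers, shape

layers, out = sequential((10000, 784), make_dense(400), make_dense(10))
print(out)  # (10000, 10)
```

The key point is that `make_dense(400)` closes over only the hyperparameter, so the builder can defer the shape-dependent half of initialization until the input shape is known.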

@dabrahams (Contributor) commented Apr 26, 2020

> Does that help clarify?

@saeta I'm afraid not. I see no reason the same model couldn't be written like this, given the right library pieces (I'm leaving out other niceties, like avoiding writing <Float> everywhere, for now):

```swift
var classifier = Sequential {
    Conv2D<Float>(filterShape: (5, 5, 1, 6), padding: .same, activation: relu)
    AvgPool2D<Float>(poolSize: (2, 2), strides: (2, 2))
    Conv2D<Float>(filterShape: (5, 5, 6, 16), activation: relu)
    AvgPool2D<Float>(poolSize: (2, 2), strides: (2, 2))
    Flatten<Float>()
    Dense<Float>(outputSize: 120, activation: relu)
    Dense<Float>(outputSize: 84, activation: relu)
    Dense<Float>(outputSize: 10)
}
```

IIUC this is essentially what @dan-zheng's gist proves out, with different syntax. Am I missing something?

@dabrahams (Contributor)

@dan-zheng It has always seemed clear to me that everything (not just Sequential) should be “shape-inferring.”

@saeta (Contributor, Author) commented Apr 27, 2020

> (I'm leaving out other niceties, like avoiding writing `<Float>` everywhere, for now)

I've been meaning to play around more based on your suggestion a while back. I put together a quick PR: tensorflow/swift-models#465. It appears that this trick doesn't work cross-module for some reason. Do you know what I'm doing wrong, or should these be filed as issues?

@dabrahams (Contributor)

@saeta Can you be specific about what doesn't work, or tell me how to use your PR to reproduce a compilation failure?

@brettkoonce (Contributor)

@dabrahams believe this is the error in question: tensorflow/swift-models#465 (comment)

@saeta (Contributor, Author) commented Apr 27, 2020

> Can you be specific about what doesn't work, or tell me how to use your PR to reproduce a compilation failure?

> believe this is the error in question: tensorflow/swift-models#465

Yup, @brettkoonce is exactly right. I played around with it more today, and it turns out that this was a SwiftPM non-determinism bug. I minimized it down to a trivial example: https://bugs.swift.org/browse/SR-12688

@dabrahams: You alluded to "given the right library pieces"; do you think you could share a bit of what you have in mind? (Either in textual or executable form?) :-)

@dabrahams (Contributor) commented Apr 28, 2020

@saeta I don't have anything very specific in mind. Maybe it's just a matter of perspective. If you view the model code as the use of an EDSL for describing the computation it performs, and there are places where the output shapes can be deduced at runtime from input shapes, it's clear you don't need to specify them in the model. The problem of translating a given specification (model) into code/data structures for actually performing the computation is separable, and it is not bounded, for example, by the idea that constructing the thing called `Conv2D` above has to allocate the memory that the corresponding part of the model needs.

@shabalind (Contributor)

We are closing this one for now!

@shabalind closed this Jun 10, 2020
@saeta (Contributor, Author) commented Jun 10, 2020

Tagging @shadaj who will be pushing on this.

@saeta deleted the easy-layer-init branch June 10, 2020 17:17
@saeta (Contributor, Author) commented Jun 10, 2020

For those following along at home, check out @shadaj 's work in: tensorflow/swift-models#584
