[WIP]: Idea exploring easier layer init. #883
Conversation
The problem: when constructing a model, you often need to do some calculations "out-of-band" / "by hand" to initialize the layers in the model correctly. For example: in building a simple convolutional network with a dense layer on top, you need to keep track of how the image shape changes through the convolutions to ensure you set the input size of the dense layer correctly. If it is set incorrectly, you get an immediate shape mismatch. (A concrete sketch follows below.)

The proposed solution: layers are given an extra "shape-based initializer" that allows the layer's initializer to propagate shape information forward.

Alternatives considered: a number of libraries (e.g. Keras, Haiku) don't initialize the parameters of layers until the first run through. This is cumbersome in Swift, as that would require `func callAsFunction` to be marked as `mutating`, or for layers to become `class`es.

Problems of the current design: I just put this together quickly and specialized everything to `Scalar == Float`. This is obviously suboptimal.

Extensions:

- *No-op device*: Right now, we perform computations on zeros, just to avoid writing duplicated shape-propagation rules. Instead, if we had a no-op device, then we could trivially reuse the layer implementations. (Or alternatively, if we somehow made `func callAsFunction` generic over the `Tensor` type.)
- *VMap*: Proper vmap support would simplify these shape-heavy computations.
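To make the problem concrete, here is a sketch of the hand bookkeeping involved, written against the current swift-apis layer types (the per-layer shape comments are the "out-of-band" arithmetic this PR wants to eliminate):

```swift
import TensorFlow

// LeNet-style network on 28x28x1 inputs. The Dense layer's inputSize
// (5 * 5 * 16 = 400) must be derived by hand from the conv/pool arithmetic;
// getting it wrong surfaces only as a runtime shape mismatch.
struct Classifier: Layer {
    var conv1 = Conv2D<Float>(filterShape: (5, 5, 1, 6), padding: .same, activation: relu) // -> 28x28x6
    var pool1 = AvgPool2D<Float>(poolSize: (2, 2), strides: (2, 2))                        // -> 14x14x6
    var conv2 = Conv2D<Float>(filterShape: (5, 5, 6, 16), activation: relu)                // -> 10x10x16
    var pool2 = AvgPool2D<Float>(poolSize: (2, 2), strides: (2, 2))                        // -> 5x5x16
    var flatten = Flatten<Float>()                                                         // -> 400
    var dense = Dense<Float>(inputSize: 5 * 5 * 16, outputSize: 10)                        // hand-computed

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        return input.sequenced(through: conv1, pool1, conv2, pool2, flatten, dense)
    }
}
```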
Initializing parameters on first run can be done without making layers classes or making them appear to be non-pure functions (i.e., marking `callAsFunction` as `mutating`), but you would have to put the mutation somewhere other than in the directly stored properties of the layer, like an attached class instance (sketched below). I'm not sure why that would be the wrong answer for us. I am a little weak conceptually here: why should one have to "set the input size" on the Dense layer, as opposed to simply having it come from the size of the input that is actually passed to it?
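A minimal sketch of that idea, using a hypothetical `LazyDense` (and ignoring differentiability, training, and thread safety): the layer stays a value type and `callAsFunction` stays non-`mutating`, because the parameters live in an attached class instance and are materialized from the first input's shape:

```swift
import TensorFlow

// Hypothetical reference-type box holding lazily created parameters.
final class ParameterBox<T> {
    var value: T? = nil
}

struct LazyDense {
    let outputSize: Int
    let storage = ParameterBox<(weight: Tensor<Float>, bias: Tensor<Float>)>()

    init(outputSize: Int) { self.outputSize = outputSize }

    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        // Create the parameters on first use, sized from the actual input.
        if storage.value == nil {
            let inputSize = input.shape[input.rank - 1]
            storage.value = (
                weight: Tensor(glorotUniform: [inputSize, outputSize]),
                bias: Tensor(zeros: [outputSize]))
        }
        let (weight, bias) = storage.value!
        return matmul(input, weight) + bias
    }
}
```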
In particular, we know (a priori) that after the first 5 layers (`Conv2D`s, `AvgPool2D`s, and `Flatten`), given our known (fixed) image input size, we will have tensors of shape […]
To support Keras-style `Sequential` shape inference, I think a layer protocol requirement for shape propagation is also necessary?
```swift
protocol Layer: Differentiable {
    associatedtype Input: Differentiable
    associatedtype Output: Differentiable

    func callAsFunction(_ input: Input) -> Output

    /// Returns the output shape of this layer given an input shape.
    ///
    /// This implements shape propagation.
    func outputShape(for inputShape: TensorShape) -> TensorShape
}
```
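For illustration (my sketch, not something in this PR), `Dense` could satisfy that requirement easily, since its rule is just "replace the last dimension with the output size", which can be read off the stored weight matrix:

```swift
extension Dense {
    /// Sketch: a dense layer maps [..., inputSize] to [..., outputSize].
    func outputShape(for inputShape: TensorShape) -> TensorShape {
        var shape = inputShape
        shape[shape.rank - 1] = weight.shape[1]  // outputSize
        return shape
    }
}
```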
I'm not sure it's easy for layers to implement shape propagation unless shape propagation is also defined for primitive tensor operations (like `matmul` and `conv2d`).
Right, I think in the design you proposed, it's not so easy. My goal with exploring this direction is to re-use the implicit shape transfer functions inherent in the tensor operations themselves (see the sketch below). This approach also conveniently avoids needing to re-write the logic inside […]
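Concretely (my paraphrase as a sketch, not the PR's exact code), the trick is to run the layer on a zero-filled placeholder and read the shape off the result; this is the computation that the "no-op device" extension mentioned in the PR description would make free:

```swift
import TensorFlow

/// Infers a layer's output shape by reusing the shape-transfer behavior of
/// the tensor ops the layer already performs, at the cost of computing zeros.
func inferredOutputShape<L: Layer>(
    of layer: L, forInput inputShape: TensorShape
) -> TensorShape where L.Input == Tensor<Float>, L.Output == Tensor<Float> {
    let placeholder = Tensor<Float>(zeros: inputShape)
    return layer(placeholder).shape
}

// Usage: the dense layer's input size falls out of the conv stack's output:
// inferredOutputShape(of: convStack, forInput: [1, 28, 28, 1])
```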
I thought about this a bit. I think this PR (adding layer initialization based on hyperparameters and input shapes) is orthogonal to shape propagation. But shape propagation still seems necessary to implement a Keras-style shape-inferring `Sequential`.

I wanted to explore shape propagation a bit further, so I ended up implementing a shape-inferring `Sequential` in a Gist:

```swift
let input = Tensor<Float>(randomNormal: [10000, 784])
let model = Sequential(inputShape: input.shape) {
    Dense<Float>.make(.init(outputSize: 784))
    Dense<Float>.make(.init(outputSize: 400, useBias: true))
    Dense<Float>.make(.init(outputSize: 100))
    Dense<Float>.make(.init(outputSize: 10, activation: relu))
}
print(model(input).shape) // [10000, 10]
```

The Gist adds layer initialization based on hyperparameters and input shapes too. But layer initializers are curried, unlike the approach in this PR:

```swift
// Curried:
(Layer.Hyperparameters) -> (Layer.Input.Shape) -> Layer
// Uncurried:
(Layer.Hyperparameters, Layer.Input.Shape) -> Layer
```

Curried initializers seem necessary to easily support a shape-inferring `Sequential`. I wonder what others think about the Gist's approach? I could turn it into a separate issue or PR for discussion. The use case of a "shape-inferring `Sequential`" […]
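To spell out why currying helps (illustrative names below, not the Gist's exact API): partially applying the hyperparameters yields a `(TensorShape) -> Layer` function, so a shape-inferring `Sequential` can collect these functions first and call each one with the output shape inferred from the preceding layer:

```swift
import TensorFlow

// Hypothetical hyperparameters for a dense layer.
struct DenseHyperparameters {
    var outputSize: Int
}

// Curried initializer: hyperparameters now, input shape later.
func makeDense(
    _ hyperparameters: DenseHyperparameters
) -> (TensorShape) -> Dense<Float> {
    return { inputShape in
        Dense<Float>(
            inputSize: inputShape[inputShape.rank - 1],
            outputSize: hyperparameters.outputSize)
    }
}

// Usage: the shape is supplied by the Sequential builder, not the user.
let pending = makeDense(DenseHyperparameters(outputSize: 10))
let dense = pending([10000, 784])  // inputSize inferred as 784
```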
@saeta I'm afraid not. I see no reason the same model couldn't be written like this, given the right library pieces (I'm leaving out other niceties, like avoiding writing […]):

```swift
var classifier = Sequential {
    Conv2D<Float>(filterShape: (5, 5, 1, 6), padding: .same, activation: relu)
    AvgPool2D<Float>(poolSize: (2, 2), strides: (2, 2))
    Conv2D<Float>(filterShape: (5, 5, 6, 16), activation: relu)
    AvgPool2D<Float>(poolSize: (2, 2), strides: (2, 2))
    Flatten<Float>()
    Dense<Float>(outputSize: 120, activation: relu)
    Dense<Float>(outputSize: 84, activation: relu)
    Dense<Float>(outputSize: 10)
}
```

IIUC this is essentially what @dan-zheng's gist proves out, with different syntax. Am I missing something?
@dan-zheng It has always seemed clear to me that everything (not just […]
I've been meaning to play around more based on your suggestion a while back. I put together a quick PR: tensorflow/swift-models#465. It appears that this trick doesn't work cross-module for some reason. Do you know what I'm doing wrong, or should these be filed as issues?
@saeta Can you be specific about what doesn't work, or tell me how to use your PR to reproduce a compilation failure?
@dabrahams I believe this is the error in question: tensorflow/swift-models#465 (comment)
Yup, @brettkoonce is exactly right. I played around with it more today, and it turns out that this was a SwiftPM non-determinism bug. I minimized it down to a trivial example: https://bugs.swift.org/browse/SR-12688

@dabrahams: You alluded to "given the right library pieces"; do you think you could share a bit of what you have in mind? (Either in textual or executable form?) :-)
@saeta I don't have anything very specific in mind. Maybe it's just a matter of perspective. If you view the model code as the use of an EDSL for describing the computation it performs, and there are places where you know the output shapes can be deduced at runtime from input shapes, it's clear you don't need to specify them in the model. The problem of translating a given specification (model) into code/data structures for actually performing the computation is separable, and not bounded, for example, by the idea that constructing the thing called […]
We are closing this one for now!

Tagging @shadaj who will be pushing on this.

For those following along at home, check out @shadaj's work in: tensorflow/swift-models#584