Add CTC loss to new Losses module #1287
Conversation
Has this code been tested in real applications?
That is a good question. I haven't used this code for an actual project since Google Summer of Code 2018 because the kind of sequence it produces is not useful for my research. So, I've been working on a couple of projects this weekend to make sure it scales up to something beyond the test cases, fixing things as necessary. They're a little slow because I hacked them together without regard for runtime, so I'll report back when I have confirmation that it's working (hopefully tomorrow). If the runtime is acceptable, maybe one of them would also be a good addition to the model zoo as an example of how to use the loss function.
So, I've spent the last month trying to get a sample network working on the TIMIT task, but I have been having trouble getting the network to converge in a reasonable amount of time. I'm only using the CPU implementation at the moment, and I'm reasonably confident that it's calculating the loss and gradients correctly; at least, they are similar to the outputs from Chainer's CTC function and the Stanford CTC implementation.

As an additional check, I trained networks that should be equivalent in Flux and Chainer. One network in Flux used my CTC implementation, and the other used Chainer's CTC loss and gradients; the Chainer network used its own CTC loss. Strangely, both Flux networks converged very slowly compared to the Chainer network on the same subset. I've checked on both TIMIT and a speech corpus publicly available from the lab I work in, and I got similar results whether I used ADAM or SGD+momentum as the optimizer.

I'm currently looking at some runs through the entire data set my lab has. Training hasn't quite finished, but the convergence time looks similar to what I saw on the subsets of data; each epoch just takes longer to run (approx. 1.5 days). I can post those results when I have them. As far as I can tell, the issue isn't with the loss gradients, or else I would expect the Chainer CTC loss and gradients to give similar performance in Flux as they do in Chainer.

I've created a GitHub repo to demonstrate the behavior I'm seeing, though it is not an exhaustive collection of everything I've tried. It uses the speech corpus my lab has available so that folks don't have to register for or purchase TIMIT if they don't already have it.

So, as far as the CPU implementation goes, it looks like it will eventually train a network successfully on a non-trivial data set, though it takes substantially more epochs in Flux than in other frameworks. At this point, I don't really know how to address the slow convergence. Perhaps the problem, or a good check to perform, would be obvious to someone who knows more about Flux and Zygote's current internals. And maybe I'm just not feeding the custom gradient back correctly. Any thoughts, anyone?
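For what it's worth, a finite-difference comparison is one independent way to sanity-check custom CTC gradients like the ones described above. This is only a sketch: it assumes a `ctc_loss(ŷ, y)` interface taking a classes-by-timesteps matrix and an integer label vector, and the sizes and names here are illustrative, not the PR's actual test code.

```julia
using Flux, Zygote

ŷ = rand(Float32, 5, 10)  # classes × timesteps (illustrative sizes)
y = [1, 3, 3, 2]          # hypothetical integer label sequence

# Gradient via the custom adjoint.
g_adjoint = Zygote.gradient(x -> Flux.Losses.ctc_loss(x, y), ŷ)[1]

# Central finite differences as an independent reference.
function fd_gradient(f, x; ϵ = 1f-3)
    g = zero(x)
    for i in eachindex(x)
        xp = copy(x); xp[i] += ϵ
        xm = copy(x); xm[i] -= ϵ
        g[i] = (f(xp) - f(xm)) / (2ϵ)
    end
    return g
end

g_fd = fd_gradient(x -> Flux.Losses.ctc_loss(x, y), ŷ)
@assert isapprox(g_adjoint, g_fd; rtol = 1f-2)
```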
On a whim, I saw a note in the Zygote source code about broadcasting being difficult to design the autodiff for, so I changed the model to use `map` instead of dot broadcasting. After playing with the gradients, I found that multiplying the gradients from the loss function by 10 resulted in the broadcasting version working as well as the `map` version.

Anyway, I'm hoping I can make a full demo model for both CPU and GPU in the next few days here, and I'll also see if I can create an MWE for the map vs. broadcasting difference and open an issue for that separately.
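To make the map-versus-broadcasting distinction concrete, the two forward passes look roughly like this (a sketch only; the layer dimensions are made up):

```julia
using Flux

m = Chain(Dense(26, 64, relu), LSTM(64, 62))  # hypothetical dimensions
xs = [rand(Float32, 26) for _ in 1:100]       # a sequence of 100 frames

ys_dot = m.(xs)      # dot broadcasting the model over the timesteps
ys_map = map(m, xs)  # the same forward pass written with `map`
```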
General updates after changing example networks to use map instead of dot broadcasting.
This reverts commit f471337.
Update codebase from original package
Sorry for the long delay; teaching this semester has taken up a lot of the time I could've otherwise devoted to resolving this. But I think this may finally be more or less ready. I have replicated the results from a chapter in Alex Graves's (2012) book, which is itself a re-presentation of the results from the original paper. The replication works with both the CPU and GPU implementations for Flux v0.11.1. The replication code and results can be found in a GitHub repo I've made.

I'm not sure if the v0.12.0 entries in the NEWS.md file are frozen yet for an upcoming release, but I added this to the v0.12.0 notes because I hadn't seen that version get a GitHub release yet. I am happy to make a new entry for v0.12.1 if preferred, though.
I feel you, I've been through the same 😭 Except for those mainly stylistic comments, this looks fine.
No, the current adjoint implementation is working fine. I had been thinking through how a small adjoint for a mutation operation might be written so that Zygote could handle the gradients with less code overall, but I didn't make much progress there, and I didn't commit any of the code I was toying with.
I am happy to just focus on correctness for now. I am more confident in the unsplit kernels, so I could revert the last commit and make sure the tests still pass (and they very well should), after which I believe both implementations would be correct. I have been running the tests for the GPU functions on my own machine as I've been making these changes, so the GPU function should be fine too. Some of the tests compare the GPU output with the CPU output, so the two should be equivalent as long as the tests pass.
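The CPU/GPU comparison tests mentioned here amount to checks of roughly this shape (a sketch, not the PR's test suite; the exact argument types the GPU path accepts are an assumption):

```julia
using Flux, CUDA, Test

ŷ = rand(Float32, 5, 10)  # classes × timesteps, illustrative sizes
y = [1, 3, 2]             # integer label sequence

# The GPU implementation should agree with the CPU one up to
# floating-point tolerance.
@test Flux.Losses.ctc_loss(cu(ŷ), y) ≈ Flux.Losses.ctc_loss(ŷ, y)
```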
A common pattern we have for the adjoint definitions would look like this:

```julia
ctc_loss(yhat::AbstractArray, y) = ...
ctc_loss(yhat::CuArray, y) = ...

∇ctc_loss(yhat::AbstractArray, y, out) = ...
∇ctc_loss(yhat::CuArray, y, out) = ...

@adjoint function ctc_loss(yhat, y)
    out = ctc_loss(yhat, y)
    ctc_loss_pullback(Δ) = (Δ .* ∇ctc_loss(yhat, y, out), nothing)
    return out, ctc_loss_pullback
end
```

If more calculations are shared between `ctc_loss` and `∇ctc_loss`, the pattern can be adjusted to reuse them.
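With an adjoint in this shape, Zygote picks up the hand-written gradient automatically. A hypothetical call, assuming `yhat` and `y` are defined and the stubs above are filled in:

```julia
using Zygote

# Zygote routes the backward pass through ∇ctc_loss rather than
# differentiating the primal implementation.
loss, back = Zygote.pullback(ctc_loss, yhat, y)
∂yhat, _ = back(one(loss))  # i.e., (Δ .* ∇ctc_loss(yhat, y, loss), nothing)
```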
I think that most recent commit should take care of the interface and adjoint definitions. I have another commit I haven't pushed yet because I'm not sure whether it would be better to wait and open a new PR for it. It would change the way the target argument is handled so that it accepts a onecold vector instead of a onehot matrix. This makes it easier to handle input that involves repeated labels, such as the repeated "m" in "programming". The current implementation can handle it, but it requires the user to put a blank symbol between the repeated characters so they don't get collapsed (e.g., the onehot matrix equivalent of "program_ming"); see the sketch below.
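As a sketch of the difference (the label set and blank symbol here are made up, not the PR's actual alphabet):

```julia
using Flux: onehotbatch

alphabet = collect("abcdefghijklmnopqrstuvwxyz_")  # '_' standing in for the blank

# Onehot interface: the repeated "m" needs an explicit blank between the
# two occurrences so CTC's collapsing rule doesn't merge them.
target_onehot = onehotbatch(collect("program_ming"), alphabet)

# Onecold-style interface: a plain vector of label indices, with the
# repeat expressed directly.
target_onecold = [findfirst(==(c), alphabet) for c in "programming"]
```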
It would be better to decide now whether to commit to one of the two interfaces or to both.
Are there any cons? Why should we support the onehot representation at all?
I would prefer to have the onecold representation, and I don't think there are really any cons. The onehot expectation was a holdover from when I first implemented the function during GSoC '18 and was still coming to grips with how CTC was implemented. The only real plus for the onehot representation is that it parallels the standard format for cross-entropy calculations, but I don't think that's a particularly compelling reason to keep it. I'm currently re-running my test networks to verify that the loss still works with the onecold representation and can push those changes once I see that the networks still train.
My test networks have now trained to a similar level of accuracy as when we had the onehot representation.
Co-authored-by: Carlo Lucibello <carlo.lucibello@gmail.com>
Sorry about that. My text editor inserted some tab characters for some reason but didn't display the file any differently on my end. I have replaced them with spaces and checked the other files in the PR for any errant tabs (no other files seemed to have been affected, though).
It's been a long journey, thanks for your effort!

bors r+
1287: Add CTC loss to new Losses module r=CarloLucibello a=maetshju

This is a redux of adding the connectionist temporal classification loss from #342, now that the Losses module has been merged in #1264. Discussion in #342 suggested that a new PR would be easier than rebasing.

Since the last commit in #342, functions and data structures from `CUDAnative.jl` and `CuArrays.jl` have been updated to work with `CUDA.jl`. This is in addition to incorporating the loss function into the Losses module.

### PR Checklist

- [X] Tests are added
- [X] Entry in NEWS.md
- [X] Documentation, if applicable
- [ ] Final review from `@dhairyagandhi96` (for API changes).

Co-authored-by: Matt Kelley <matthew.curtis.kelley@gmail.com>
Co-authored-by: Matthew C. Kelley <matthew.curtis.kelley@gmail.com>
bors r+
Build succeeded:
@CarloLucibello this does count as an API change and was merged errantly. We still don't want different implementations for CPU and GPU, and this shouldn't change that.
Thanks for all the work @maetshju; hopefully we will be able to see the two implementations converge soon. We might need to tweak some things in cuda.jl, so it's good to have a reference to compare against.
Thank you @CarloLucibello and @DhairyaLGandhi for all your attention and reviews on this code. Please don't hesitate to tag me in related future discussions or bug reports if you would like my input.