
Add CTC loss to new Losses module #1287

Merged
merged 36 commits into from
Jan 20, 2021

Conversation

maetshju
Contributor

This is a redux of adding the connectionist temporal classification loss from #342, now that the Losses module has been merged in #1264. Discussion in #342 suggested that a new PR would be easier than rebasing.

Since the last commit in #342, functions and data structures from CUDAnative.jl and CuArrays.jl have been updated to work with CUDA.jl. This is in addition to incorporating the loss function into the Losses module.

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • Final review from @dhairyagandhi96 (for API changes).

@CarloLucibello
Member

Has this code been tested in real applications?

@maetshju
Contributor Author

That is a good question. I haven't used this code for an actual project since Google Summer of Code 2018, because the kind of sequence it produces is not useful for my research. So I've been working on a couple of projects this weekend to make sure it scales up to something beyond the test cases, fixing things as necessary. They're a little slow because I hacked them together without regard for runtime, so I'll report back when I have confirmation that it's working (hopefully tomorrow). If the runtime is acceptable, one of them might also make a good addition to the model zoo as an example of how to use the loss function.

@maetshju
Contributor Author

maetshju commented Aug 26, 2020

So, I've spent the last month trying to get a sample network working on the TIMIT task, but I have been having trouble getting it to converge in a reasonable amount of time. I'm only using the CPU implementation at the moment, and I'm reasonably confident that it calculates the loss and gradients correctly; at least, they are similar to the outputs from Chainer's CTC function and the Stanford CTC implementation.

As an additional check, I trained networks that should be equivalent in Flux and Chainer. One network in Flux used my CTC implementation, and the other used Chainer's CTC loss and gradients; the Chainer network used its own CTC loss. Strangely, both Flux networks converged very slowly compared to the Chainer network on the same subset. I've checked on both TIMIT and a speech corpus publicly available from the lab I work in, and I got similar results whether I used ADAM or SGD+momentum as the optimizer. I'm currently looking at some runs through the entire data set my lab has; training hasn't quite finished, but the convergence time looks similar to that on the subsets of data, just with each epoch taking longer to run (approx. 1.5 days). I can post those results when I have them.

As far as I can tell, the issue isn't with the loss gradients or else I would expect that using the Chainer CTC loss and gradients would give similar performance in Flux as it does in Chainer. I've created a GitHub repo to demonstrate the behavior I'm seeing, though it is not an exhaustive collection of everything I've tried. It uses the speech corpus my lab has available so that folks don't have to try to register for or purchase TIMIT if they don't already have it.

So, as far as the CPU implementation goes, it looks like it will eventually train a network successfully on a non-trivial data set, though it takes substantially more epochs in Flux than in other frameworks. At this point, I don't really know how to address the slow convergence. Perhaps what's wrong, or what a good check to perform would be, is obvious to someone who knows more about Flux's and Zygote's current internals. And maybe I'm just not feeding the custom gradient back correctly. Any thoughts, anyone?

@maetshju
Contributor Author

maetshju commented Aug 28, 2020

On a whim, after seeing a note in the Zygote source code about broadcasting being difficult to design autodiff for, I changed the model to use map instead of broadcasting to deal with the sequences. Suddenly, the Flux version converged properly!

After playing with the gradients, I found that multiplying the gradients from the loss function by 10 resulted in the broadcasting version working as well as the map version. Multiplying the learning rate of the optimizer by 10 had a similar effect when using broadcasting.

Anyway, I'm hoping I can make a full demo model for both CPU and GPU in the next few days here, and I'll also see if I can create an MWE for the map vs. broadcasting difference and open an issue for that separately.
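For reference, a minimal sketch of the two calling styles in question. A plain function stands in for the Flux model here so the sketch runs without Flux; the name `m` and the data are purely illustrative:

```julia
# Stand-in for a per-timestep model (a real case would use a Flux layer).
m(x) = 2 .* x

# A sequence of feature frames, represented as a vector of vectors.
xs = [fill(Float64(t), 3) for t in 1:5]

ys_broadcast = m.(xs)    # broadcasting the model over the sequence
ys_map = map(m, xs)      # mapping the model over the sequence

# The forward results are identical; the difference observed in this PR
# was in the gradients Zygote produced for the broadcasted form.
ys_broadcast == ys_map   # true
```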

General updates after changing example networks to use map instead
of dot broadcasting.
@maetshju
Contributor Author

maetshju commented Dec 7, 2020

Sorry for the long delay; teaching this semester has taken up a lot of the time I could've otherwise devoted to resolving this.

But, I think this may finally be more or less ready. I have replicated the results from a chapter in Alex Graves's (2012) book, which is itself a re-presentation of the results from the original paper. The replication works with both the CPU and GPU implementations for Flux v0.11.1. The replication code and results can be found in this GitHub repo I've made, which has the same ctc.jl and ctc-gpu.jl files as in this commit. At one point, I had tested using Tracker for AD instead of Zygote and gotten similar results, though I have not reported on that in the repo.

I'm not sure if the v0.12.0 entries in the NEWS.md file are frozen yet for an upcoming release, but I added this to the v0.12.0 notes because I hadn't seen that version get a GitHub release yet. I am happy to make a new entry for v0.12.1 if preferred, though.

@CarloLucibello
Member

Sorry for the long delay; teaching this semester has taken up a lot of the time I could've otherwise devoted to resolving this.

I feel you, I've been through the same 😭

Except for those mainly stylistic comments, this looks fine.

@maetshju
Contributor Author

So the current adjoint implementation is broken?

No, the current adjoint implementation is working fine. I had been thinking through how a small adjoint for a mutation operation might be defined so that Zygote could handle the gradients with less code overall, but I didn't make much progress there, and I didn't commit any of the code I was toying with.

Let's leave aside any further performance considerations (those can eventually be addressed in later PRs), the interface with cuDNN, and so on; let's just focus on correctness.

I am happy to just focus on correctness for now. I am more confident in the unsplit kernels, so I could revert the last commit and make sure the tests still pass (and they very well should), after which I believe both implementations would be correct. I have been running the tests for the GPU functions on my own machine as I've been making these changes, so the GPU function should be fine too. Some of the tests compare the GPU output with the CPU output, so they should be equivalent as long as the tests are passed.

@CarloLucibello
Member

CarloLucibello commented Jan 1, 2021

A common pattern we use for adjoint definitions looks like this:

ctc_loss(yhat::AbstractArray, y) = ... 
ctc_loss(yhat::CuArray, y) = ... 

∇ctc_loss(yhat::AbstractArray, y, out) = ... 
∇ctc_loss(yhat::CuArray, y, out) = ...
 
@adjoint function ctc_loss(yhat, y)
     out = ctc_loss(yhat, y)
     ctc_loss_pullback(Δ) = (Δ .* ∇ctc_loss(yhat, y, out), nothing)
     return out, ctc_loss_pullback
end

If more calculations are shared between ctc_loss and ∇ctc_loss, they could be factored out and reused in the adjoint definition.
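As a self-contained instance of this pattern, here is a toy quadratic loss in place of the CTC loss, so it runs without Flux or CUDA; the names are illustrative, not Flux's:

```julia
# Toy loss and its hand-written gradient, mirroring the
# ctc_loss / ∇ctc_loss split sketched above.
toy_loss(yhat::AbstractArray, y) = sum(abs2, yhat .- y) / 2
∇toy_loss(yhat::AbstractArray, y, out) = yhat .- y  # ∂loss/∂yhat

# With Zygote loaded, the adjoint would then be declared as:
#
#   Zygote.@adjoint function toy_loss(yhat, y)
#       out = toy_loss(yhat, y)
#       toy_loss_pullback(Δ) = (Δ .* ∇toy_loss(yhat, y, out), nothing)
#       return out, toy_loss_pullback
#   end

yhat, y = [1.0, 2.0], [0.0, 0.0]
toy_loss(yhat, y)                    # 2.5
∇toy_loss(yhat, y, toy_loss(yhat, y))  # [1.0, 2.0]
```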

@maetshju
Contributor Author

maetshju commented Jan 3, 2021

I think that most recent commit should take care of the interface and adjoint definitions.

I have another one I haven't pushed yet because I'm not sure whether it would be better to wait and open a new PR for it. It would change the way the target argument is handled so that it accepts a onecold vector instead of a onehot matrix. This makes it easier to handle input that involves repeated labels, such as the repeated "m" in "programming". The current implementation can handle such input, but it requires the user to put a blank symbol between the repeated characters so they don't get collapsed (e.g., the onehot-matrix equivalent of "program_ming").
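To illustrate the difference between the two target encodings (the alphabet, blank symbol, and variable names here are assumptions for the sketch, not the PR's actual conventions):

```julia
alphabet = collect("abcdefghijklmnopqrstuvwxyz_")  # '_' plays the blank here

# onecold-style target: a plain vector of label indices. Repeated labels,
# like the double 'm', stay distinct, so no manual blank insertion is needed.
label = "programming"
y_onecold = [findfirst(==(c), alphabet) for c in label]

# onehot-style target: under the old interface, the user had to break up
# repeats with a blank ("program_ming") before one-hot encoding, or CTC's
# collapsing rule would merge the repeated 'm'.
padded = "program_ming"
y_onehot = zeros(Int, length(alphabet), length(padded))
for (t, c) in enumerate(padded)
    y_onehot[findfirst(==(c), alphabet), t] = 1
end
```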

@CarloLucibello
Member

It is better to decide now whether to commit to one of the two interfaces or to both.
I am no expert in NLP, so I'm not the best judge here, but I see pros for the onecold-only implementation.

Are there any cons? Why should we support the onehot representation at all?

@maetshju
Contributor Author

maetshju commented Jan 3, 2021

I would prefer to have the onecold representation, and I don't think there are really any cons. The onehot expectation was a holdover from when I first implemented the function during GSoC '18 and was still coming to grips with how CTC worked. The only sort of plus for the onehot representation is that it parallels the standard format for cross-entropy calculations, but I don't think that's a particularly compelling reason to keep it.

I'm currently re-running my test networks to verify that the loss still works with the onecold representation and can push those changes once I see that the networks still train.

@maetshju
Contributor Author

My test networks have now trained to a similar level of accuracy as with the onehot representation. This commit also removes the F function from the CTC code because it is no longer used in calculating the loss. If we would like to provide such a function, I think it might fit in utils, but it's really more a function for interpreting the network outputs, so I'm not sure we necessarily need to provide it here.

maetshju and others added 2 commits January 16, 2021 15:53
@maetshju
Contributor Author

Sorry about that. My text editor inserted some tab characters for some reason but didn't display the file any differently on my end. I have replaced them with spaces and checked the other files in the PR for any errant tabs (no other files seemed to have been affected though).

@CarloLucibello
Member

It's been a long journey, thanks for your effort!

bors r+

bors bot added a commit that referenced this pull request Jan 20, 2021
1287: Add CTC loss to new Losses module r=CarloLucibello a=maetshju

(commit message repeats the PR description above)

Co-authored-by: Matt Kelley <matthew.curtis.kelley@gmail.com>
Co-authored-by: Matthew C. Kelley <matthew.curtis.kelley@gmail.com>
@CarloLucibello
Member

bors r+

@bors
Contributor

bors bot commented Jan 20, 2021

Build succeeded:

@bors bors bot merged commit dd28321 into FluxML:master Jan 20, 2021
@DhairyaLGandhi
Member

@CarloLucibello this does count as an API change and was merged errantly. We still don't want different implementations for CPU and GPU, and this shouldn't change that.

@DhairyaLGandhi
Member

Thanks for all the work @maetshju; hopefully we'll be able to see the two implementations converge soon. We might need to tweak some things in CUDA.jl, so it's good to have a reference to compare against.

@maetshju maetshju mentioned this pull request Jan 20, 2021
@maetshju
Contributor Author

Thank you @CarloLucibello and @DhairyaLGandhi for all your attention and reviews on this code. Please don't hesitate to tag me in related future discussions or bug reports if you would like my input.
