Scoring #538
Conversation
From the user's perspective, when should we use this?
The most straightforward application is to use it to discover the model scores for translating one sentence to another, via the CLI. Since this PR uses the computation graph, scoring is very fast. People have used this, for example, for corpus-level filtering (§4). But there are many other things one could use it for.
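To make that concrete, here is a minimal sketch (illustrative only, not Sockeye's actual code) of what "scoring with the training computation graph" computes: a single forward pass yields the model's distribution over every target position at once, and the sentence score is the sum of the log-probabilities of the reference target tokens.

```python
# Minimal sketch, not Sockeye's implementation: given the softmax outputs of
# one forward pass over a (source, target) pair, the sentence score is the
# sum of log-probabilities of the reference target tokens.
import numpy as np

def sentence_logprob(token_probs: np.ndarray, target_ids: list) -> float:
    """token_probs: (target_len, vocab_size) softmax outputs from the
    training graph; target_ids: the reference target token ids."""
    return float(sum(np.log(token_probs[t, tok])
                     for t, tok in enumerate(target_ids)))
```

Because this is one batched forward pass rather than step-by-step beam search, it is much faster than scoring through the inference code path.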
@ciyongch Another prominent application is contrastive evaluation of NMT models.
Why does the output have the source sentence but not the target? I think that a good default is to just output the model scores, one line per input sentence pair. Otherwise, we need output handlers for scoring.
That was just an arbitrary choice that I used for debugging (the sentences weren't processed in order until I figured out how to turn off random permutations in the data iterator). I agree it should just output the score, with command-line parameters for optionally outputting other information. I'll add those.
Options (defaults listed first):

- `score-type` = {neglogprob, logprob}
- `output` = {score, id, source, target}
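A hedged sketch of how those two options might combine; the function and parameter names here are hypothetical, chosen only to illustrate the defaults-first semantics, not the PR's actual implementation.

```python
# Hypothetical sketch of the proposed options; names are illustrative.
def format_line(logprob: float, sent_id: int, source: str, target: str,
                score_type: str = "neglogprob",       # default: neglogprob
                output: tuple = ("score",)) -> str:   # default: score only
    value = -logprob if score_type == "neglogprob" else logprob
    fields = {"score": f"{value:.4f}", "id": str(sent_id),
              "source": source, "target": target}
    return "\t".join(fields[f] for f in output)
```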
There is still a failing integration test (on an assertion about score equality).
Yeah, it finally runs fine. I'm not sure what the issue is, but I'll figure it out.
It looks like the issue is with `fill_up`.
Okay, sometimes the outputs were empty because of how the final batch is filled up. @tdomhan, do you know what the intended semantics of `fill_up` are?
The reason being that we don't know which lines have had the length penalty applied.
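For context, a GNMT-style length penalty divides the raw log-probability by a function of the target length. Whether Sockeye uses exactly this form and these defaults is an assumption here, but it illustrates the problem: a normalized score and a raw score are not comparable, so you need to know which lines the penalty was applied to.

```python
# Sketch of a GNMT-style length penalty (Wu et al., 2016). Treating this as
# Sockeye's exact formula and defaults is an assumption for illustration.
def length_penalty(length: int, alpha: float = 1.0, beta: float = 0.0) -> float:
    return ((beta + length) / (beta + 1)) ** alpha

def normalize(logprob: float, length: int) -> float:
    # A normalized score cannot be compared against an unnormalized one.
    return logprob / length_penalty(length)
```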
This last pytest failure seems to be that some poorly-trained models result in translations that contain multiple …
Okay, the tests are finally passing. If the churn here gives you cause for concern, please be assuaged by the following explanation: the general approach I've taken in writing the tests is to run translation, then score the resulting (source, output) pairs and assert that the two sets of scores are equal.
Furthermore, scoring cannot work in a number of situations supported by regular inference.
Tracking these down took trial and error. When any of these situations is present, I simply skip the scoring test (see the sketch below).
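The skip-when-unsupported pattern might look like the following pytest sketch; the feature names and tolerance are hypothetical, and the assertion mirrors the score-equality check described above.

```python
# Illustrative pytest pattern; the unsupported-feature names and the
# tolerance are hypothetical, not taken from the PR.
import pytest

UNSUPPORTED_FOR_SCORING = {"sampling", "lexical-constraints"}  # hypothetical

def check_translate_vs_score(config: dict, translate_scores, rescores):
    if UNSUPPORTED_FOR_SCORING & set(config.get("features", [])):
        pytest.skip("scoring does not support this configuration")
    for a, b in zip(translate_scores, rescores):
        # translation-time and scoring-time scores should agree
        assert abs(a - b) < 1e-4
```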
Thanks for polishing this! Looks good to me. Before merging, could you run the system tests to make sure all of these pass? Thanks!
I did run them on two systems and that helped me uncover some of these cases. Note that Travis runs them, too, right?
matt (from my phone)
Travis runs only a small subset for commits and a larger set with the nightly cron.
One more thing: can you add an entry point for the new CLI in setup.py?
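For reference, a console-script entry point in setup.py typically looks like the sketch below; the script name `sockeye-score` and the `sockeye.score:main` target are assumptions about this PR, not confirmed by the thread.

```python
# Hypothetical setup.py excerpt; the entry-point name and target function
# are assumptions, not necessarily what the PR added.
from setuptools import setup, find_packages

setup(
    name="sockeye",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # makes `sockeye-score` available on the command line
            "sockeye-score = sockeye.score:main",
        ],
    },
)
```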
This implements scoring by fully reusing the training computation graph, per @bricksdont's original suggestion. It replaces PR #413.
It's nearly done, but I thought I'd submit it as WIP since it makes some changes required to generalize many of the procedures and I want to be sure they look okay to you. These include:
- a `fill_up` policy that pads an uneven batch with zeros and does not randomly permute batches (see the sketch below)

The CLI is similar to training: you pass it either `--prepared-data` or `--source X --target Y`. It includes length penalty parameters. Currently the output is the model score and the source-side sentence. I'll parameterize these in the future, since it would be convenient to be able to directly output negative logprobs, for example, and perhaps other variations.

I haven't tested this for speed yet, but it should be pretty fast since it uses the BucketingModule (in particular, fast relative to the inference-based version).
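Here is a minimal sketch of the zero-padding fill-up idea from the list above; this is illustrative, not the PR's code. Padding with zeros (rather than replicating real sentences) keeps every batch rectangular without producing duplicate scores, and skipping the random permutation keeps output lines aligned with input lines.

```python
# Minimal sketch of the zero-padding fill-up idea; illustrative only.
import numpy as np

def fill_up_with_zeros(batch: np.ndarray, batch_size: int) -> np.ndarray:
    """Pad an uneven final batch of shape (n, seq_len), n <= batch_size,
    with all-zero rows so every batch has the same shape."""
    n, seq_len = batch.shape
    if n == batch_size:
        return batch
    pad = np.zeros((batch_size - n, seq_len), dtype=batch.dtype)
    return np.concatenate([batch, pad], axis=0)
```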
Pull Request Checklist

- [ ] Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box)
- [ ] Unit tests pass (`pytest`)
- [ ] System tests pass (`pytest test/system`)
- [ ] Passed code style checking (`./style-check.sh`)
- [ ] Updated major/minor version in `sockeye/__init__.py`. Major version bump if this is a backwards incompatible change.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.