Scoring #538

Merged: 60 commits merged into awslabs:master on Sep 28, 2018

Conversation

@mjpost (Contributor) commented Sep 18, 2018

This implements scoring by fully reusing the training computation graph, per @bricksdont's original suggestion. It replaces PR #413.

It's nearly done, but I thought I'd submit it as WIP since it makes some changes required to generalize many of the procedures and I want to be sure they look okay to you. These include:

  • Making validation data optional when creating data iterators
  • Adding a new data iterator fill_up policy that pads an uneven batch with zeros and does not randomly permute batches (a toy sketch follows this list)
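
To illustrate the zero fill-up idea mentioned in the second bullet, here is a toy sketch in plain numpy. It is not Sockeye's iterator code; it only shows the general behavior described: the final, uneven batch is evened out with all-zero rows instead of random duplicates, and batch order is left unchanged.

```python
import numpy as np

# Toy illustration of a "pad with zeros" fill-up policy (not Sockeye internals).
batch_size = 4
sentences = np.arange(1, 11).reshape(10, 1)   # 10 toy "sentences"

# Rows needed to make the last batch full.
num_pad = (-len(sentences)) % batch_size       # 2 here
padded = np.vstack([sentences,
                    np.zeros((num_pad, sentences.shape[1]), dtype=sentences.dtype)])

# Batches keep their original order; the last batch simply ends in zero rows,
# which the scorer can recognize and ignore.
batches = np.split(padded, len(padded) // batch_size)
```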

The CLI is similar to training's: you pass either --prepared-data or --source X --target Y, and it includes length-penalty parameters. Currently the output is the model score and the source-side sentence; I'll parameterize this in the future, since it would be convenient to directly output negative log probabilities, for example, among other variations.
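
For concreteness, here is a hedged sketch of what an invocation could look like, wrapped in Python for illustration. It assumes the scorer can be run with `python -m sockeye.score` like the other Sockeye CLIs and uses only the flags named above; the `--model` flag for the trained model directory is an assumption to verify against the actual argument parser, and the length-penalty flags are not shown.

```python
import subprocess

# Sketch only: invoke the new scorer on a parallel source/target pair.
result = subprocess.run(
    [
        "python", "-m", "sockeye.score",
        "--model", "model_dir",        # assumed flag for the trained model directory
        "--source", "data/test.src",
        "--target", "data/test.trg",
    ],
    stdout=subprocess.PIPE,
    check=True,
)
# Per the description above, the current output is the model score plus the
# source-side sentence, one line per input pair.
print(result.stdout.decode("utf-8"))
```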

I haven't tested this for speed yet but it should be pretty fast since it's using the BucketingModule (particularly, fast relative to the inference-based version).

Pull Request Checklist

  • Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box.)
  • Unit tests pass (pytest)
  • Were system tests modified? If so, did you run them at least 5 times to account for variation across runs?
  • System tests pass (pytest test/system)
  • Passed code style checking (./style-check.sh)
  • You have considered writing a test
  • Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
  • Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ciyongch commented

From the user's perspective, when should we use ScoringModel? Are there any particular scenarios? Thanks!

@mjpost (Contributor, Author) commented Sep 19, 2018

The most straightforward application is using the CLI to get the model's score for translating one sentence into another. Since this PR reuses the training computation graph, scoring is very fast. People have used this, for example, for corpus-level filtering (§4), but there are many other things one could use it for.

@bricksdont (Contributor) commented

@ciyongch Another prominent application is contrastive evaluation of NMT models.

@bricksdont (Contributor) commented Sep 19, 2018

Why does the output have the source sentence but not the target? I think a good default is to just output the model scores, one line per input sentence pair. Otherwise, we need output handlers for scoring.

@mjpost (Contributor, Author) commented Sep 19, 2018

That was just an arbitrary choice that I used for debugging (the sentences weren't processed in order until I figured out how to turn off random permutations in the data iterator). I agree it should just output the score, with command-line parameters for optionally outputting other information. I'll add those.

Options (defaults are the first items):

  • `score-type` = {neglogprob, logprob}
  • `output` = {score, id, source, target}
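
If these land as CLI options, a hypothetical argparse sketch of the proposal might look like the following. The flag spellings, the multi-valued `--output`, and the defaults are taken from the list above but are not the final implementation.

```python
import argparse

parser = argparse.ArgumentParser(description="sockeye.score options (proposal sketch)")
# Defaults are the first items in the lists above.
parser.add_argument("--score-type", choices=["neglogprob", "logprob"],
                    default="neglogprob",
                    help="Output negative log probabilities or log probabilities.")
parser.add_argument("--output", nargs="+", choices=["score", "id", "source", "target"],
                    default=["score"],
                    help="Which fields to include in the output, one line per sentence pair.")

args = parser.parse_args(["--score-type", "logprob"])
print(args.score_type, args.output)   # logprob ['score']
```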

@fhieber (Contributor) commented Sep 25, 2018

There is still a failing integration test (an assert about score equality).

@mjpost (Contributor, Author) commented Sep 25, 2018

Yeah, it runs fine locally; I'm not sure what the issue is, but I'll figure it out.

@mjpost (Contributor, Author) commented Sep 25, 2018

It looks like the issue is with --skip-topk. This makes sense, but I don't understand why it's passing locally.

@mjpost (Contributor, Author) commented Sep 25, 2018

Okay, sometimes the outputs were empty due to max_seq_len being overrun, leading to the tests trivially passing.

@tdomhan, do you know what the intended semantics of max_seq_len (for training) are? Is it supposed to include the hidden <eos> symbol or not?
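
To make the question concrete, here is a tiny illustration of the two possible readings, in plain Python rather than Sockeye internals:

```python
# Illustration only: the two readings of max_seq_len discussed above.
tokens = ["this", "is", "a", "sentence"]   # 4 visible tokens
max_seq_len = 5

# Reading 1: max_seq_len counts only the visible tokens; <eos> is added on top.
fits_without_eos = len(tokens) <= max_seq_len          # 4 <= 5 -> True

# Reading 2: max_seq_len must also cover the appended <eos> symbol, so a
# 5-token sentence would overrun here while still passing under reading 1.
fits_with_eos = (len(tokens) + 1) <= max_seq_len       # 5 <= 5 -> True

print(fits_without_eos, fits_with_eos)
```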

@mjpost (Contributor, Author) commented Sep 26, 2018

This last pytest failure seems to be caused by some poorly trained models producing translations that contain multiple <unk> and <s> tokens, which sockeye.score then does not score correctly.

@mjpost mentioned this pull request Sep 26, 2018

@mjpost (Contributor, Author) commented Sep 26, 2018

Okay, the tests are finally passing. If the churn here gives you cause for concern, please be assuaged by the following explanation: the general approach I've taken in writing the tests is to run sockeye.score against the output of sockeye.translate in common.py. However, the models that are trained here for testing are very poor, so the following occurs frequently and randomly:

  • The output is empty
  • The output includes garbage symbols like <s> and <unk>

Furthermore, scoring cannot work in a number of situations supported by regular inference:

  • Scoring doesn't support skip_softmax=True, which is set whenever the beam size is 1 and there is a single model.
  • Scoring skips sentences that are too long, whereas inference splits them up, translates them sequentially, and reassembles them.

Tracking these down took some trial and error. When any of these situations is present, I simply skip the scoring test in common.py. I hope I have covered them all.
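
This is not the actual common.py code, but a sketch of the kind of guard described above; the function name, signature, and thresholds are hypothetical.

```python
def should_skip_scoring(source: str, translated: str, beam_size: int,
                        num_models: int, max_seq_len: int) -> bool:
    """Hypothetical helper: decide whether to skip scoring a translate output."""
    # Empty or garbage output from the tiny, poorly trained test models.
    if not translated.strip() or "<unk>" in translated or "<s>" in translated:
        return True
    # skip_softmax is enabled when the beam size is 1 and there is a single
    # model, which scoring does not support.
    if beam_size == 1 and num_models == 1:
        return True
    # Scoring skips over-long sentences, while inference splits, translates,
    # and reassembles them, so the outputs would not line up.
    if len(source.split()) > max_seq_len:
        return True
    return False
```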

@fhieber (Contributor) left a comment

Thanks for polishing this! Looks good to me. Before merging, could you run the system tests to make sure all of these pass? Thanks!

@mjpost (Contributor, Author) commented Sep 27, 2018 via email

@fhieber (Contributor) commented Sep 27, 2018

Travis runs only a small subset for commits and a larger set with the nightly cron.
Good to hear that they pass.

@fhieber (Contributor) commented Sep 27, 2018

One more thing: can you add an entry point for the new CLI in setup.py?
Thanks!
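
For reference, here is a sketch of the kind of entry point being requested. It assumes the new CLI exposes a main() in sockeye/score.py and follows the naming pattern of the existing console scripts (e.g. sockeye-train); verify both against the real setup.py before relying on it.

```python
# setup.py (excerpt) -- sketch only; merge into the existing setup() call.
from setuptools import setup

setup(
    # ... existing arguments such as name, version, packages ...
    entry_points={
        "console_scripts": [
            # Assumed script name and target module/function.
            "sockeye-score = sockeye.score:main",
        ]
    },
)
```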

@fhieber merged commit 5a50d96 into awslabs:master on Sep 28, 2018