Scoring #538

Merged: 60 commits merged into awslabs:master on Sep 28, 2018

Conversation

@mjpost (Contributor) commented Sep 18, 2018

This implements scoring by fully reusing the training computation graph, per @bricksdont's original suggestion. It replaces PR #413.

It's nearly done, but I thought I'd submit it as WIP since it makes some changes required to generalize many of the procedures and I want to be sure they look okay to you. These include:

  • Making validation data optional when creating data iterators
  • Adding a new data iterator fill_up policy that pads an uneven batch with zeros and does not randomly permute batches (a toy sketch follows this list)
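
To illustrate the zero fill-up idea mentioned in the second bullet, here is a toy sketch in plain numpy. It is not Sockeye's iterator code; it only shows the general behavior described: the final, uneven batch is evened out with all-zero rows instead of random duplicates, and batch order is left unchanged.

```python
import numpy as np

# Toy illustration of a "pad with zeros" fill-up policy (not Sockeye internals).
batch_size = 4
sentences = np.arange(1, 11).reshape(10, 1)   # 10 toy "sentences"

# Rows needed to make the last batch full.
num_pad = (-len(sentences)) % batch_size       # 2 here
padded = np.vstack([sentences,
                    np.zeros((num_pad, sentences.shape[1]), dtype=sentences.dtype)])

# Batches keep their original order; the last batch simply ends in zero rows,
# which the scorer can recognize and ignore.
batches = np.split(padded, len(padded) // batch_size)
```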

The CLI is similar to training's: you pass either --prepared-data or --source X --target Y, and it includes length-penalty parameters. Currently the output is the model score and the source-side sentence; I'll parameterize this in the future, since it would be convenient to directly output negative log probabilities, for example, among other variations.
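
For concreteness, here is a hedged sketch of what an invocation could look like, wrapped in Python for illustration. It assumes the scorer can be run with `python -m sockeye.score` like the other Sockeye CLIs and uses only the flags named above; the `--model` flag for the trained model directory is an assumption to verify against the actual argument parser, and the length-penalty flags are not shown.

```python
import subprocess

# Sketch only: invoke the new scorer on a parallel source/target pair.
result = subprocess.run(
    [
        "python", "-m", "sockeye.score",
        "--model", "model_dir",        # assumed flag for the trained model directory
        "--source", "data/test.src",
        "--target", "data/test.trg",
    ],
    stdout=subprocess.PIPE,
    check=True,
)
# Per the description above, the current output is the model score plus the
# source-side sentence, one line per input pair.
print(result.stdout.decode("utf-8"))
```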

I haven't tested this for speed yet but it should be pretty fast since it's using the BucketingModule (particularly, fast relative to the inference-based version).

Pull Request Checklist

  • Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box.)
  • Unit tests pass (pytest)
  • Were system tests modified? If so, did you run them at least 5 times to account for variation across runs?
  • System tests pass (pytest test/system)
  • Passed code style checking (./style-check.sh)
  • You have considered writing a test
  • Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
  • Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ciyongch commented

From the user's perspective, when should we use ScoringModel? Are there any particular scenarios? Thanks!

@mjpost (Contributor, Author) commented Sep 19, 2018

The most straightforward application is using the CLI to get the model's score for translating one sentence into another. Since this PR reuses the training computation graph, scoring is very fast. People have used this, for example, for corpus-level filtering (§4), but there are many other things one could use it for.

@bricksdont (Contributor) commented

@ciyongch Another prominent application is contrastive evaluation of NMT models.

@bricksdont (Contributor) commented Sep 19, 2018

Why does the output have the source sentence but not the target? I think a good default is to just output the model scores, one line per input sentence pair. Otherwise, we need output handlers for scoring.

@mjpost (Contributor, Author) commented Sep 19, 2018

That was just an arbitrary choice that I used for debugging (the sentences weren't processed in order until I figured out how to turn off random permutations in the data iterator). I agree it should just output the score, with command-line parameters for optionally outputting other information. I'll add those.

Options (defaults are the first items):

  • `score-type` = {neglogprob, logprob}
  • `output` = {score, id, source, target}
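
If these land as CLI options, a hypothetical argparse sketch of the proposal might look like the following. The flag spellings, the multi-valued `--output`, and the defaults are taken from the list above but are not the final implementation.

```python
import argparse

parser = argparse.ArgumentParser(description="sockeye.score options (proposal sketch)")
# Defaults are the first items in the lists above.
parser.add_argument("--score-type", choices=["neglogprob", "logprob"],
                    default="neglogprob",
                    help="Output negative log probabilities or log probabilities.")
parser.add_argument("--output", nargs="+", choices=["score", "id", "source", "target"],
                    default=["score"],
                    help="Which fields to include in the output, one line per sentence pair.")

args = parser.parse_args(["--score-type", "logprob"])
print(args.score_type, args.output)   # logprob ['score']
```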

@fhieber (Contributor) commented Sep 25, 2018

There is still a failing integration test (an assert about score equality).

@mjpost (Contributor, Author) commented Sep 25, 2018

Yeah, it runs fine locally; I'm not sure what the issue is, but I'll figure it out.

@mjpost (Contributor, Author) commented Sep 25, 2018

It looks like the issue is with --skip-topk. This makes sense, but I don't understand why it's passing locally.

@mjpost (Contributor, Author) commented Sep 25, 2018

Okay, sometimes the outputs were empty due to max_seq_len being overrun, leading to the tests trivially passing.

@tdomhan, do you know what the intended semantics of max_seq_len (for training) are? Is it supposed to include the hidden <eos> symbol or not?
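
To make the question concrete, here is a tiny illustration of the two possible readings, in plain Python rather than Sockeye internals:

```python
# Illustration only: the two readings of max_seq_len discussed above.
tokens = ["this", "is", "a", "sentence"]   # 4 visible tokens
max_seq_len = 5

# Reading 1: max_seq_len counts only the visible tokens; <eos> is added on top.
fits_without_eos = len(tokens) <= max_seq_len          # 4 <= 5 -> True

# Reading 2: max_seq_len must also cover the appended <eos> symbol, so a
# 5-token sentence would overrun here while still passing under reading 1.
fits_with_eos = (len(tokens) + 1) <= max_seq_len       # 5 <= 5 -> True

print(fits_without_eos, fits_with_eos)
```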

@mjpost (Contributor, Author) commented Sep 26, 2018

This last pytest failure seems to be caused by some poorly trained models producing translations that contain multiple <unk> and <s> tokens, which sockeye.score then does not score correctly.

@mjpost mentioned this pull request Sep 26, 2018

@mjpost (Contributor, Author) commented Sep 26, 2018

Okay, the tests are finally passing. If the churn here gives you cause for concern, please be assuaged by the following explanation: the general approach I've taken in writing the tests is to run sockeye.score against the output of sockeye.translate in common.py. However, the models that are trained here for testing are very poor, so the following occurs frequently and randomly:

  • The output is empty
  • The output includes garbage symbols like <s> and <unk>

Furthermore, scoring cannot work in a number of situations supported by regular inference:

  • Scoring doesn't support skip_softmax=True, which is set whenever the beam size is 1 and there is a single model.
  • Scoring skips sentences that are too long, whereas inference splits them up, translates them sequentially, and reassembles them.

Tracking these down took some trial and error. When any of these situations is present, I simply skip the scoring test in common.py. I hope I have covered them all.
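
This is not the actual common.py code, but a sketch of the kind of guard described above; the function name, signature, and thresholds are hypothetical.

```python
def should_skip_scoring(source: str, translated: str, beam_size: int,
                        num_models: int, max_seq_len: int) -> bool:
    """Hypothetical helper: decide whether to skip scoring a translate output."""
    # Empty or garbage output from the tiny, poorly trained test models.
    if not translated.strip() or "<unk>" in translated or "<s>" in translated:
        return True
    # skip_softmax is enabled when the beam size is 1 and there is a single
    # model, which scoring does not support.
    if beam_size == 1 and num_models == 1:
        return True
    # Scoring skips over-long sentences, while inference splits, translates,
    # and reassembles them, so the outputs would not line up.
    if len(source.split()) > max_seq_len:
        return True
    return False
```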

@fhieber (Contributor) left a comment

Thanks for polishing this! Looks good to me. Before merging, could you run the system tests to make sure all of these pass? Thanks!

@mjpost (Contributor, Author) commented Sep 27, 2018 via email

@fhieber (Contributor) commented Sep 27, 2018

Travis runs only a small subset for commits and a larger set with the nightly cron.
Good to hear that they pass.

@fhieber (Contributor) commented Sep 27, 2018

One more thing: can you add an entry point for the new CLI in setup.py?
Thanks!
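
For reference, here is a sketch of the kind of entry point being requested. It assumes the new CLI exposes a main() in sockeye/score.py and follows the naming pattern of the existing console scripts (e.g. sockeye-train); verify both against the real setup.py before relying on it.

```python
# setup.py (excerpt) -- sketch only; merge into the existing setup() call.
from setuptools import setup

setup(
    # ... existing arguments such as name, version, packages ...
    entry_points={
        "console_scripts": [
            # Assumed script name and target module/function.
            "sockeye-score = sockeye.score:main",
        ]
    },
)
```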

@fhieber merged commit 5a50d96 into awslabs:master on Sep 28, 2018