Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

translate.py ignores sentences in src when tgt not provided #1317

Closed
fdalvi opened this issue Feb 21, 2019 · 2 comments
Closed

translate.py ignores sentences in src when tgt not provided #1317

fdalvi opened this issue Feb 21, 2019 · 2 comments

Comments

@fdalvi
Copy link
Contributor

fdalvi commented Feb 21, 2019

translate.py ignores all sentences after processing shard_size*shard_size sentences in src (if no tgt file is provided). The reason for this lies in

OpenNMT-py/translate.py

Lines 19 to 20 in 9865443

tgt_shards = split_corpus(opt.tgt, opt.shard_size) \
if opt.tgt is not None else [None]*opt.shard_size

Essentially, if no tgt is provided, tgt_shards becomes a list of None's of size shard_size, while src_shards is a generator that will generate num_lines/shard_size elements. When num_lines/shard_size becomes greater than shard_size, the rest of the elements in src_shards are ignored, since tgt_shards ends prematurely. An example might make this clearer:

shard_size=2
src:

a
b
c
d
e
f

tgt: None

In this case, the following shards are computed

OpenNMT-py/translate.py

Lines 18 to 21 in 9865443

src_shards = split_corpus(opt.src, opt.shard_size)
tgt_shards = split_corpus(opt.tgt, opt.shard_size) \
if opt.tgt is not None else [None]*opt.shard_size
shard_pairs = zip(src_shards, tgt_shards)

src_shards: generator([[a,b],[c,d],[e,f]])
tgt_shards: [None, None]

shard_pairs: zip(src_shards, tgt_shards) ==> [ ([a,b], None), ([c,d], None) ]

[e,f] is completely ignored, since there is no corresponding element to zip in tgt_shards.

The bug is that tgt_shards should be computed using num_shards and not shard_size, but since we don't read the entire file, we don't know what num_shards is at this point.

A potential solution is that when tgt is None, tgt_shards becomes an infinite None generator, in which case the zip will be limited by number of source shards, which is what we want.

If this all makes sense, I can send in a PR. Happy to clarify and/or discuss something I might have missed!

@vince62s
Copy link
Member

Good catch but:

  1. why would you use a shard_size < 10k and subsequently do you need to translate more than 100M sentences at once ???
  2. sure send a PR, always welcome.
    cheers.

@fdalvi
Copy link
Contributor Author

fdalvi commented Feb 21, 2019

  1. I actually stumbled upon this bug when i was trying to decode an endless stream (like stdin), and wanted to translate each sentence when it became available (so I set shard_size as 1 and encountered the bug)
  2. Will do!

fdalvi added a commit to fdalvi/OpenNMT-py that referenced this issue Feb 21, 2019
`translate.py` would ignore all sentences after processing
`shard_size*shard_size` sentences in `src`, if no `tgt` file
was provided. This commit fixes this.
vince62s pushed a commit that referenced this issue Feb 24, 2019
`translate.py` would ignore all sentences after processing
`shard_size*shard_size` sentences in `src`, if no `tgt` file
was provided. This commit fixes this.
ItaySofer pushed a commit to ItaySofer/OpenNMT-py that referenced this issue Mar 17, 2019
`translate.py` would ignore all sentences after processing
`shard_size*shard_size` sentences in `src`, if no `tgt` file
was provided. This commit fixes this.
ItaySofer pushed a commit to ItaySofer/OpenNMT-py that referenced this issue Mar 17, 2019
`translate.py` would ignore all sentences after processing
`shard_size*shard_size` sentences in `src`, if no `tgt` file
was provided. This commit fixes this.
goncalomcorreia pushed a commit to goncalomcorreia/open-nmt that referenced this issue Apr 17, 2019
`translate.py` would ignore all sentences after processing
`shard_size*shard_size` sentences in `src`, if no `tgt` file
was provided. This commit fixes this.
pryo pushed a commit to pryo/openNMT that referenced this issue Aug 15, 2019
`translate.py` would ignore all sentences after processing
`shard_size*shard_size` sentences in `src`, if no `tgt` file
was provided. This commit fixes this.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants