Description
`translate.py` ignores all sentences after processing `shard_size*shard_size` sentences in `src` (if no `tgt` file is provided). The reason for this lies in lines 19 to 20 in 9865443.
Essentially, if no `tgt` is provided, `tgt_shards` becomes a list of `shard_size` `None`s, while `src_shards` is a generator that yields `num_lines/shard_size` elements. When `num_lines/shard_size` is greater than `shard_size`, the remaining elements of `src_shards` are ignored, because `tgt_shards` runs out first. An example might make this clearer:
`shard_size` = 2

`src`: a file with six lines (a, b, c, d, e, f)

`tgt`: `None`
In this case, the following shards are computed (lines 18 to 21 in 9865443):

`src_shards`: generator([[a,b],[c,d],[e,f]])

`tgt_shards`: [None, None]

`shard_pairs`: zip(src_shards, tgt_shards) ==> [ ([a,b], None), ([c,d], None) ]
`[e,f]` is completely ignored, since there is no corresponding element in `tgt_shards` for `zip` to pair it with.
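The truncation can be reproduced in isolation. The following is a minimal sketch, not the code from the repository: `split_corpus` here is a hypothetical stand-in that shards an in-memory list, and the source is a list of strings rather than a file.

```python
def split_corpus(lines, shard_size):
    """Hypothetical stand-in: yield consecutive shards of at most shard_size lines."""
    for start in range(0, len(lines), shard_size):
        yield lines[start:start + shard_size]


shard_size = 2
src = ["a", "b", "c", "d", "e", "f"]

src_shards = split_corpus(src, shard_size)
# Sized by shard_size rather than by the number of shards: this is the bug.
tgt_shards = [None] * shard_size

# zip stops at the shorter input, so the third source shard is never paired.
shard_pairs = list(zip(src_shards, tgt_shards))
print(shard_pairs)  # [(['a', 'b'], None), (['c', 'd'], None)]
```

Only `shard_size * shard_size` = 4 of the 6 sentences reach the output.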
The bug is that the length of `tgt_shards` should be `num_shards` and not `shard_size`, but since we don't read the entire file, we don't know what `num_shards` is at this point.
A potential solution is that when `tgt` is `None`, `tgt_shards` becomes an infinite generator of `None`, in which case the `zip` will be limited by the number of source shards, which is what we want.
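A rough sketch of that idea, using `itertools.repeat(None)` as the infinite `None` generator. The names `split_corpus` and `make_shard_pairs` are hypothetical helpers that work on in-memory lists for illustration; the actual change in `translate.py` may look different.

```python
from itertools import repeat


def split_corpus(lines, shard_size):
    """Hypothetical stand-in: yield consecutive shards of at most shard_size lines."""
    for start in range(0, len(lines), shard_size):
        yield lines[start:start + shard_size]


def make_shard_pairs(src_lines, tgt_lines, shard_size):
    """Pair every source shard with a target shard, or with None if there is no tgt."""
    src_shards = split_corpus(src_lines, shard_size)
    if tgt_lines is not None:
        tgt_shards = split_corpus(tgt_lines, shard_size)
    else:
        # Infinite stream of None, so zip is bounded by src_shards alone.
        tgt_shards = repeat(None)
    return zip(src_shards, tgt_shards)


pairs = list(make_shard_pairs(["a", "b", "c", "d", "e", "f"], None, 2))
print(pairs)  # [(['a', 'b'], None), (['c', 'd'], None), (['e', 'f'], None)]
```

Since `zip` stops as soon as `src_shards` is exhausted, the infinite iterator is never consumed past the last source shard, and `num_shards` does not need to be known in advance.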
If this all makes sense, I can send in a PR. Happy to clarify and/or discuss something I might have missed!