Description
I've been using this codebase to handle some new datasets. It has helped me a lot, but I found a few places where there might be bugs or unclear descriptions.
- The length of the target is not truncated to max_output_len for the pretrained models. If it exceeds max_len, then at
  MWPToolkit/mwptoolkit/model/PreTrain/robertagen.py line 173 (or bertgen.py line 173),
  decoder_inputs = self.pos_embedder(self.out_embedder(target))
  the sequence length exceeds pos_embedder's maximum length. See the sketch below for a possible guard.
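
A minimal sketch of one way to guard against this, assuming the position embedder was built with max_output_len positions; the helper name and the usage lines are hypothetical, not the toolkit's actual API:

```python
import torch

def truncate_target(target: torch.Tensor, max_output_len: int) -> torch.Tensor:
    """Clip the decoder target to the position embedder's capacity.

    Hypothetical helper: max_output_len is assumed to equal the number
    of positions pos_embedder was constructed with.
    """
    if target.size(1) > max_output_len:
        target = target[:, :max_output_len]
    return target

# Usage sketch, before the embedding call at line 173:
#   target = truncate_target(target, self.max_output_len)
#   decoder_inputs = self.pos_embedder(self.out_embedder(target))
```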
- For GTS, the code is not generalized to datasets with constants other than 1 and 3.14, which causes a tensor size mismatch (mwptoolkit/model/Seq2Tree/gts.py, around line 904):
  if mask_flag: num_score2[i][:2] = -1e10  # for the first iterations, do not generate 1 and 3.14
  The hardcoded [:2] assumes there are exactly two constants; a generalized mask is sketched below.
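
A minimal sketch of a generalized mask, assuming generate_nums holds the dataset's constant list; the helper name is hypothetical:

```python
import torch

def mask_constants(num_score_row: torch.Tensor, generate_nums: list) -> torch.Tensor:
    """Mask every dataset constant instead of hardcoding two slots.

    Hypothetical helper: generate_nums is assumed to be the list of
    constants for the current dataset; the original code assumed it
    was exactly [1, 3.14], hence the hardcoded [:2].
    """
    num_score_row = num_score_row.clone()
    num_score_row[: len(generate_nums)] = -1e10
    return num_score_row

# Usage sketch, replacing the hardcoded slice:
#   if mask_flag:
#       num_score2[i] = mask_constants(num_score2[i], generate_nums)
```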
- There might be bugs in from_prefix_to_infix and from_infix_to_prefix in the preprocessing tools. If you convert this equation to prefix and map it back:
  1500/(((100+12)-(100-12))/100)
  it yields the following, where the grouping of (100+12) and (100-12) is lost:
  1500/(100+12-100-12)/100
  The same happens with * and /, where parentheses are dropped as well: 1/(1-(1/(2*2))) is mapped to 1/(1-1/2*2). A round-trip check is sketched below.
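
For reference, a minimal prefix-to-infix sketch that parenthesizes every binary subexpression round-trips the failing example correctly. This is not the toolkit's implementation; it trades redundant parentheses for correctness:

```python
def prefix_to_infix(tokens):
    """Rebuild infix from a prefix token list, fully parenthesized."""
    def build(it):
        tok = next(it)
        if tok in {"+", "-", "*", "/", "^"}:
            left = build(it)   # recurse on the two operands
            right = build(it)
            return f"({left}{tok}{right})"
        return tok  # operand (number)
    return build(iter(tokens))

# Prefix form of 1500/(((100+12)-(100-12))/100):
tokens = ["/", "1500", "/", "-", "+", "100", "12", "-", "100", "12", "100"]
print(prefix_to_infix(tokens))
# -> (1500/(((100+12)-(100-12))/100)), grouping preserved
```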
- Another small issue: every time a batch is fed, the data is re-preprocessed. This adds a lot of redundant computation when training for many epochs; caching the preprocessed batches (sketched below) would avoid it.
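
A minimal caching sketch, where preprocess_batch and raw_batches are hypothetical stand-ins for the toolkit's per-batch preprocessing and raw data, not its actual API:

```python
class CachedBatchLoader:
    """Preprocess each batch once, then replay the cached results per epoch."""

    def __init__(self, raw_batches, preprocess_batch):
        # preprocess_batch: callable doing the per-batch work (hypothetical name)
        self.raw_batches = raw_batches
        self.preprocess_batch = preprocess_batch
        self._cache = None

    def __iter__(self):
        if self._cache is None:
            # First epoch: run preprocessing and memoize the results.
            self._cache = [self.preprocess_batch(b) for b in self.raw_batches]
        return iter(self._cache)
```

Subsequent epochs then iterate the cached list directly instead of redoing the work.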
Thanks again for this tool!