String interning #21

vzhong · 2018-08-22T21:39:41Z

@jekbradbury and @bmccann recently discovered a huge performance oversight in another tokenization library by @jekbradbury: adding string interning improved DecaNLP performance by something like 100x. It dawned on me that we don't seem to do this in this Python client. So are the output annotations storing a bazillion copies of words, gloss, tags, whitespace, etc.? Can you confirm or deny this?

For reference the issue in question is here: jekbradbury/revtok#4
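
To make the idea concrete, here is a minimal sketch of what interning annotation strings could look like in Python. It is not this client's code: the dict-based token layout, the field names (`word`, `gloss`, `pos`) and the helper `intern_token` are all hypothetical. The only real API used is the standard library's `sys.intern`, which returns one shared object for equal strings.

```python
import sys


def intern_token(token: dict) -> dict:
    """Return a copy of a token annotation with its plain-str fields interned.

    Hypothetical helper: the dict layout is an assumption, not this client's API.
    """
    return {
        sys.intern(key): sys.intern(value) if type(value) is str else value
        for key, value in token.items()
    }


# Strings are built at runtime here to mimic values parsed from a server
# response, where each occurrence would usually be a separate object.
raw_tokens = [
    {"word": "".join(["th", "e"]), "gloss": "".join(["th", "e"]), "pos": "DT", "index": i}
    for i in range(3)
]

interned_tokens = [intern_token(tok) for tok in raw_tokens]

# After interning, equal strings across tokens are the same object in memory.
same_object = interned_tokens[0]["word"] is interned_tokens[2]["word"]
```

If the annotations really do hold one copy per occurrence, something along these lines would collapse repeated words and tags into single shared objects. Whether it gives a speedup anywhere near the one reported for revtok would need to be measured.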

vzhong added the enhancement label Aug 22, 2018
