Hey @arunchaganty ,
@jekbradbury and @bmccann recently discovered a huge performance oversight in another tokenization library by @jekbradbury: string interning improved DecaNLP performance by something like 100x. It dawned on me that we don't seem to do this in this Python client, so the output annotations are presumably storing a bazillion copies of words, glosses, tags, whitespace, etc. Can you confirm/deny this?
For reference the issue in question is here: jekbradbury/revtok#4
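For anyone landing here, a minimal sketch of what Python-level interning looks like, via the standard-library `sys.intern`. The `token.word` / `token.pos` field names in the comments are hypothetical, not necessarily this client's actual API:

```python
import sys

# Two equal strings built at runtime are normally distinct objects:
# CPython only interns certain compile-time literals automatically.
a = "".join(["to", "ken"])
b = "".join(["to", "ken"])
assert a == b and a is not b  # equal values, separate allocations

# sys.intern returns one canonical shared object per distinct value,
# so a corpus with many repeated tokens stores each string only once.
a = sys.intern(a)
b = sys.intern(b)
assert a is b  # both names now point at the same object

# In an annotation-building loop this might look like
# (hypothetical field names):
#   token.word = sys.intern(token.word)
#   token.pos  = sys.intern(token.pos)
```

Interning also makes equality checks on interned strings effectively pointer comparisons, which is a nice side benefit on top of the memory savings.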