potential 'alias method' negative-sampling optimization from 'Koan' paper #3292
Description
Per discussion on #1873, while I'm not convinced the 'Koan' paper about CBOW has identified an error or a clear results-improvement, its authors do appear to have a faster negative-sampling method. Per my comment there:
Their use of an 'alias method' for negative-sampling looks like a separate, functionally-neutral but performance-enhancing change. It could be a nice separate optimization. I don't fully understand it yet – I think it'd use more memory than our current 'binary-search-over-list-of-split-points' approach, but remains O(1) (ignoring cache-locality effects) for arbitrarily large vocabularies. (The original `word2vec.c` approach of pre-filling a giant array was O(1) at a massive cost in memory, which then also degraded its real performance via cache misses, vs the binary search I replaced it with a long while back.)
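For reference, here's a minimal sketch of the alias method (Vose's variant) in plain Python – not Gensim's or Koan's actual implementation, and all names here are hypothetical. It shows the memory trade-off concretely: two O(vocab)-size tables built once up-front, in exchange for O(1) draws with no binary search.

```python
import random

def build_alias_table(weights):
    """Build alias tables (Vose's algorithm) from unnormalized weights.

    Setup is O(n); every subsequent draw is O(1) regardless of vocabulary
    size. Memory cost: one float table + one int table, each of length n.
    """
    n = len(weights)
    total = sum(weights)
    # Scale weights so the average probability mass per slot is exactly 1.0.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # The large entry donates mass to top the small slot up to 1.0.
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    # Any leftovers hold (numerically) exactly 1.0 of their own mass.
    for i in small + large:
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """One O(1) draw: pick a slot uniformly, then flip one biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

For word2vec-style negative sampling the `weights` would be the smoothed unigram counts (frequency raised to the 0.75 exponent); the tables only need rebuilding when the vocabulary changes.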
While there's a chance it might not meet Gensim's needs due to memory overhead, or might prove overly complicated for a small benefit, it's worth investigating further as a potential enhancement. It's likely a small & self-contained change that'd be straightforward to tackle, & test/evaluate, separately from everything else. (Maybe: "good first issue" for a new contributor?)