potential 'alias method' negative-sampling optimization from 'Koan' paper #3292
Description
Per discussion on #1873, while I'm not convinced the 'Koan' paper about CBOW has identified an error or a clear results-improvement, its authors do appear to have a faster negative-sampling method. Per my comment there:
Their use of an 'alias method' for negative-sampling looks like a separate, functionally-neutral but performance-enhancing change. It could be a nice separate optimization. I don't fully understand it yet – I think it'd use more memory than our current 'binary-search-over-list-of-split-points' approach, but remains O(1) (ignoring cache-locality effects) for arbitrarily large vocabularies. (The original `word2vec.c` approach of pre-filling a giant array was O(1) at a massive cost in memory, which then also degraded its real performance via cache misses, vs the binary search I replaced it with a long while back.)
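For reference, here's a minimal sketch of the alias method (Vose's variant) in plain Python – not Gensim's or Koan's actual implementation, and all names here are hypothetical. It shows the memory trade-off concretely: two O(vocab)-size tables built once up-front, in exchange for O(1) draws with no binary search.

```python
import random

def build_alias_table(weights):
    """Build alias tables (Vose's algorithm) from unnormalized weights.

    Setup is O(n); every subsequent draw is O(1) regardless of vocabulary
    size. Memory cost: one float table + one int table, each of length n.
    """
    n = len(weights)
    total = sum(weights)
    # Scale weights so the average probability mass per slot is exactly 1.0.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # The large entry donates mass to top the small slot up to 1.0.
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    # Any leftovers hold (numerically) exactly 1.0 of their own mass.
    for i in small + large:
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """One O(1) draw: pick a slot uniformly, then flip one biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

For word2vec-style negative sampling the `weights` would be the smoothed unigram counts (frequency raised to the 0.75 exponent); the tables only need rebuilding when the vocabulary changes.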
While there's a chance it might not meet Gensim's needs due to memory overhead, or might prove overly complicated for a small benefit, it's worth investigating further as a potential enhancement. It's likely a small & self-contained change that'd be straightforward to tackle, & test/evaluate, separately from everything else. (Maybe: "good first issue" for a new contributor?)