potential 'alias method' negative-sampling optimization from 'Koan' paper #3292

Open
@gojomo

Description

Per the discussion on #1873: while I'm not convinced the 'Koan' paper about CBOW has identified a genuine error or a clear results improvement, the authors do appear to have a faster negative-sampling method. Quoting my comment there:

Their use of an 'alias method' for negative sampling looks like a separate, functionally-neutral but performance-enhancing change, which could make a nice standalone optimization. I don't fully understand it yet – I think it'd use more memory than our current 'binary-search-over-list-of-split-points' approach, but sample in O(1) (ignoring cache-locality effects) rather than O(log V), even for arbitrarily large vocabularies. (The original word2vec.c approach of pre-filling a giant array was also O(1), but at a massive cost in memory that then degraded its real performance through cache misses, versus the binary search I replaced it with a long while back.)

While there's a chance it won't meet Gensim's needs due to memory overhead, or will prove overly complicated for a small benefit, it's worth investigating further as a potential enhancement. It's likely a small & self-contained change that'd be straightforward to tackle & test/evaluate separately from everything else – a rough sketch of the approach is below. (Maybe a "good first issue" for a new contributor?)
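
For concreteness, here's a minimal Python sketch of the alias method (Vose's variant) applied to a counts^0.75 noise distribution, as word2vec uses. It's illustrative only – the function names, NumPy usage, and toy counts are assumptions for this sketch, not Gensim's actual Cython implementation – but it shows the O(vocab) table build and the O(1) per-sample draw:

```python
import numpy as np

def build_alias_table(word_counts, power=0.75):
    """Build alias-method tables for the smoothed unigram noise distribution.

    word_counts: sequence of raw word frequencies, indexed by word id.
    power: smoothing exponent applied to counts (0.75 in word2vec).
    Returns (prob, alias), each of length len(word_counts).
    """
    weights = np.asarray(word_counts, dtype=np.float64) ** power
    scaled = weights / weights.sum() * len(weights)  # mean of scaled entries is 1.0

    n = len(scaled)
    prob = np.zeros(n)                   # probability of keeping bucket i's own word
    alias = np.zeros(n, dtype=np.int64)  # word id stored as bucket i's "overflow"

    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]

    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # Donate mass from the large entry to top up the small bucket to exactly 1.0.
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)

    for leftover in small + large:       # numerical leftovers get probability 1.0
        prob[leftover] = 1.0

    return prob, alias


def draw_negative(prob, alias, rng):
    """Draw one negative-sample word id in O(1): pick a bucket, flip a biased coin."""
    i = rng.integers(len(prob))
    return int(i) if rng.random() < prob[i] else int(alias[i])


# Toy usage with made-up counts for a 5-word vocabulary.
rng = np.random.default_rng(0)
prob, alias = build_alias_table([100, 40, 25, 10, 5])
negatives = [draw_negative(prob, alias, rng) for _ in range(10)]
```

Memory-wise this keeps two vocabulary-length arrays (`prob` and `alias`) versus the single array of cumulative split points used by the current binary-search approach – roughly 2x the table memory in exchange for constant-time draws.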
