add 'Word Mover's Distance' implementation to gensim? #482

Closed
@gojomo

Description

"From Word Embeddings to Document Distances" (Kusner et al., 2015: http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf) introduces the "Word Mover's Distance" (WMD), a novel measure of distance between text documents. It is an adaptation of another distance metric, "Earth Mover's Distance" (EMD), introduced(?) in "A Metric for Distributions with Applications to Image Databases" (Rubner et al., 1998: http://www.cs.jhu.edu/~misha/Papers/Rubner98.pdf). EMD is useful for comparing images and can be computed as a special case of the much older transportation problem. For text, WMD leverages the word2vec vectors of the documents' individual words, in a way that seems to outperform simple combinations (sum/mean) of those word vectors.
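To make the "special case of the transportation problem" concrete, here is a minimal sketch that solves WMD as a small linear program with `scipy.optimize.linprog`. It is not the paper authors' code and not a proposed gensim API: the function names `nbow` and `wmd` and the 2-D `toy_vectors` are made up for illustration, and real use would substitute trained word2vec vectors. It assumes normalized bag-of-words weights and Euclidean distance between word vectors, as in the paper.

```python
import numpy as np
from scipy.optimize import linprog


def nbow(tokens, vocab):
    """Normalized bag-of-words weights of `tokens` over `vocab` (sums to 1)."""
    counts = np.array([tokens.count(w) for w in vocab], dtype=float)
    return counts / counts.sum()


def wmd(doc_a, doc_b, vectors):
    """Word Mover's Distance between two token lists (illustrative only).

    `vectors` maps word -> 1-D numpy array; it stands in for word2vec vectors.
    """
    words_a = sorted(set(doc_a))
    words_b = sorted(set(doc_b))
    d_a = nbow(doc_a, words_a)
    d_b = nbow(doc_b, words_b)
    n, m = len(words_a), len(words_b)

    # cost[i, j] = Euclidean distance between word i of A and word j of B
    cost = np.array([[np.linalg.norm(vectors[wa] - vectors[wb])
                      for wb in words_b] for wa in words_a])

    # The flow matrix T (n x m) is flattened row-major: T[i, j] -> i * m + j
    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        a_eq[i, i * m:(i + 1) * m] = 1.0  # word i of A ships out exactly d_a[i]
    for j in range(m):
        a_eq[n + j, j::m] = 1.0           # word j of B receives exactly d_b[j]
    b_eq = np.concatenate([d_a, d_b])

    result = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq, bounds=(0, None))
    if not result.success:
        raise RuntimeError(result.message)
    return float(result.fun)


# Made-up 2-D vectors, chosen only so the answer is easy to check by hand.
toy_vectors = {
    "obama": np.array([0.0, 0.0]),
    "president": np.array([0.0, 1.0]),
    "media": np.array([1.0, 0.0]),
    "press": np.array([1.0, 1.0]),
}
distance = wmd(["obama", "media"], ["president", "press"], toy_vectors)
print(distance)
```

In this toy case each word has weight 0.5 and its nearest counterpart is at distance 1, so the optimal flow costs about 1.0. A generic LP solver like this scales poorly with vocabulary size, which is part of why a specialized EMD solver (and Cython) would matter for a real implementation.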

There's a blog post from OpenTable with some impressive examples of "similar sentences" using WMD on restaurant reviews: http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/

The paper reports strong kNN classifier results on document classification, and intuitively, the method seems amenable to the selection (or even synthesis) of 'canonical' or 'borderline/contrastive' text segments, perhaps to assist iterative classification tasks.

The authors' code is available as a ZIP bundle (http://matthewkusner.com/#page2), but its Python wrappers depend on some older C EMD code of unclear licensing status. There's a fair amount of other EMD code around from other projects (especially image similarity measures) that might also be a good starting point.

An implementation (perhaps in optimized Cython) for gensim might be widely useful.

(Further out: it might be interesting to use WMD to induce a document embedding. That is, train doc embeddings on random draws of 3 documents, nudging the embeddings so that which two of the three are closest is the same in the embedding space as under the WMD metric.)
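One possible reading of that idea, as a rough sketch only: take a precomputed matrix of pairwise WMD values, repeatedly draw three documents, and apply a hinge-style gradient step whenever the pair that is closest under WMD is not closest (by a margin) in the embedding. The squared-distance hinge loss, the `margin` and `lr` values, and all function names here are assumptions for illustration, and the `ref_dist` numbers are placeholders rather than real WMD results.

```python
import random

import numpy as np


def closest_pair(dist, i, j, k):
    """Return (p, q, r): p and q are the closest two of the triple, r the third."""
    candidates = [(i, j, k), (i, k, j), (j, k, i)]
    return min(candidates, key=lambda t: dist[t[0]][t[1]])


def pairwise(emb):
    """Euclidean distance matrix between the rows of `emb`."""
    diff = emb[:, None, :] - emb[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def triplet_step(emb, ref_dist, i, j, k, lr=0.05, margin=0.1):
    """Nudge `emb` in place so the triple's closest pair agrees with `ref_dist`."""
    p, q, r = closest_pair(ref_dist, i, j, k)
    grad = {idx: np.zeros_like(emb[idx]) for idx in (p, q, r)}
    for x, y in ((p, q), (q, p)):
        d_xy = emb[x] - emb[y]
        d_xr = emb[x] - emb[r]
        # Hinge term: want |x - y|^2 + margin <= |x - r|^2
        if margin + d_xy @ d_xy - d_xr @ d_xr > 0:
            grad[x] += 2 * d_xy - 2 * d_xr
            grad[y] += -2 * d_xy
            grad[r] += 2 * d_xr
    for idx, g in grad.items():
        emb[idx] -= lr * g


def train(ref_dist, dim=2, steps=2000, seed=0):
    """Learn a `dim`-dimensional embedding from random triples of documents."""
    rng = random.Random(seed)
    n = len(ref_dist)
    emb = np.random.default_rng(seed).normal(scale=0.1, size=(n, dim))
    for _ in range(steps):
        i, j, k = rng.sample(range(n), 3)
        triplet_step(emb, ref_dist, i, j, k)
    return emb


# Placeholder distances standing in for WMD between three documents.
ref_dist = [[0.0, 1.0, 5.0],
            [1.0, 0.0, 5.0],
            [5.0, 5.0, 0.0]]
doc_embeddings = train(ref_dist)
```

Only the ordering within each triple is used, not the WMD magnitudes, which matches the "which two are closest" framing. Whether this trains well on a real corpus is an open question.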

Labels: difficulty medium, feature