Description
Pre-#2698, `keyedvectors.py` was 2500+ lines, including functionality over-specific to other models, & redundant classes. Post-#2698, with some added generic functionality, it's still over 1800 lines.
It should shed some other grab-bag utility functions that have accumulated, & don't logically fit inside the `KeyedVectors` class.
In particular, the evaluation (analogies, word_ranks) helpers could move to their own module that takes a KV instance as an argument. (If other, more sophisticated evaluations can be contributed, as would be welcome, they should also live alongside those, rather than bloating `KeyedVectors`.)
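As a rough illustration of the "module that takes a KV instance as an argument" idea, here is a minimal sketch of a standalone analogy scorer. The function name `analogy_accuracy` and the `SupportsMostSimilar` protocol are hypothetical, not existing gensim API; the only thing assumed of the argument is a gensim-style `most_similar(positive=..., negative=..., topn=...)` method that returns `(key, score)` pairs and raises `KeyError` for unknown keys.

```python
from typing import Iterable, Protocol, Sequence, Tuple


class SupportsMostSimilar(Protocol):
    """Hypothetical protocol: anything with a gensim-style most_similar()."""

    def most_similar(self, positive: Sequence[str], negative: Sequence[str], topn: int): ...


def analogy_accuracy(
    kv: SupportsMostSimilar,
    analogies: Iterable[Tuple[str, str, str, str]],
) -> float:
    """Fraction of (a, b, c, expected) analogies answered correctly.

    Each analogy reads "a is to b as c is to expected".
    """
    correct = 0
    total = 0
    for a, b, c, expected in analogies:
        total += 1
        try:
            predicted = kv.most_similar(positive=[b, c], negative=[a], topn=1)
        except KeyError:
            # Assumption: out-of-vocabulary keys count as misses.
            continue
        if predicted and predicted[0][0] == expected:
            correct += 1
    return correct / total if total else 0.0
```

Because the function only relies on duck typing, it could live in a separate evaluation module and be called as `analogy_accuracy(kv, [("man", "king", "woman", "queen")])` without `KeyedVectors` needing to know about it.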
The `get_keras_embedding` method, as its utility is narrow to very specific uses, and is conditional on a not-necessarily-installed package, could go elsewhere too: either a keras-focused utilities module, or even just documentation/example code about how to convert to/from keras from `KeyedVectors`.
Some of the more advanced word-vector-using calculations, like 'Word Mover's Distance' or 'Soft Cosine Similarity', could move to method-specific modules that are then better documented/self-contained/optimized, without bloating the generic 'set of vectors' module. (They might be more discoverable there, as well.)
And finally, some of the existing calculations could be unified/streamlined (especially the two variants of `most_similar()`, and some of the steps shared by multiple operations). My hope would be the module is eventually <1000 lines.
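To make the unification idea concrete, here is a toy sketch of how an additive and a multiplicative similarity ranking might share their lookup, exclusion, and top-n steps. It works on a plain dict of vectors rather than on `KeyedVectors`, uses only the standard library, and every name in it (`_rank`, `additive_score`, `multiplicative_score`) is hypothetical. The scores are not meant to match gensim's values exactly; it only shows the shape of a shared helper with a pluggable scoring function.

```python
import heapq
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Scorer = Callable[[Vector, List[Vector], List[Vector]], float]


def _cosine(u: Vector, v: Vector) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def _rank(vectors: Dict[str, Vector], positive: Sequence[str],
          negative: Sequence[str], score: Scorer, topn: int) -> List[Tuple[str, float]]:
    """Shared steps: look up inputs, skip input keys, score, take top-n."""
    pos = [vectors[k] for k in positive]  # raises KeyError on unknown keys
    neg = [vectors[k] for k in negative]
    excluded = set(positive) | set(negative)
    scored = ((key, score(vec, pos, neg))
              for key, vec in vectors.items() if key not in excluded)
    return heapq.nlargest(topn, scored, key=lambda pair: pair[1])


def additive_score(vec: Vector, pos: List[Vector], neg: List[Vector]) -> float:
    """Sum of cosines to positives minus sum of cosines to negatives."""
    return sum(_cosine(vec, p) for p in pos) - sum(_cosine(vec, n) for n in neg)


def multiplicative_score(vec: Vector, pos: List[Vector], neg: List[Vector],
                         eps: float = 1e-6) -> float:
    """Product of shifted cosines to positives over that of negatives."""
    def shifted(other: Vector) -> float:
        return (1.0 + _cosine(vec, other)) / 2.0  # map [-1, 1] into [0, 1]
    return math.prod(shifted(p) for p in pos) / (math.prod(shifted(n) for n in neg) + eps)


def most_similar(vectors, positive, negative=(), topn=10):
    return _rank(vectors, positive, negative, additive_score, topn)


def most_similar_cosmul(vectors, positive, negative=(), topn=10):
    return _rank(vectors, positive, negative, multiplicative_score, topn)
```

With this split, each public variant is a one-line wrapper, and any further scoring rule would only need to supply its own scoring function.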