This release makes a major change to the way embeddings are calculated. While the model weights are unchanged, the main user API for embedding molecular strings has been revised: the previous implementation did not really take structure into account, as it simply performed the einsum over raw nn.Embedding lookups.
The embed_molecule and related methods now run the word embeddings through the encoder, then perform the einsum operation over non-padding tokens. The resulting embeddings should now reflect structural differences. The previous behavior possibly explains why cosine similarities were very close to 1 for many molecules.
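To illustrate why pooling raw lookups can hide structure, here is a minimal NumPy sketch, not the library's actual code. Everything in it is an assumption made for illustration: the character-level tokenizer, the random `token_table` standing in for the nn.Embedding weights, the one-step `toy_encoder`, and the choice of a masked mean as the einsum reduction. It compares ethanol (`CCO`) with dimethyl ether (`COC`), which contain the same tokens in a different order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical character-level vocabulary; the real tokenizer may differ.
VOCAB = {"<pad>": 0, "C": 1, "O": 2}
DIM, MAX_LEN = 8, 4

token_table = rng.normal(size=(len(VOCAB), DIM))  # stand-in for nn.Embedding weights
pos_table = rng.normal(size=(MAX_LEN, DIM))       # stand-in for positional information
W = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)    # toy encoder weights


def tokenize(strings):
    """Map each string to padded token ids plus a 1/0 non-padding mask."""
    ids = np.zeros((len(strings), MAX_LEN), dtype=int)
    for i, s in enumerate(strings):
        ids[i, : len(s)] = [VOCAB[ch] for ch in s]
    return ids, (ids != 0).astype(float)


def masked_mean(hidden, mask):
    """Average hidden states over non-padding tokens using einsum."""
    summed = np.einsum("bsd,bs->bd", hidden, mask)
    return summed / mask.sum(axis=1, keepdims=True)


def toy_encoder(x, mask):
    """Assumed stand-in for the encoder: positions plus one attention-like mix."""
    h = x + pos_table[None, : x.shape[1]]
    scores = np.einsum("bqd,bkd->bqk", h @ W, h) / np.sqrt(DIM)
    scores = np.where(mask[:, None, :] > 0, scores, -1e9)  # ignore padded keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("bqk,bkd->bqd", weights, h)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


ids, mask = tokenize(["CCO", "COC"])
raw = token_table[ids]  # raw embedding lookups, shape (batch, seq, dim)

pooled_raw = masked_mean(raw, mask)                         # pooling raw lookups
pooled_encoded = masked_mean(toy_encoder(raw, mask), mask)  # encoder first, then pool

sim_old = cosine(pooled_raw[0], pooled_raw[1])
sim_new = cosine(pooled_encoded[0], pooled_encoded[1])
print(f"raw lookups:   {sim_old:.6f}")  # 1.0 up to rounding: same tokens, same sum
print(f"after encoder: {sim_new:.6f}")
```

Pooling raw lookups is order-invariant, so any two strings with the same token counts collapse to the same vector. Passing the tokens through a context-dependent encoder first breaks that symmetry, which is the effect this release aims for.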
Full Changelog: v0.1.4...v0.2.0