Hi, there are a couple of issues I want to discuss, mainly focusing on the classes WECoherencePairwise and WECoherenceCentroid, and on the other metrics that use embeddings.
If these issues sound reasonable, I am happy to create a PR for them. Thanks! :)
1. Calculation:
In the score function, the L1 normalization of the word embeddings and of the centroid (t) seems unnecessary: cosine similarity is used afterwards, so the score is identical without the normalization. (See L122 and L196 in https://github.com/MIND-Lab/OCTIS/blob/master/octis/evaluation_metrics/coherence_metrics.py)
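As a quick illustration (toy vectors, not taken from the library), cosine similarity is invariant to positive rescaling such as L1 normalization, so dropping that step would not change the score:

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Rescaling either vector (e.g., dividing by its L1 norm) cancels out in the
# numerator and denominator, so the cosine similarity is unchanged.
a_l1 = a / np.abs(a).sum()
b_l1 = b / np.abs(b).sum()

print(np.isclose(cosine(a, b), cosine(a_l1, b_l1)))  # True
```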
2.1 Default word embedding model:
Take WECoherenceCentroid for example: it loads word2vec-google-news-300 as the default model weights. Since the file is large (1662 MB), downloading it takes a long time, which may not be a good default behavior. Also, some research (see Table 2 in [1]) shows that GloVe (glove-wiki-gigaword-50, 65 MB) may perform better than Word2Vec. So I think it may be a good idea to switch the default model to glove-wiki-gigaword-50, or to another embedding model with a smaller file size.
([1] Ding, Ran, Ramesh Nallapati, and Bing Xiang. "Coherence-aware neural topic modeling." arXiv preprint arXiv:1809.02687 (2018).)
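For reference, the smaller model is available through the same gensim downloader mechanism the proposal below relies on (a minimal sketch, not OCTIS code):

```python
import gensim.downloader as api

# ~65 MB download instead of ~1.6 GB for word2vec-google-news-300.
wv = api.load("glove-wiki-gigaword-50")  # returns a gensim KeyedVectors object
print(wv.vector_size)  # 50
```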
2.2 Parameter name and function input:
It also seems reasonable to be able to compute WECoherence with different embedding models. Currently you can use word2vec_path to pass any model in w2v format (e.g., glove-wiki-gigaword-50 provided by gensim), but the parameter name is really confusing. So if we want to support models other than word2vec, we should rename word2vec_path to something more general.
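For context, this is roughly what the current workaround looks like; the exact keyword arguments of the metric (e.g., a flag for binary vs. text format) may differ, so treat this as a sketch only:

```python
import gensim.downloader as api
from octis.evaluation_metrics.coherence_metrics import WECoherenceCentroid

# Export GloVe vectors to word2vec text format, then pass the file through the
# confusingly named `word2vec_path` parameter. A matching binary/text flag may
# also be needed, depending on the metric's current signature.
glove = api.load("glove-wiki-gigaword-50")
glove.save_word2vec_format("glove-wiki-gigaword-50.w2v.txt", binary=False)

metric = WECoherenceCentroid(word2vec_path="glove-wiki-gigaword-50.w2v.txt")
```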
2.3 KeyedVectors as function input:
Currently, if you want to use several embedding-related metrics at the same time (e.g., WECoherencePairwise and WECoherenceCentroid), every metric instance creates its own embedding model, which is unnecessary. To avoid this, we could take a KeyedVectors object as input directly.
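A sketch of the intended usage; the `model` keyword argument here is the parameter proposed further below, not something the current OCTIS API accepts:

```python
import gensim.downloader as api

# Load the embeddings once...
shared_wv = api.load("glove-wiki-gigaword-50")

# ...and hand the same KeyedVectors object to every embedding-based metric,
# instead of letting each instance download/load its own copy.
pairwise = WECoherencePairwise(model=shared_wv, topk=10)
centroid = WECoherenceCentroid(model=shared_wv, topk=10)
```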
2.4 Support dict-like embedding objects:
When calculating the metrics, we only need embeddings for the few words that appear in the topics, so there is no need to load a whole embedding model. In this case, we can use a dict-like object that stores only the embeddings we need.
In OCTIS, the KeyedVectors object is only used with the in operator to check whether an embedding exists, and with __getitem__ to get the embedding. So we can simply convert a dict-like object to KeyedVectors and make the KeyedVectors-related functions compatible with embeddings passed in dictionary format. Also, we would only need to change code in __init__, without touching more complex code such as WordEmbeddingsInvertedRBOCentroid.score.
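A minimal example of what such a dict-like object would look like, together with the only two operations the metrics perform on it (toy vectors for illustration):

```python
import numpy as np

# Only the words that actually occur in the evaluated topics need vectors.
topic_word_vectors = {
    "apple":  np.array([0.1, 0.2, 0.3]),
    "banana": np.array([0.2, 0.1, 0.4]),
}

# The two operations the metrics rely on:
print("apple" in topic_word_vectors)   # membership check via `in`
print(topic_word_vectors["apple"])     # vector lookup via `__getitem__`
```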
Based on 2.1, 2.2, 2.3, and 2.4, we could change WECoherenceCentroid.__init__ to the following, and all metrics that use embeddings could load their model through the helper function _load_keyedvectors:
```python
from typing import Dict, Optional, Union

import numpy as np
from numpy import ndarray
import gensim.downloader as api
from gensim.models import KeyedVectors
from octis.evaluation_metrics.metrics import AbstractMetric


class WECoherenceCentroid(AbstractMetric):
    def __init__(self, topk=10,
                 model: Union[Dict[str, ndarray], KeyedVectors, None] = None,
                 model_name: Optional[str] = None):
        """If model is provided, use it directly. If model is None, fall back
        to model_name, or to the default model name if neither is given."""
        if model is None and model_name is None:
            model_name = 'some_default_model_name'  # e.g. 'glove-wiki-gigaword-50', see 2.1
        self._wv = _load_keyedvectors(model, model_name)
        self.topk = topk


def _load_keyedvectors(model, model_name):
    """Return a KeyedVectors object from an in-memory model or a downloadable name."""
    if model is not None:
        if isinstance(model, KeyedVectors):
            pass
        elif isinstance(model, dict):
            model = _convert_dict_to_keyedvectors(model)
        else:
            raise ValueError("model must be a KeyedVectors object or a dict of word vectors")
    elif model_name is not None:
        model = api.load(model_name)
    else:
        raise ValueError("either model or model_name must be provided")
    return model


def _convert_dict_to_keyedvectors(dic):
    """Wrap a {word: 1-D ndarray} dict in a KeyedVectors object."""
    arr = next(iter(dic.values()))
    assert len(arr.shape) == 1
    vector_size = arr.shape[0]
    kv = KeyedVectors(vector_size)
    kv.add_vectors(list(dic.keys()), np.stack(list(dic.values())))
    return kv
```
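A usage sketch of the proposed constructor (the model names and word list are just examples):

```python
import gensim.downloader as api

# 1) Fall back to the (ideally smaller) default model:
metric = WECoherenceCentroid(topk=10)

# 2) Reuse a KeyedVectors object that was loaded once (see 2.3):
wv = api.load("glove-wiki-gigaword-50")
metric = WECoherenceCentroid(topk=10, model=wv)

# 3) Pass only the vectors that are actually needed, as a dict (see 2.4):
subset = {w: wv[w] for w in ["apple", "banana", "orange"] if w in wv}
metric = WECoherenceCentroid(topk=10, model=subset)

# 4) Or name any model available through gensim's downloader API:
metric = WECoherenceCentroid(topk=10, model_name="glove-wiki-gigaword-50")
```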