[model] new word2vec model added which trained on whole wikipedia corpus.
akkefa committed Jun 30, 2018
1 parent f8d5cb8 commit ab9bd76
Showing 5 changed files with 519 additions and 9 deletions.
25 changes: 17 additions & 8 deletions README.md
# NLP Models for the Urdu language.

[![Price](https://img.shields.io/badge/price-FREE-0098f7.svg)](https://github.com/urduhack/models/blob/master/LICENSE)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/urduhack/models/blob/master/LICENSE)

Collection of pretrained ML and NLP models for the Urdu language.

## Table of contents

- [Word2vec Models](#word2vec)
- [Bugs and feature requests](#bugs-and-feature-requests)
- [Contributors](#contributors)
- [Copyright and license](#copyright-and-license)


## Word2vec

Word2vec is a widely used technique that learns relationships between words and converts each word into a vector.

### Web News Data model

- Trained on 50,000 web news posts.
- Semantic Accuracy: 36.89%
- Syntactic Accuracy: 31.25%
- Download link: [download](https://drive.google.com/uc?export=download&id=13KLg3wUTOwWiF_YdAtZFe18j7MQmOWfb)
- Demo: [web_news_data](https://github.com/urduhack/models/blob/master/pretrained_models/word2vec/web_news_data)

### Wikipedia Data model

- Trained on the whole Urdu Wikipedia corpus.
- Semantic Accuracy: 59.59%
- Syntactic Accuracy: 37.50%
- Demo: [wikipedia](https://github.com/urduhack/models/blob/master/pretrained_models/word2vec/wikipedia)

## Bugs and feature requests

Have a bug or a feature request? Please [open a new issue](https://github.com/urduhack/models/issues/new). If you wish to remove or update a feature, file an issue first before sending a PR.
2 changes: 1 addition & 1 deletion pretrained_models/word2vec/web_news_data/model.ipynb
"cell_type": "markdown",
"metadata": {},
"source": [
"## Londing the pretrained Urdu word2vec 300 dimension vector model\n",
"## Loading the pretrained Urdu word2vec 300 dimension vector model\n",
"\n",
"This model trainied on 50,000 news posts data."
]
84 changes: 84 additions & 0 deletions pretrained_models/word2vec/wikipedia/README.md
# Pretrained Word2vec model on the whole Urdu Wikipedia corpus.

## Model Description

- Model trained on 78,482 Urdu Wikipedia articles.
- Model vector size is 300.
- Download link: [download](https://drive.google.com/uc?export=download&id=1yz8RfJeg65QByLs1aJUORtPujHYx_oQP)
- Semantic Accuracy: 59.59%
- Syntactic Accuracy: 37.50%
- The model can be loaded using the Python gensim package.

## Code

```python
import logging

import gensim

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load the pretrained 300-dimension Urdu vectors (binary word2vec format).
model = gensim.models.KeyedVectors.load_word2vec_format('urdu_wikipedia_vector300.bin', binary=True)

# Nearest neighbours of 'پاکستان' (Pakistan).
print(model.most_similar("پاکستان"))
# [('افغانستان', 0.534391462802887),
#  ('پاکستانی', 0.527515172958374),
#  ('بھارت', 0.5176973342895508),
#  ('زمبابوے', 0.5033701062202454)]

# Analogy: دہلی (Delhi) - پنجاب (Punjab) + پاکستان (Pakistan), top 5.
print(model.most_similar(positive=['دہلی', 'پاکستان'], negative=['پنجاب'], topn=5))
# [('دلی', 0.47923001646995544),
#  ('انڈیا', 0.4310738444328308),
#  ('بھارت', 0.4303123652935028),
#  ('پاکستانی', 0.42918506264686584),
#  ('بیجنگ', 0.42072150111198425)]

# Analogy: ٹوکیو (Tokyo) - اسلام_آباد (Islamabad) + پاکستان (Pakistan) gives جاپان (Japan).
print(model.most_similar(positive=['ٹوکیو', 'پاکستان'], negative=['اسلام_آباد']))
# [('جاپان', 0.518461287021637),
#  ('جاپانی', 0.42522647976875305),
#  ('بھارت', 0.3991791605949402),
#  ('دنیا', 0.3974219858646393),
#  ('چین', 0.3774305582046509),
#  ('اوساکا', 0.3636421859264374),
#  ('جاپان،', 0.35131868720054626),
#  ('انڈیا', 0.3293466866016388),
#  ('عالمی', 0.32560476660728455),
#  ('جاپانیوں', 0.3245166540145874)]

# Analogy: بھائی (brother) - لڑکا (boy) + لڑکی (girl) gives بہن (sister).
print(model.most_similar(positive=['بھائی', 'لڑکی'], negative=['لڑکا']))
# [('بہن', 0.5513333082199097),
#  ('والد', 0.532108724117279),
#  ('بیٹی', 0.5085018873214722),
#  ('والدہ', 0.48878273367881775),
#  ('کوقتل', 0.46216732263565063),
#  ('بھائیوں', 0.45481085777282715),
#  ('پولیس', 0.4398535490036011),
#  ('باپ', 0.439206600189209),
#  ('کزن', 0.417349249124527),
#  ('خاتون', 0.4159335494041443)]

# Analogy: دلہن (bride) - دولہا (groom) + شوہر (husband) gives بیوی (wife).
print(model.most_similar(positive=['دلہن', 'شوہر'], negative=['دولہا']))
# [('بیوی', 0.6536001563072205),
#  ('خاوند', 0.6006074547767639),
#  ('طلاق', 0.5600955486297607),
#  ('خاتون', 0.5458393692970276),
#  ('شادی', 0.5421558022499084),
#  ('بیٹی', 0.5145429968833923),
#  ('اداکارہ', 0.4982667863368988),
#  ('ماں', 0.4932785630226135),
#  ('عورت', 0.476948082447052),
#  ('اہلیہ', 0.4722379744052887)]

# Analogy: ملکہ (queen) - بادشاہ (king) + باپ (father) gives ماں (mother).
print(model.most_similar(positive=['ملکہ', 'باپ'], negative=['بادشاہ']))
# [('ماں', 0.5100770592689514),
#  ('بیٹی', 0.4709329605102539),
#  ('بیٹے', 0.42628371715545654),
#  ('رشتے', 0.3735599219799042),
#  ('بیٹوں', 0.3722909986972809),
#  ('بہو', 0.37172698974609375),
#  ('بیوی', 0.3640066385269165),
#  ('بچی', 0.36133867502212524),
#  ('شوہر', 0.36050519347190857),
#  ('بہن', 0.3537209630012512)]
```
