[model] new word2vec model added which trained on whole wikipedia corpus.
akkefa committed Jun 30, 2018
1 parent f8d5cb8 commit ab9bd76
Showing 5 changed files with 519 additions and 9 deletions.
25 changes: 17 additions & 8 deletions README.md
# NLP Models for the Urdu language.

[![Price](https://img.shields.io/badge/price-FREE-0098f7.svg)](https://github.com/urduhack/models/blob/master/LICENSE)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/urduhack/models/blob/master/LICENSE)

Collection of pretrained ML and NLP models for the Urdu language.

## Table of contents

- [Word2vec Models](#word2vec)
- [Bugs and feature requests](#bugs-and-feature-requests)
- [Contributors](#contributors)
- [Copyright and license](#copyright-and-license)


## Word2vec

Word2vec is a widely used technique that learns relationships between words and converts each word into a vector.

### Web News Data model

- Trained on 50,000 web news posts.
- Semantic Accuracy: 36.89%
- Syntactic Accuracy: 31.25%
- Download link: [download](https://drive.google.com/uc?export=download&id=13KLg3wUTOwWiF_YdAtZFe18j7MQmOWfb)
- Demo: [web_news_data](https://github.com/urduhack/models/blob/master/pretrained_models/word2vec/web_news_data)

### Wikipedia Data model

- Trained on the whole Urdu Wikipedia corpus.
- Semantic Accuracy: 59.59%
- Syntactic Accuracy: 37.50%
- Demo: [wikipedia](https://github.com/urduhack/models/blob/master/pretrained_models/word2vec/wikipedia)

## Bugs and feature requests

Have a bug or a feature request? Please [open a new issue](https://github.com/urduhack/models/issues/new). If you wish to remove or update a feature, file an issue first before sending a PR.
2 changes: 1 addition & 1 deletion pretrained_models/word2vec/web_news_data/model.ipynb
"cell_type": "markdown",
"metadata": {},
"source": [
"## Londing the pretrained Urdu word2vec 300 dimension vector model\n",
"## Loading the pretrained Urdu word2vec 300 dimension vector model\n",
"\n",
"This model trainied on 50,000 news posts data."
]
84 changes: 84 additions & 0 deletions pretrained_models/word2vec/wikipedia/README.md
# Pretrained Word2vec model on the whole Urdu Wikipedia corpus.

## Model Description

- Model trained on 78,482 Urdu Wikipedia articles.
- Model vector size is 300.
- Download link: [download](https://drive.google.com/uc?export=download&id=1yz8RfJeg65QByLs1aJUORtPujHYx_oQP)
- Semantic Accuracy: 59.59%
- Syntactic Accuracy: 37.50%
- The model can be loaded using the Python gensim package.

## Code

```python
import logging

import gensim

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load the pretrained 300-dimension Urdu vectors (binary word2vec format).
model = gensim.models.KeyedVectors.load_word2vec_format('urdu_wikipedia_vector300.bin', binary=True)

# Nearest neighbours of 'پاکستان' (Pakistan).
print(model.most_similar("پاکستان"))
# [('افغانستان', 0.534391462802887),
#  ('پاکستانی', 0.527515172958374),
#  ('بھارت', 0.5176973342895508),
#  ('زمبابوے', 0.5033701062202454)]

# Analogy: دہلی (Delhi) - پنجاب (Punjab) + پاکستان (Pakistan), top 5.
print(model.most_similar(positive=['دہلی', 'پاکستان'], negative=['پنجاب'], topn=5))
# [('دلی', 0.47923001646995544),
#  ('انڈیا', 0.4310738444328308),
#  ('بھارت', 0.4303123652935028),
#  ('پاکستانی', 0.42918506264686584),
#  ('بیجنگ', 0.42072150111198425)]

# Analogy: ٹوکیو (Tokyo) - اسلام_آباد (Islamabad) + پاکستان (Pakistan) gives جاپان (Japan).
print(model.most_similar(positive=['ٹوکیو', 'پاکستان'], negative=['اسلام_آباد']))
# [('جاپان', 0.518461287021637),
#  ('جاپانی', 0.42522647976875305),
#  ('بھارت', 0.3991791605949402),
#  ('دنیا', 0.3974219858646393),
#  ('چین', 0.3774305582046509),
#  ('اوساکا', 0.3636421859264374),
#  ('جاپان،', 0.35131868720054626),
#  ('انڈیا', 0.3293466866016388),
#  ('عالمی', 0.32560476660728455),
#  ('جاپانیوں', 0.3245166540145874)]

# Analogy: بھائی (brother) - لڑکا (boy) + لڑکی (girl) gives بہن (sister).
print(model.most_similar(positive=['بھائی', 'لڑکی'], negative=['لڑکا']))
# [('بہن', 0.5513333082199097),
#  ('والد', 0.532108724117279),
#  ('بیٹی', 0.5085018873214722),
#  ('والدہ', 0.48878273367881775),
#  ('کوقتل', 0.46216732263565063),
#  ('بھائیوں', 0.45481085777282715),
#  ('پولیس', 0.4398535490036011),
#  ('باپ', 0.439206600189209),
#  ('کزن', 0.417349249124527),
#  ('خاتون', 0.4159335494041443)]

# Analogy: دلہن (bride) - دولہا (groom) + شوہر (husband) gives بیوی (wife).
print(model.most_similar(positive=['دلہن', 'شوہر'], negative=['دولہا']))
# [('بیوی', 0.6536001563072205),
#  ('خاوند', 0.6006074547767639),
#  ('طلاق', 0.5600955486297607),
#  ('خاتون', 0.5458393692970276),
#  ('شادی', 0.5421558022499084),
#  ('بیٹی', 0.5145429968833923),
#  ('اداکارہ', 0.4982667863368988),
#  ('ماں', 0.4932785630226135),
#  ('عورت', 0.476948082447052),
#  ('اہلیہ', 0.4722379744052887)]

# Analogy: ملکہ (queen) - بادشاہ (king) + باپ (father) gives ماں (mother).
print(model.most_similar(positive=['ملکہ', 'باپ'], negative=['بادشاہ']))
# [('ماں', 0.5100770592689514),
#  ('بیٹی', 0.4709329605102539),
#  ('بیٹے', 0.42628371715545654),
#  ('رشتے', 0.3735599219799042),
#  ('بیٹوں', 0.3722909986972809),
#  ('بہو', 0.37172698974609375),
#  ('بیوی', 0.3640066385269165),
#  ('بچی', 0.36133867502212524),
#  ('شوہر', 0.36050519347190857),
#  ('بہن', 0.3537209630012512)]
```
