Skip to content
This repository was archived by the owner on Oct 24, 2024. It is now read-only.

Commit 5031f8b

Browse files
committed
Merge release branch
2 parents 333fd4d + c3076d3 commit 5031f8b

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

52 files changed

+2495
-767
lines changed

.travis.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@ install:
1818
- pip install annoy
1919
- pip install testfixtures
2020
- pip install unittest2
21+
- pip install scikit-learn
2122
- pip install Morfessor==2.0.2a4
2223
- python setup.py install
2324
script: python setup.py test

CHANGELOG.md

Lines changed: 86 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,72 @@ Changes
44
Unreleased:
55

66

7-
========
7+
===========
8+
9+
2.0.0, 2017-04-10
10+
11+
Breaking changes:
12+
13+
Any direct calls to method train() of Word2Vec/Doc2Vec now require an explicit epochs parameter and explicit estimate of corpus size. The most usual way to call `train` is `vec_model.train(sentences, total_examples=self.corpus_count, epochs=self.iter)`
14+
See the [method documentation](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L766) for more information.
15+
16+
17+
* Explicit epochs and corpus size in word2vec train(). (@gojomo, @robotcator, [#1139](https://github.com/RaRe-Technologies/gensim/pull/1139), [#1237](https://github.com/RaRe-Technologies/gensim/pull/1237))
18+
19+
New features:
20+
21+
* Add output word prediction in word2vec. Only for negative sampling scheme. See [ipynb]( https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb) (@chinmayapancholi13,[#1209](https://github.com/RaRe-Technologies/gensim/pull/1209))
22+
* scikit_learn wrapper for LSI Model in Gensim (@chinmayapancholi13,[#1244](https://github.com/RaRe-Technologies/gensim/pull/1244))
23+
* Add the 'keep_tokens' parameter to 'filter_extremes'. (@toliwa,[#1210](https://github.com/RaRe-Technologies/gensim/pull/1210))
24+
* Load FastText models with specified encoding (@jayantj,[#1210](https://github.com/RaRe-Technologies/gensim/pull/1189))
25+
26+
27+
Improvements:
28+
* Fix loading large FastText models on Mac. (@jaksmid,[#1196](https://github.com/RaRe-Technologies/gensim/pull/1214))
29+
* Sklearn LDA wrapper now works in sklearn pipeline (@kris-singh,[#1213](https://github.com/RaRe-Technologies/gensim/pull/1213))
30+
* glove2word2vec conversion script refactoring (@parulsethi,[#1247](https://github.com/RaRe-Technologies/gensim/pull/1247))
31+
* Word2vec error message when update called before train . Fix #1162 (@hemavakade,[#1205](https://github.com/RaRe-Technologies/gensim/pull/1205))
32+
* Allow training if model is not modified by "_minimize_model". Add deprecation warning. (@chinmayapancholi13,[#1207](https://github.com/RaRe-Technologies/gensim/pull/1207))
33+
* Update the warning text when building vocab on a trained w2v model (@prakhar2b,[#1190](https://github.com/RaRe-Technologies/gensim/pull/1190))
34+
35+
Bug fixes:
36+
37+
* Fix word2vec reset_from bug in v1.0.1 Fix #1230. (@Kreiswolke,[#1234](https://github.com/RaRe-Technologies/gensim/pull/1234))
38+
39+
* Distributed LDA: checking the length of docs instead of the boolean value, plus int index conversion (@saparina ,[#1191](https://github.com/RaRe-Technologies/gensim/pull/1191))
40+
41+
* syn0_lockf initialised with zero in intersect_word2vec_format() (@KiddoZhu,[#1267](https://github.com/RaRe-Technologies/gensim/pull/1267))
42+
43+
* Fix wordrank max_iter_dump calculation. Fix #1216 (@ajkl,[#1217](https://github.com/RaRe-Technologies/gensim/pull/1217))
44+
45+
* Make SgNegative test use sg (@shubhvachher ,[#1252](https://github.com/RaRe-Technologies/gensim/pull/1252))
46+
47+
* pep8/pycodestyle fixes for hanging indents in Summarization module (@SamriddhiJain ,[#1202](https://github.com/RaRe-Technologies/gensim/pull/1202))
48+
49+
* WordRank and Mallet wrappers single vs double quote issue in windows.(@prakhar2b,[#1208](https://github.com/RaRe-Technologies/gensim/pull/1208))
50+
51+
52+
* Fix #824 : no corpus in init, but trim_rule in init (@prakhar2b ,[#1186](https://github.com/RaRe-Technologies/gensim/pull/1186))
53+
54+
* Hardcode version number. Fix #1138. ( @tmylk, [#1138](https://github.com/RaRe-Technologies/gensim/pull/1138))
55+
56+
Tutorial and doc improvements:
57+
58+
* Color dictionary according to topic notebook update (@bhargavvader, [#1164](https://github.com/RaRe-Technologies/gensim/pull/1164))
59+
60+
* Fix hdp show_topic/s docstring ( @parulsethi, [#1264](https://github.com/RaRe-Technologies/gensim/pull/1264))
61+
62+
* Add docstrings for word2vec.py forwarding functions ( @shubhvachher, [#1251](https://github.com/RaRe-Technologies/gensim/pull/1251))
63+
64+
* updated description for worker_loop function used in score function ( @chinmayapancholi13 , [#1206](https://github.com/RaRe-Technologies/gensim/pull/1206))
65+
66+
1.0.1, 2017-03-03
67+
68+
* Rebuild cumulative table on load. Fix #1180. (@tmylk,[#1181](https://github.com/RaRe-Technologies/gensim/pull/893))
69+
* most_similar_cosmul bug fix (@dkim010, [#1177](https://github.com/RaRe-Technologies/gensim/pull/1177))
70+
* Fix loading old word2vec models pre-1.0.0 (@jayantj, [#1179](https://github.com/RaRe-Technologies/gensim/pull/1179))
71+
* Load utf-8 words in fasttext (@jayantj, [#1176](https://github.com/RaRe-Technologies/gensim/pull/1176))
72+
873

974
1.0.0, 2017-02-24
1075

@@ -28,35 +93,35 @@ Improvements:
2893
* Phrases and Phraser allow a generator corpus (ELind77 [#1099](https://github.com/RaRe-Technologies/gensim/pull/1099))
2994
* Ignore DocvecsArray.doctag_syn0norm in save. Fix #789 (@accraze,[#1053](https://github.com/RaRe-Technologies/gensim/pull/1053))
3095
* Fix bug in LsiModel that occurs when id2word is a Python 3 dictionary. (@cvangysel,[#1103](https://github.com/RaRe-Technologies/gensim/pull/1103)
31-
* Fix broken link to paper in readme (@bhargavvader,[#1101](https://github.com/RaRe-Technologies/gensim/pull/1101))
32-
* Lazy formatting in evaluate_word_pairs (@akutuzov,[#1084](https://github.com/RaRe-Technologies/gensim/pull/1084))
96+
* Fix broken link to paper in readme (@bhargavvader,[#1101](https://github.com/RaRe-Technologies/gensim/pull/1101))
97+
* Lazy formatting in evaluate_word_pairs (@akutuzov,[#1084](https://github.com/RaRe-Technologies/gensim/pull/1084))
3398
* Deacc option to keywords pre-processing (@bhargavvader,[#1076](https://github.com/RaRe-Technologies/gensim/pull/1076))
34-
* Generate Deprecated exception when using Word2Vec.load_word2vec_format (@tmylk, [#1165](https://github.com/RaRe-Technologies/gensim/pull/1165))
35-
* Fix hdpmodel constructor docstring for print_topics (#1152) (@toliwa, [#1152](https://github.com/RaRe-Technologies/gensim/pull/1152))
36-
* Default to per_word_topics=False in LDA get_item for performance (@menshikh-iv, [#1154](https://github.com/RaRe-Technologies/gensim/pull/1154))
99+
* Generate Deprecated exception when using Word2Vec.load_word2vec_format (@tmylk, [#1165](https://github.com/RaRe-Technologies/gensim/pull/1165))
100+
* Fix hdpmodel constructor docstring for print_topics (#1152) (@toliwa, [#1152](https://github.com/RaRe-Technologies/gensim/pull/1152))
101+
* Default to per_word_topics=False in LDA get_item for performance (@menshikh-iv, [#1154](https://github.com/RaRe-Technologies/gensim/pull/1154))
37102
* Fix bound computation in Author Topic models. (@olavurmortensen, [#1156](https://github.com/RaRe-Technologies/gensim/pull/1156))
38103
* Write UTF-8 byte strings in tensorboard conversion (@tmylk,[#1144](https://github.com/RaRe-Technologies/gensim/pull/1144))
39104
* Make top_topics and sparse2full compatible with numpy 1.12 strictly int idexing (@tmylk,[#1146](https://github.com/RaRe-Technologies/gensim/pull/1146))
40105

41106
Tutorial and doc improvements:
42107

43-
* Clarifying comment in is_corpus func in utils.py (@greninja,[#1109](https://github.com/RaRe-Technologies/gensim/pull/1109))
108+
* Clarifying comment in is_corpus func in utils.py (@greninja,[#1109](https://github.com/RaRe-Technologies/gensim/pull/1109))
44109
* Tutorial Topics_and_Transformations fix markdown and add references (@lgmoneda,[#1120](https://github.com/RaRe-Technologies/gensim/pull/1120))
45-
* Fix doc2vec-lee.ipynb results to match previous behavior (@bahbbc,[#1119](https://github.com/RaRe-Technologies/gensim/pull/1119))
110+
* Fix doc2vec-lee.ipynb results to match previous behavior (@bahbbc,[#1119](https://github.com/RaRe-Technologies/gensim/pull/1119))
46111
* Remove Pattern lib dependency in News Classification tutorial (@luizcavalcanti,[#1118](https://github.com/RaRe-Technologies/gensim/pull/1118))
47112
* Corpora_and_Vector_Spaces tutorial text clarification (@lgmoneda,[#1116](https://github.com/RaRe-Technologies/gensim/pull/1116))
48113
* Update Transformation and Topics link from quick start notebook (@mariana393,[#1115](https://github.com/RaRe-Technologies/gensim/pull/1115))
49114
* Quick Start Text clarification and typo correction (@luizcavalcanti,[#1114](https://github.com/RaRe-Technologies/gensim/pull/1114))
50115
* Fix typos in Author-topic tutorial (@Fil,[#1102](https://github.com/RaRe-Technologies/gensim/pull/1102))
51116
* Address benchmark inconsistencies in Annoy tutorial (@droudy,[#1113](https://github.com/RaRe-Technologies/gensim/pull/1113))
52-
* Add note about Annoy speed depending on numpy BLAS setup in annoytutorial.ipynb (@greninja,[#1137](https://github.com/RaRe-Technologies/gensim/pull/1137))
53-
* Fix dependencies description on doc2vec-IMDB notebook (@luizcavalcanti, [#1132](https://github.com/RaRe-Technologies/gensim/pull/1132))
54-
* Add documentation for WikiCorpus metadata. (@kirit93, [#1163](https://github.com/RaRe-Technologies/gensim/pull/1163))
117+
* Add note about Annoy speed depending on numpy BLAS setup in annoytutorial.ipynb (@greninja,[#1137](https://github.com/RaRe-Technologies/gensim/pull/1137))
118+
* Fix dependencies description on doc2vec-IMDB notebook (@luizcavalcanti, [#1132](https://github.com/RaRe-Technologies/gensim/pull/1132))
119+
* Add documentation for WikiCorpus metadata. (@kirit93, [#1163](https://github.com/RaRe-Technologies/gensim/pull/1163))
120+
55121

56-
57122
1.0.0RC2, 2017-02-16
58123

59-
* Add note about Annoy speed depending on numpy BLAS setup in annoytutorial.ipynb (@greninja,[#1137](https://github.com/RaRe-Technologies/gensim/pull/1137))
124+
* Add note about Annoy speed depending on numpy BLAS setup in annoytutorial.ipynb (@greninja,[#1137](https://github.com/RaRe-Technologies/gensim/pull/1137))
60125
* Remove direct access to properties moved to KeyedVectors (@tmylk,[#1147](https://github.com/RaRe-Technologies/gensim/pull/1147))
61126
* Remove support for Python 2.6, 3.3 and 3.4 (@tmylk,[#1145](https://github.com/RaRe-Technologies/gensim/pull/1145))
62127
* Write UTF-8 byte strings in tensorboard conversion (@tmylk,[#1144](https://github.com/RaRe-Technologies/gensim/pull/1144))
@@ -76,15 +141,15 @@ Improvements:
76141
* Ignore DocvecsArray.doctag_syn0norm in save. Fix #789 (@accraze,[#1053](https://github.com/RaRe-Technologies/gensim/pull/1053))
77142
* Move load and save word2vec_format out of word2vec class to KeyedVectors (@tmylk,[#1107](https://github.com/RaRe-Technologies/gensim/pull/1107))
78143
* Fix bug in LsiModel that occurs when id2word is a Python 3 dictionary. (@cvangysel,[#1103](https://github.com/RaRe-Technologies/gensim/pull/1103)
79-
* Fix broken link to paper in readme (@bhargavvader,[#1101](https://github.com/RaRe-Technologies/gensim/pull/1101))
80-
* Lazy formatting in evaluate_word_pairs (@akutuzov,[#1084](https://github.com/RaRe-Technologies/gensim/pull/1084))
144+
* Fix broken link to paper in readme (@bhargavvader,[#1101](https://github.com/RaRe-Technologies/gensim/pull/1101))
145+
* Lazy formatting in evaluate_word_pairs (@akutuzov,[#1084](https://github.com/RaRe-Technologies/gensim/pull/1084))
81146
* Deacc option to keywords pre-processing (@bhargavvader,[#1076](https://github.com/RaRe-Technologies/gensim/pull/1076))
82147

83148
Tutorial and doc improvements:
84149

85-
* Clarifying comment in is_corpus func in utils.py (@greninja,[#1109](https://github.com/RaRe-Technologies/gensim/pull/1109))
150+
* Clarifying comment in is_corpus func in utils.py (@greninja,[#1109](https://github.com/RaRe-Technologies/gensim/pull/1109))
86151
* Tutorial Topics_and_Transformations fix markdown and add references (@lgmoneda,[#1120](https://github.com/RaRe-Technologies/gensim/pull/1120))
87-
* Fix doc2vec-lee.ipynb results to match previous behavior (@bahbbc,[#1119](https://github.com/RaRe-Technologies/gensim/pull/1119))
152+
* Fix doc2vec-lee.ipynb results to match previous behavior (@bahbbc,[#1119](https://github.com/RaRe-Technologies/gensim/pull/1119))
88153
* Remove Pattern lib dependency in News Classification tutorial (@luizcavalcanti,[#1118](https://github.com/RaRe-Technologies/gensim/pull/1118))
89154
* Corpora_and_Vector_Spaces tutorial text clarification (@lgmoneda,[#1116](https://github.com/RaRe-Technologies/gensim/pull/1116))
90155
* Update Transformation and Topics link from quick start notebook (@mariana393,[#1115](https://github.com/RaRe-Technologies/gensim/pull/1115))
@@ -96,9 +161,9 @@ Tutorial and doc improvements:
96161
0.13.4.1, 2017-01-04
97162

98163
* Disable direct access warnings on save and load of Word2vec/Doc2vec (@tmylk, [#1072](https://github.com/RaRe-Technologies/gensim/pull/1072))
99-
* Making Default hs error explicit (@accraze, [#1054](https://github.com/RaRe-Technologies/gensim/pull/1054))
164+
* Making Default hs error explicit (@accraze, [#1054](https://github.com/RaRe-Technologies/gensim/pull/1054))
100165
* Removed unnecessary numpy imports (@bhargavvader, [#1065](https://github.com/RaRe-Technologies/gensim/pull/1065))
101-
* Utils and Matutils changes (@bhargavvader, [#1062](https://github.com/RaRe-Technologies/gensim/pull/1062))
166+
* Utils and Matutils changes (@bhargavvader, [#1062](https://github.com/RaRe-Technologies/gensim/pull/1062))
102167
* Tests for the evaluate_word_pairs function (@akutuzov, [#1061](https://github.com/RaRe-Technologies/gensim/pull/1061))
103168

104169
0.13.4, 2016-12-22
@@ -120,8 +185,8 @@ Tutorial and doc improvements:
120185
* Remove warning on gensim import "pattern not installed". Fix #1009 (@shashankg7, [#1018](https://github.com/RaRe-Technologies/gensim/pull/1018))
121186
* Add delete_temporary_training_data() function to word2vec and doc2vec models. (@deepmipt-VladZhukov, [#987](https://github.com/RaRe-Technologies/gensim/pull/987))
122187
* Documentation improvements (@IrinaGoloshchapova, [#1010](https://github.com/RaRe-Technologies/gensim/pull/1010), [#1011](https://github.com/RaRe-Technologies/gensim/pull/1011))
123-
* LDA tutorial by Olavur, tips and tricks (@olavurmortensen, [#779](https://github.com/RaRe-Technologies/gensim/pull/779))
124-
* Add double quote in commmand line to run on Windows (@akarazeev, [#1005](https://github.com/RaRe-Technologies/gensim/pull/1005))
188+
* LDA tutorial by Olavur, tips and tricks (@olavurmortensen, [#779](https://github.com/RaRe-Technologies/gensim/pull/779))
189+
* Add double quote in commmand line to run on Windows (@akarazeev, [#1005](https://github.com/RaRe-Technologies/gensim/pull/1005))
125190
* Fix directory names in notebooks to be OS-independent (@mamamot, [#1004](https://github.com/RaRe-Technologies/gensim/pull/1004))
126191
* Respect clip_start, clip_end in most_similar. Fix #601. (@parulsethi, [#994](https://github.com/RaRe-Technologies/gensim/pull/994))
127192
* Replace Python sigmoid function with scipy in word2vec & doc2vec (@markroxor, [#989](https://github.com/RaRe-Technologies/gensim/pull/989))

ISSUE_TEMPLATE.md

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,48 @@
1+
<!--
2+
If your issue is a usage or a general question, please submit it here instead:
3+
- Mailing List: https://groups.google.com/forum/#!forum/gensim
4+
For more information, see Recipes&FAQ: https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ
5+
-->
6+
7+
<!-- Instructions For Filing a Bug: https://github.com/RaRe-Technologies/gensim/blob/develop/CONTRIBUTING.md -->
8+
9+
#### Description
10+
TODO: change commented example
11+
<!-- Example: Vocabulary size is not what I expected when training Word2Vec. -->
12+
13+
#### Steps/Code/Corpus to Reproduce
14+
<!--
15+
Example:
16+
```
17+
from gensim.models import word2vec
18+
19+
sentences = ['human', 'machine']
20+
model = word2vec.Word2Vec(sentences)
21+
print(model.syn0.shape)
22+
```
23+
If the code is too long, feel free to put it in a public gist and link
24+
it in the issue: https://gist.github.com
25+
-->
26+
27+
#### Expected Results
28+
<!-- Example: Expected shape of (100,2).-->
29+
30+
#### Actual Results
31+
<!-- Example: Actual shape of (100,5).
32+
33+
Please paste or specifically describe the actual output or traceback. -->
34+
35+
#### Versions
36+
<!--
37+
Please run the following snippet and paste the output below.
38+
import platform; print(platform.platform())
39+
import sys; print("Python", sys.version)
40+
import numpy; print("NumPy", numpy.__version__)
41+
import scipy; print("SciPy", scipy.__version__)
42+
import gensim; print("gensim", gensim.__version__)
43+
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
44+
-->
45+
46+
47+
<!-- Thanks for contributing! -->
48+

appveyor.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -66,7 +66,7 @@ test_script:
6666
# installed library.
6767
- "mkdir empty_folder"
6868
- "cd empty_folder"
69-
- "pip install pyemd testfixtures unittest2 Morfessor==2.0.2a4"
69+
- "pip install pyemd testfixtures unittest2 sklearn Morfessor==2.0.2a4"
7070

7171
- "python -c \"import nose; nose.main()\" -s -v gensim"
7272
# Move back to the project folder

docs/notebooks/WordRank_wrapper_quickstart.ipynb

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,7 @@
1313
"\n",
1414
"# Train model\n",
1515
"\n",
16-
"We'll use [Lee corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee_background.cor) for training which is already available in gensim. Now for Wordrank, two parameters `dump_period` and `iter` needs to be in sync as it dumps the embedding file with the start of next iteration. For example, if you want results after 10 iterations, you need to use `iter=11` and `dump_period` can be anything that gives mod 0 with resulting iteration, in this case 2 or 5.\n"
16+
"We'll use [Lee corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee.cor) for training which is already available in gensim. Now for Wordrank, two parameters `dump_period` and `iter` needs to be in sync as it dumps the embedding file with the start of next iteration. For example, if you want results after 10 iterations, you need to use `iter=11` and `dump_period` can be anything that gives mod 0 with resulting iteration, in this case 2 or 5.\n"
1717
]
1818
},
1919
{

0 commit comments

Comments
 (0)