Merge branch 'release-0.10.2'
piskvorky committed Sep 18, 2014
2 parents 62c9237 + b53d632 commit afd70ff
Showing 31 changed files with 17,205 additions and 318 deletions.
15 changes: 14 additions & 1 deletion CHANGELOG.txt
@@ -1,8 +1,21 @@
Changes
=======

0.10.1
0.10.2, 18/09/2014

* new parallelized LDA implementation, LdaMulticore (Jan Zikes, #232)
* Dynamic Topic Models (DTM) wrapper (Arttii, #205)
* word2vec compiled from bundled C file at install time: no more pyximport (#233)
* standardize show_/print_topics in LdaMallet (Benjamin Bray, #223)
* add new word2vec multiplicative objective (3CosMul) of Levy & Goldberg (Gordon Mohr, #224)
* preserve case in MALLET wrapper (mcburton, #222)
* support for matrix-valued topic/word prior eta in LdaModel (mjwillson, #208)
* py3k fix to SparseCorpus (Andreas Madsen, #234)
* fix to LowCorpus when switching dictionaries (Christopher Corley, #237)

0.10.1, 22/07/2014

* word2vec: new n_similarity method for comparing two sets of words (François Scharffe, #219)
* make LDA print/show topics parameters consistent with LSI (Bram Vandekerckhove, #201)
* add option for efficient word2vec subsampling (Gordon Mohr, #206)
* fix length calculation for corpora on empty files (Christopher Corley, #209)
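The 3CosMul entry above refers to the new ``most_similar_cosmul`` method on ``Word2Vec``. A minimal sketch, assuming a toy corpus of tokenized sentences (the sentences and the ``min_count=1`` setting are only for illustration)::

    from gensim.models import Word2Vec

    sentences = [["king", "queen", "man", "woman"],
                 ["human", "interface", "computer"],
                 ["graph", "trees", "survey"]]
    model = Word2Vec(sentences, min_count=1)

    # classic additive analogy query
    print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

    # the multiplicative 3CosMul objective of Levy & Goldberg, new in 0.10.2
    print(model.most_similar_cosmul(positive=["king", "woman"], negative=["man"], topn=3))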
2 changes: 1 addition & 1 deletion MANIFEST.in
@@ -8,5 +8,5 @@ include COPYING
include COPYING.LESSER
include ez_setup.py
include gensim/models/voidptr.h
include gensim/models/word2vec_inner.c
include gensim/models/word2vec_inner.pyx
include gensim_addons/models/word2vec_inner.pyx
10 changes: 5 additions & 5 deletions README.rst
@@ -25,9 +25,9 @@ Features
* easy to plug in your own input corpus/datastream (trivial streaming API)
* easy to extend with other Vector Space algorithms (trivial transformation API)

* Efficient implementations of popular algorithms, such as online **Latent Semantic Analysis (LSA/LSI)**,
* Efficient multicore implementations of popular algorithms, such as online **Latent Semantic Analysis (LSA/LSI)**,
**Latent Dirichlet Allocation (LDA)**, **Random Projections (RP)**, **Hierarchical Dirichlet Process (HDP)** or **word2vec deep learning**.
* **Distributed computing**: can run *Latent Semantic Analysis* and *Latent Dirichlet Allocation* on a cluster of computers, and *word2vec* on multiple cores.
* **Distributed computing**: can run *Latent Semantic Analysis* and *Latent Dirichlet Allocation* on a cluster of computers.
* Extensive `HTML documentation and tutorials <http://radimrehurek.com/gensim/>`_.


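The "trivial streaming API" mentioned in the feature list above is simply iteration: any object that yields sparse bag-of-words vectors, one document at a time, can be fed to the models. A minimal sketch, assuming a plain-text file ``mycorpus.txt`` (a placeholder name) with one whitespace-tokenized document per line::

    from gensim import corpora

    class MyCorpus(object):
        """Stream documents from disk without loading the whole file into RAM."""
        def __init__(self, path, dictionary):
            self.path = path
            self.dictionary = dictionary

        def __iter__(self):
            for line in open(self.path):
                # one document per line; convert tokens to a sparse bag-of-words vector
                yield self.dictionary.doc2bow(line.lower().split())

    dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
    corpus = MyCorpus('mycorpus.txt', dictionary)  # nothing is loaded into memory yet
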
@@ -45,19 +45,19 @@ It is also recommended you install a fast BLAS library before installing NumPy.

The simple way to install `gensim` is::

sudo easy_install gensim
pip install -U gensim

Or, if you have instead downloaded and unzipped the `source tar.gz <http://pypi.python.org/pypi/gensim>`_ package,
you'll need to run::

python setup.py test
sudo python setup.py install
python setup.py install


For alternative modes of installation (without root privileges, development
installation, optional install features), see the `documentation <http://radimrehurek.com/gensim/install.html>`_.

This version has been tested under Python 2.6, 2.7 and 3.3. Gensim's github repo is hooked to `Travis CI for automated testing <https://travis-ci.org/piskvorky/gensim>`_ on every commit push and pull request.
This version has been tested under Python 2.6, 2.7, 3.3 and 3.4 (support for Python 2.5 was dropped in gensim 0.10.0; install gensim 0.9.1 if you *must* use Python 2.5). Gensim's github repo is hooked to `Travis CI for automated testing <https://travis-ci.org/piskvorky/gensim>`_ on every commit push and pull request.

How come gensim is so fast and memory efficient? Isn't it pure Python, and isn't Python slow and greedy?
--------------------------------------------------------------------------------------------------------
10 changes: 5 additions & 5 deletions docs/src/about.rst
@@ -8,7 +8,7 @@ History
--------

Gensim started off as a collection of various Python scripts for the Czech Digital Mathematics Library `dml.cz <http://dml.cz/>`_ in 2008,
where it served to generate a short list of the most similar articles to a given article (gensim = "generate similar").
where it served to generate a short list of the most similar articles to a given article (**gensim = "generate similar"**).
I also wanted to try these fancy "Latent Semantic Methods", but the libraries that
realized the necessary computation were `not much fun to work with <http://soi.stanford.edu/~rmunk/PROPACK/>`_.

@@ -39,9 +39,9 @@ the source code of these modifications.
Apart from that, you are free to redistribute gensim in any way you like, though you're
not allowed to modify its license (doh!).

My intent here is, of course, to get more help and community involvement with the development of gensim.
My intent here is, of course, to **get more help and community involvement** with the development of gensim.
The legalese is therefore less important to me than your input and contributions.
Contact me if LGPL doesn't fit your bill but you'd still like to use it -- we'll work something out.
Contact me if LGPL doesn't fit your bill but you'd still like to use gensim -- we'll work something out.

.. seealso::

@@ -56,7 +56,7 @@ Contributors
--------------

Credit goes to all the people who contributed to gensim, be it in `discussions <http://groups.google.com/group/gensim>`_,
ideas, `code contributions <https://github.com/piskvorky/gensim/pulls>`_ or bug reports.
ideas, `code contributions <https://github.com/piskvorky/gensim/pulls>`_ or `bug reports <https://github.com/piskvorky/gensim/issues>`_.
It's really useful and motivating to get feedback, in any shape or form, so big thanks to you all!

Some honorable mentions are included in the `CHANGELOG.txt <https://github.com/piskvorky/gensim/blob/develop/CHANGELOG.txt>`_.
@@ -65,7 +65,7 @@ Some honorable mentions are included in the `CHANGELOG.txt <https://github.com/p
Academic citing
----------------

Gensim has been used in many students' final theses as well as research papers. When citing gensim,
Gensim has been used in `many students' final theses as well as research papers <http://scholar.google.cz/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:u-x6o8ySG0sC>`_. When citing gensim,
please use `this BibTeX entry <bibtex_gensim.bib>`_::

@inproceedings{rehurek_lrec,
2 changes: 2 additions & 0 deletions docs/src/apiref.rst
@@ -22,6 +22,7 @@ Modules:
corpora/ucicorpus
corpora/indexedcorpus
models/ldamodel
models/ldamulticore
models/ldamallet
models/lsimodel
models/tfidfmodel
@@ -33,6 +34,7 @@ Modules:
models/lda_dispatcher
models/lda_worker
models/word2vec
models/dtmmodel
similarities/docsim
similarities/simserver

4 changes: 2 additions & 2 deletions docs/src/conf.py
@@ -52,9 +52,9 @@
# built documents.
#
# The short X.Y version.
version = '0.10.1'
version = '0.10.2'
# The full version, including alpha/beta/rc tags.
release = '0.10.1'
release = '0.10.2'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
7 changes: 7 additions & 0 deletions docs/src/models/dtmmodel.rst
@@ -0,0 +1,7 @@
:mod:`models.dtmmodel` -- Dynamic Topic Models (DTM) and Dynamic Influence Models (DIM)
=======================================================================================

.. automodule:: gensim.models.dtmmodel
:synopsis: Dynamic Topic Models
:members:
:inherited-members:
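
A hedged sketch of driving the new DTM wrapper documented above. It assumes you have compiled Blei's DTM binary yourself; the binary path is a placeholder, and the keyword names (``time_slices``, ``num_topics``, ``id2word``) are assumptions based on the wrapper's description rather than a verified signature, so check the generated API docs before relying on them::

    from gensim import corpora
    from gensim.models import DtmModel

    texts = [["economy", "bank"], ["bank", "crisis"],
             ["economy", "growth"], ["growth", "crisis"]]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # '/path/to/dtm-binary' is a placeholder; the first two documents fall into
    # time slice one, the last two into time slice two (assumed keyword: time_slices)
    model = DtmModel('/path/to/dtm-binary', corpus, time_slices=[2, 2],
                     num_topics=2, id2word=dictionary)
    # per-slice topics can then be inspected via the model's show/print topic helpers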
7 changes: 7 additions & 0 deletions docs/src/models/ldamulticore.rst
@@ -0,0 +1,7 @@
:mod:`models.ldamulticore` -- parallelized Latent Dirichlet Allocation
======================================================================

.. automodule:: gensim.models.ldamulticore
:synopsis: Latent Dirichlet Allocation
:members:
:inherited-members:
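
A minimal sketch of training the parallelized model documented above, assuming a bag-of-words corpus and a ``Dictionary`` built as in the tutorials; the ``workers=3`` and ``passes=10`` values are arbitrary examples, not recommendations::

    from gensim import corpora
    from gensim.models import LdaMulticore

    texts = [["human", "interface", "computer"],
             ["survey", "user", "computer", "system"],
             ["graph", "trees", "minors"]]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # spreads training work over worker processes; the API otherwise mirrors LdaModel
    lda = LdaMulticore(corpus, id2word=dictionary, num_topics=2, workers=3, passes=10)
    for topic in lda.show_topics(num_topics=2, num_words=5):
        print(topic)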
10 changes: 9 additions & 1 deletion gensim/corpora/lowcorpus.py
@@ -81,7 +81,6 @@ def __init__(self, fname, id2word=None, line2words=split_on_space):
else:
logger.info("using provided word mapping (%i ids)" % len(id2word))
self.id2word = id2word
self.word2id = dict((v, k) for k, v in iteritems(self.id2word))
self.num_terms = len(self.word2id)
self.use_wordids = True # return documents as (wordIndex, wordCount) 2-tuples

@@ -179,4 +178,13 @@ def docbyoffset(self, offset):
f.seek(offset)
return self.line2doc(f.readline())

@property
def id2word(self):
return self._id2word

@id2word.setter
def id2word(self, val):
self._id2word = val
self.word2id = dict((v, k) for k, v in iteritems(val))

# endclass LowCorpus
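
The change above turns ``id2word`` into a managed property so that the reverse ``word2id`` mapping is rebuilt on every assignment, which is what fixes the dictionary-switching bug noted in the changelog. A standalone sketch of the same pattern (plain ``dict.items()`` is used here in place of gensim's ``iteritems`` import)::

    class MappingHolder(object):
        @property
        def id2word(self):
            return self._id2word

        @id2word.setter
        def id2word(self, val):
            self._id2word = val
            self.word2id = dict((v, k) for k, v in val.items())

    holder = MappingHolder()
    holder.id2word = {0: 'human', 1: 'computer'}
    assert holder.word2id == {'human': 0, 'computer': 1}

    holder.id2word = {0: 'graph', 1: 'trees'}           # switch dictionaries ...
    assert holder.word2id == {'graph': 0, 'trees': 1}   # ... and word2id follows automatically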
6 changes: 2 additions & 4 deletions gensim/corpora/wikicorpus.py
@@ -184,10 +184,8 @@ def extract_pages(f, filter_namespaces=False):
"""
Extract pages from MediaWiki database dump.
Returns
-------
pages : iterable over (str, str)
Generates (title, content) pairs.
Return an iterable over (str, str) which generates (title, content) pairs.
"""
elems = (elem for _, elem in iterparse(f, events=("end",)))

2 changes: 1 addition & 1 deletion gensim/matutils.py
@@ -307,7 +307,7 @@ def __init__(self, sparse, documents_columns=True):

def __iter__(self):
for indprev, indnow in izip(self.sparse.indptr, self.sparse.indptr[1:]):
yield zip(self.sparse.indices[indprev:indnow], self.sparse.data[indprev:indnow])
yield list(zip(self.sparse.indices[indprev:indnow], self.sparse.data[indprev:indnow]))

def __len__(self):
return self.sparse.shape[1]
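
The ``list(...)`` wrapper added above matters because ``zip`` returns a lazy, single-pass iterator under Python 3 (under Python 2 it already returns a list), so yielding the raw object would hand callers something that has no ``len()`` and goes empty after one traversal. A short illustration of the difference::

    lazy = zip([0, 2, 5], [1.0, 0.5, 0.25])           # iterator on Python 3
    eager = list(zip([0, 2, 5], [1.0, 0.5, 0.25]))    # plain list on both 2 and 3

    print(list(lazy))    # consumes the iterator ...
    print(list(lazy))    # ... so this prints [] on Python 3
    print(eager)         # the list survives repeated iteration
    print(eager[0])      # and supports indexing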
2 changes: 2 additions & 0 deletions gensim/models/__init__.py
@@ -12,6 +12,8 @@
from .rpmodel import RpModel
from .logentropy_model import LogEntropyModel
from .word2vec import Word2Vec
from .ldamulticore import LdaMulticore
from .dtmmodel import DtmModel

from gensim import interfaces, utils
