Merge branch 'release-0.8.4'
piskvorky committed Mar 9, 2012
2 parents 440cf62 + e279cc9 commit 3f99dae
Showing 76 changed files with 2,428 additions and 395 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.txt
@@ -1,6 +1,15 @@
Changes
=======

0.8.4

* better support for Pandas series input (thx to JT Bates)
* a new corpus format: UCI bag-of-words (thx to Jonathan Esterhazy)
* a new non-parametric Bayesian model: HDP, the Hierarchical Dirichlet Process (thx to Jonathan Esterhazy; based on Chong Wang's code)
* improved support for new scipy versions (thx to Skipper Seabold)
* lemmatizer support for wikipedia parsing (via the `pattern` python package)
* extended the lemmatizer for multi-core processing, to improve its performance

0.8.3

* fixed Similarity sharding bug (issue #65, thx to Paul Rudin)
3 changes: 3 additions & 0 deletions docs/_sources/apiref.txt
@@ -18,15 +18,18 @@ Modules:
corpora/svmlightcorpus
corpora/wikicorpus
corpora/textcorpus
corpora/ucicorpus
corpora/indexedcorpus
models/ldamodel
models/lsimodel
models/tfidfmodel
models/rpmodel
models/hdpmodel
models/logentropy_model
models/lsi_dispatcher
models/lsi_worker
models/lda_dispatcher
models/lda_worker
similarities/docsim
similarities/simserver

8 changes: 8 additions & 0 deletions docs/_sources/corpora/ucicorpus.txt
@@ -0,0 +1,8 @@
:mod:`corpora.ucicorpus` -- Corpus in UCI bag-of-words format
==============================================================================================================

.. automodule:: gensim.corpora.ucicorpus
    :synopsis: Corpus in University of California, Irvine (UCI) bag-of-words format
    :members:
    :inherited-members:
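
For orientation, the UCI bag-of-words layout can be read with a few lines of plain Python. The sketch below is not the `gensim.corpora.ucicorpus` implementation. It assumes the layout published with the UCI "Bag of Words" data set: three header lines (number of documents, vocabulary size, number of non-zero entries) followed by 1-based `docID wordID count` triples. The function name `read_uci_docword` and the `SAMPLE` data are hypothetical, chosen only for illustration.

```python
def read_uci_docword(lines):
    """Parse UCI bag-of-words 'docword' lines into per-document sparse vectors.

    Assumed layout (an assumption of this sketch, not taken from gensim):
      line 1: number of documents
      line 2: vocabulary size
      line 3: number of non-zero (doc, word) entries
      rest:   'docID wordID count' triples, with 1-based ids
    Returns (vocabulary_size, docs), where docs[i] is a list of
    (zero_based_word_id, count) pairs.
    """
    it = iter(lines)
    num_docs = int(next(it))
    num_terms = int(next(it))
    num_nnz = int(next(it))

    docs = [[] for _ in range(num_docs)]
    seen = 0
    for line in it:
        if not line.strip():
            continue  # tolerate blank lines
        doc_id, word_id, count = (int(field) for field in line.split())
        # shift the 1-based ids of the file to 0-based Python indices
        docs[doc_id - 1].append((word_id - 1, count))
        seen += 1

    if seen != num_nnz:
        raise ValueError("header announces %d entries, found %d" % (num_nnz, seen))
    return num_terms, docs


# Hypothetical toy input: 3 documents, 4 vocabulary terms, 5 non-zero entries.
SAMPLE = ["3", "4", "5", "1 1 2", "1 3 1", "2 2 1", "3 1 1", "3 4 3"]
```

Each parsed document is a list of `(word_id, count)` pairs, which is the same sparse bag-of-words shape that gensim corpora yield.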

3 changes: 2 additions & 1 deletion docs/_sources/index.txt
@@ -45,8 +45,9 @@ Quick Reference Example

.. admonition:: What's new?

* 9 Mar 2012: release 0.8.4: new model `Hierarchical Dirichlet Process <http://radimrehurek.com/gensim/models/hdpmodel.html>`_ (full `CHANGELOG <https://github.com/piskvorky/gensim/blob/develop/CHANGELOG.txt>`_)
* 2 Dec 2011: bug-fix release 0.8.3 out; `CHANGELOG <https://github.com/piskvorky/gensim/blob/develop/CHANGELOG.txt>`_
* 1 Dec 2011: released `simserver <http://pypi.python.org/pypi/simserver>`_, a document similarity server based on gensim
* 1 Dec 2011: released `simserver <http://pypi.python.org/pypi/simserver>`_, a Python document similarity server based on gensim



8 changes: 8 additions & 0 deletions docs/_sources/models/hdpmodel.txt
@@ -0,0 +1,8 @@
:mod:`models.hdpmodel` -- Hierarchical Dirichlet Process
========================================================

.. automodule:: gensim.models.hdpmodel
    :synopsis: Hierarchical Dirichlet Process
    :members:
    :inherited-members:

6 changes: 3 additions & 3 deletions docs/_sources/similarities/simserver.txt
@@ -1,7 +1,7 @@
:mod:`similarities.simserver` -- Document similarity server
========================================================================
:mod:`simserver` -- Document similarity server
======================================================

.. automodule:: gensim.similarities.simserver
.. automodule:: simserver.simserver
:synopsis: Document similarity server
:members:
:inherited-members:
20 changes: 11 additions & 9 deletions docs/_sources/simserver.txt
@@ -3,10 +3,12 @@
Document Similarity Server
=============================


The 0.7.x series of `gensim <http://radimrehurek.com/gensim/>`_ was about improving performance and consolidating the API.
0.8.x will be about new features --- 0.8.1, first of the series, is a **document similarity service**.

The source code itself has been moved from gensim to its own, dedicated package, named `simserver`.
Get it from `PyPI <http://pypi.python.org/pypi/simserver>`_ or clone it on `Github <https://github.com/piskvorky/gensim-simserver>`_.

What is a document similarity service?
---------------------------------------

@@ -17,7 +19,7 @@ Conceptually, a service that lets you :
3. query the index for similar documents (the query can be either an id of a document already in the index, or an arbitrary text)


>>> from gensim.similarities.simserver import SessionServer
>>> from simserver import SessionServer
>>> server = SessionServer('/tmp/my_server') # resume server (or create a new one)

>>> server.train(training_corpus, method='lsi') # create a semantic model
@@ -130,7 +132,7 @@ recommended client splits them into smaller chunks before uploading them to the
Wait, upload what, where?
-------------------------

If you use the similarity service object (instance of :class:`gensim.similarities.simserver.SessionServer`) in
If you use the similarity service object (instance of :class:`simserver.SessionServer`) in
your code directly---no remote access---that's perfectly fine. Using the service remotely, from a different process/machine, is an
option, not a necessity.

@@ -140,7 +142,7 @@ case, I'll call the service object a *server*.
But let's start with a local object. Open your `favourite shell <http://ipython.org/>`_ and::

>>> from gensim import utils
>>> from gensim.similarities.simserver import SessionServer
>>> from simserver import SessionServer
>>> service = SessionServer('/tmp/my_server/') # or wherever

That initialized a new service, located in `/tmp/my_server` (you need write access rights to that directory).
@@ -238,11 +240,11 @@ a pure Python package for Remote Procedure Calls (RPC), so I'll illustrate remot
service access via Pyro. Pyro takes care of all the socket listening/request routing/data marshalling/thread
spawning, so it saves us a lot of trouble.

To create a similarity server, we just create a :class:`gensim.similarities.simserver.SessionServer` object and register it
with a Pyro daemon for remote access. There is a small `example script <https://github.com/piskvorky/gensim/blob/simserver/gensim/test/run_simserver.py>`_
included with gensim, run it with::
To create a similarity server, we just create a :class:`simserver.SessionServer` object and register it
with a Pyro daemon for remote access. There is a small `example script <https://github.com/piskvorky/gensim-simserver/blob/master/simserver/run_simserver.py>`_
included with simserver, run it with::

$ python -m gensim.test.run_simserver /tmp/testserver
$ python -m simserver.run_simserver /tmp/testserver

You can just `ctrl+c` to terminate the server, but leave it running for now.

@@ -324,5 +326,5 @@ Other stuff
------------

TODO Custom document parsing (in lieu of `utils.simple_preprocess`). Different models (not just `lsi`). Optimizing the index with `service.optimize()`.
TODO add some hard numbers; example tutorial for some bigger collection, e.g. arxiv.org or wikipedia.
TODO add some hard numbers; example tutorial for some bigger collection, e.g. for `arxiv.org <http://aura.fi.muni.cz:8080/>`_ or wikipedia.

13 changes: 12 additions & 1 deletion docs/_sources/tut2.txt
@@ -212,8 +212,17 @@ Gensim implements several popular Vector Space Model algorithms:
`gensim` uses a fast implementation of online LDA parameter estimation based on [2]_,
modified to run in :doc:`distributed mode <distributed>` on a cluster of computers.

* `Hierarchical Dirichlet Process, HDP <http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf>`_
is a non-parametric Bayesian method (note that no number of topics is requested):

>>> model = hdpmodel.HdpModel(bow_corpus, id2word=dictionary)

`gensim` uses a fast, online implementation based on [3]_.
The HDP model is a new addition to `gensim`, and still rough around its academic edges -- use with care.
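
To see why no topic count is requested, it may help to look at a stick-breaking draw, a standard construction for Dirichlet process weights. The sketch below is an illustration only and is not gensim's `HdpModel` code; `stick_breaking_weights`, `gamma`, `tolerance` and `max_topics` are hypothetical names chosen here. A unit-length stick is broken repeatedly, each piece becomes one topic weight, and the loop stops once the leftover mass is negligible, so the number of topics falls out of the draw instead of being an input.

```python
import random


def stick_breaking_weights(gamma, tolerance=1e-3, max_topics=1000, seed=0):
    """Draw topic weights by stick-breaking: v_k ~ Beta(1, gamma).

    The k-th weight is v_k times the stick length left over after the
    first k-1 breaks. Returns (weights, remaining_mass).
    """
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    weights = []
    remaining = 1.0
    while remaining > tolerance and len(weights) < max_topics:
        fraction = rng.betavariate(1.0, gamma)  # share of the remaining stick
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


# A larger gamma tends to spread the mass over more, smaller pieces.
few_topics, _ = stick_breaking_weights(gamma=1.0)
many_topics, _ = stick_breaking_weights(gamma=10.0)
```

In an HDP model the data decide how many of these weights carry meaningful mass, which is what "non-parametric" refers to in the bullet above.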

Adding new :abbr:`VSM (Vector Space Model)` transformations (such as different weighting schemes) is rather trivial;
see the :doc:`API reference <apiref>` or directly the Python code for more info and examples.
see the :doc:`API reference <apiref>` or directly the `Python code <https://github.com/piskvorky/gensim/blob/develop/gensim/models/tfidfmodel.py>`_
for more info and examples.

It is worth repeating that these are all unique, **incremental** implementations,
which do not require the whole training corpus to be present in main memory all at once.
@@ -230,6 +239,8 @@ Continue on to the next tutorial on :doc:`tut3`.

.. [2] Hoffman, Blei, Bach. 2010. Online learning for Latent Dirichlet Allocation.

.. [3] Wang, Paisley, Blei. 2011. Online variational inference for the hierarchical Dirichlet process.

.. [4] Halko, Martinsson, Tropp. 2009. Finding structure with randomness.

.. [5] Řehůřek. 2011. Subspace tracking for Latent Semantic Analysis.
6 changes: 3 additions & 3 deletions docs/about.html
@@ -16,7 +16,7 @@
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: '',
VERSION: '0.8.3',
VERSION: '0.8.4',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true
@@ -207,8 +207,8 @@ <h3>Navigation</h3>


<div class="footer">
&copy; Copyright 2011, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Dec 02, 2011.
&copy; Copyright 2012, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Mar 09, 2012.
</div>
</body>
</html>
9 changes: 6 additions & 3 deletions docs/apiref.html
@@ -16,7 +16,7 @@
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: '',
VERSION: '0.8.3',
VERSION: '0.8.4',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true
@@ -126,11 +126,13 @@ <h3>Quick search</h3>
<li class="toctree-l1"><a class="reference internal" href="corpora/svmlightcorpus.html"><tt class="docutils literal"><span class="pre">corpora.svmlightcorpus</span></tt> &#8211; Corpus in SVMlight format</a></li>
<li class="toctree-l1"><a class="reference internal" href="corpora/wikicorpus.html"><tt class="docutils literal"><span class="pre">corpora.wikicorpus</span></tt> &#8211; Corpus from a Wikipedia dump</a></li>
<li class="toctree-l1"><a class="reference internal" href="corpora/textcorpus.html"><tt class="docutils literal"><span class="pre">corpora.textcorpus</span></tt> &#8211; Building corpora with dictionaries</a></li>
<li class="toctree-l1"><a class="reference internal" href="corpora/ucicorpus.html"><tt class="docutils literal"><span class="pre">corpora.ucicorpus</span></tt> &#8211; Corpus in UCI bag-of-words format</a></li>
<li class="toctree-l1"><a class="reference internal" href="corpora/indexedcorpus.html"><tt class="docutils literal"><span class="pre">corpora.indexedcorpus</span></tt> &#8211; Random access to corpus documents</a></li>
<li class="toctree-l1"><a class="reference internal" href="models/ldamodel.html"><tt class="docutils literal"><span class="pre">models.ldamodel</span></tt> &#8211; Latent Dirichlet Allocation</a></li>
<li class="toctree-l1"><a class="reference internal" href="models/lsimodel.html"><tt class="docutils literal"><span class="pre">models.lsimodel</span></tt> &#8211; Latent Semantic Indexing</a></li>
<li class="toctree-l1"><a class="reference internal" href="models/tfidfmodel.html"><tt class="docutils literal"><span class="pre">models.tfidfmodel</span></tt> &#8211; TF-IDF model</a></li>
<li class="toctree-l1"><a class="reference internal" href="models/rpmodel.html"><tt class="docutils literal"><span class="pre">models.rpmodel</span></tt> &#8211; Random Projections</a></li>
<li class="toctree-l1"><a class="reference internal" href="models/hdpmodel.html"><tt class="docutils literal"><span class="pre">models.hdpmodel</span></tt> &#8211; Hierarchical Dirichlet Process</a></li>
<li class="toctree-l1"><a class="reference internal" href="models/logentropy_model.html"><tt class="docutils literal"><span class="pre">models.logentropy_model</span></tt> &#8211; LogEntropy model</a></li>
<li class="toctree-l1"><a class="reference internal" href="models/lsi_dispatcher.html"><tt class="docutils literal"><span class="pre">models.lsi_dispatcher</span></tt> &#8211; Dispatcher for distributed LSI</a></li>
<li class="toctree-l1"><a class="reference internal" href="models/lsi_worker.html"><tt class="docutils literal"><span class="pre">models.lsi_worker</span></tt> &#8211; Worker for distributed LSI</a></li>
@@ -140,6 +142,7 @@ <h3>Quick search</h3>
<li class="toctree-l2"><a class="reference internal" href="similarities/docsim.html#how-it-works">How It Works</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="similarities/simserver.html"><tt class="docutils literal"><span class="pre">simserver</span></tt> &#8211; Document similarity server</a></li>
</ul>
</div>
</div>
@@ -178,8 +181,8 @@ <h3>Navigation</h3>


<div class="footer">
&copy; Copyright 2011, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Dec 02, 2011.
&copy; Copyright 2012, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Mar 09, 2012.
</div>
</body>
</html>
6 changes: 3 additions & 3 deletions docs/changes_080.html
@@ -16,7 +16,7 @@
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: '',
VERSION: '0.8.3',
VERSION: '0.8.4',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true
@@ -208,8 +208,8 @@ <h3>Navigation</h3>


<div class="footer">
&copy; Copyright 2011, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Dec 02, 2011.
&copy; Copyright 2012, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Mar 09, 2012.
</div>
</body>
</html>
6 changes: 3 additions & 3 deletions docs/corpora/bleicorpus.html
@@ -16,7 +16,7 @@
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: '../',
VERSION: '0.8.3',
VERSION: '0.8.4',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true
@@ -224,8 +224,8 @@ <h3>Navigation</h3>


<div class="footer">
&copy; Copyright 2011, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Dec 02, 2011.
&copy; Copyright 2012, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Mar 09, 2012.
</div>
</body>
</html>
6 changes: 3 additions & 3 deletions docs/corpora/corpora.html
@@ -16,7 +16,7 @@
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: '../',
VERSION: '0.8.3',
VERSION: '0.8.4',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true
@@ -130,8 +130,8 @@ <h3>Navigation</h3>


<div class="footer">
&copy; Copyright 2011, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Dec 02, 2011.
&copy; Copyright 2012, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Mar 09, 2012.
</div>
</body>
</html>
6 changes: 3 additions & 3 deletions docs/corpora/dictionary.html
@@ -16,7 +16,7 @@
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: '../',
VERSION: '0.8.3',
VERSION: '0.8.4',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true
@@ -259,8 +259,8 @@ <h3>Navigation</h3>


<div class="footer">
&copy; Copyright 2011, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Dec 02, 2011.
&copy; Copyright 2012, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Mar 09, 2012.
</div>
</body>
</html>
16 changes: 8 additions & 8 deletions docs/corpora/indexedcorpus.html
@@ -16,7 +16,7 @@
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: '../',
VERSION: '0.8.3',
VERSION: '0.8.4',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true
@@ -29,7 +29,7 @@
<link rel="top" title="gensim" href="../index.html" />
<link rel="up" title="API Reference" href="../apiref.html" />
<link rel="next" title="models.ldamodel – Latent Dirichlet Allocation" href="../models/ldamodel.html" />
<link rel="prev" title="corpora.textcorpus – Building corpora with dictionaries" href="textcorpus.html" />
<link rel="prev" title="corpora.ucicorpus – Corpus in UCI bag-of-words format" href="ucicorpus.html" />


<!-- twitter search widget
@@ -68,7 +68,7 @@ <h3>Navigation</h3>
<a href="../models/ldamodel.html" title="models.ldamodel – Latent Dirichlet Allocation"
accesskey="N">next</a> |</li>
<li class="right" >
<a href="textcorpus.html" title="corpora.textcorpus – Building corpora with dictionaries"
<a href="ucicorpus.html" title="corpora.ucicorpus – Corpus in UCI bag-of-words format"
accesskey="P">previous</a> |</li>
<li><a href="../index.html">Gensim home</a>|&nbsp;</li>
<li><a href="../tutorial.html">Tutorials</a>|&nbsp;</li>
@@ -84,8 +84,8 @@ <h3>Navigation</h3>
<div class="sphinxsidebar">
<div class="sphinxsidebarwrapper">
<h4>Previous topic</h4>
<p class="topless"><a href="textcorpus.html"
title="previous chapter"><tt class="docutils literal"><span class="pre">corpora.textcorpus</span></tt> &#8211; Building corpora with dictionaries</a></p>
<p class="topless"><a href="ucicorpus.html"
title="previous chapter"><tt class="docutils literal"><span class="pre">corpora.ucicorpus</span></tt> &#8211; Corpus in UCI bag-of-words format</a></p>
<h4>Next topic</h4>
<p class="topless"><a href="../models/ldamodel.html"
title="next chapter"><tt class="docutils literal"><span class="pre">models.ldamodel</span></tt> &#8211; Latent Dirichlet Allocation</a></p>
@@ -222,7 +222,7 @@ <h3>Navigation</h3>
<a href="../models/ldamodel.html" title="models.ldamodel – Latent Dirichlet Allocation"
>next</a> |</li>
<li class="right" >
<a href="textcorpus.html" title="corpora.textcorpus – Building corpora with dictionaries"
<a href="ucicorpus.html" title="corpora.ucicorpus – Corpus in UCI bag-of-words format"
>previous</a> |</li>
<li><a href="../index.html">Gensim home</a>|&nbsp;</li>
<li><a href="../tutorial.html">Tutorials</a>|&nbsp;</li>
@@ -236,8 +236,8 @@ <h3>Navigation</h3>


<div class="footer">
&copy; Copyright 2011, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Dec 02, 2011.
&copy; Copyright 2012, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Mar 09, 2012.
</div>
</body>
</html>