Merge branch 'release-0.9.1'
piskvorky committed Apr 12, 2014
2 parents 95b1034 + 98d32d9 commit 3520fa3
Showing 31 changed files with 315 additions and 111 deletions.
12 changes: 11 additions & 1 deletion CHANGELOG.txt
@@ -1,8 +1,19 @@
Changes
=======

0.9.1, 12/04/2014

* MmCorpus fix for Windows
* LdaMallet support for printing/showing topics
* fix LdaMallet bug when user specified a file prefix (Victor, #184)
* fix LdaMallet output when input is single vector (Suvir)
* added LdaMallet unit tests
* more py3k fixes (Lars Buitinck)
* change order of LDA topic printing (Fayimora Femi-Balogun, #188)

0.9.0, 16/03/2014

* save/load automatically single out large arrays + allow mmap
* allow .gz/.bz2 corpus filenames => transparently (de)compressed I/O
* CBOW model for word2vec (Sébastien Jean, #176)
* new API for storing corpus metadata (Joseph Chang, #169)
@@ -11,7 +22,6 @@ Changes
* better Wikipedia article parsing (Joseph Chang, #170)
* word2vec load_word2vec_format uses less memory (Yves Raimond, #164)
* load/store vocabulary files for word2vec C format (Yves Raimond, #172)
* save/load automatically single out large arrays + allow mmap
* HDP estimation on new documents (Elliot Kulakow, #153)
* store labels in SvmLight corpus (Ritesh, #152)
* fix word2vec binary load on Windows (Stephanus van Schalkwyk)
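
The LDA topic-printing entries above touch `LdaModel.print_topics` / `show_topics`; a quick, hedged illustration on made-up toy data (the texts are not part of this commit)::

    from gensim.corpora.dictionary import Dictionary
    from gensim.models.ldamodel import LdaModel

    texts = [["human", "computer", "interface"],
             ["graph", "trees", "minors"],
             ["human", "graph", "trees"]]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # train a tiny model and print its topics
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)
    lda.print_topics(2)   # logs the topics; show_topics() returns them as strings instead
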
14 changes: 10 additions & 4 deletions README.rst
@@ -1,11 +1,17 @@
==============================================
gensim -- Python Framework for Topic Modelling
gensim -- Topic Modelling in Python
==============================================

|Travis|_
|Downloads|_
|License|_

.. |Travis| image:: https://api.travis-ci.org/piskvorky/gensim.png?branch=develop
.. |Downloads| image:: https://pypip.in/d/gensim/badge.png
.. |License| image:: https://pypip.in/license/gensim/badge.png
.. _Travis: https://travis-ci.org/piskvorky/gensim
.. _Downloads: https://pypi.python.org/pypi/gensim
.. _License: http://radimrehurek.com/gensim/about.html

Gensim is a Python library for *topic modelling*, *document indexing* and *similarity retrieval* with large corpora.
Target audience is the *natural language processing* (NLP) and *information retrieval* (IR) community.
@@ -19,8 +25,8 @@ Features
* easy to plug in your own input corpus/datastream (trivial streaming API)
* easy to extend with other Vector Space algorithms (trivial transformation API)

* Efficient implementations of popular algorithms, such as online **Latent Semantic Analysis**,
**Latent Dirichlet Allocation**, **Random Projections** or **word2vec deep learning**.
* Efficient implementations of popular algorithms, such as online **Latent Semantic Analysis (LSA/LSI)**,
**Latent Dirichlet Allocation (LDA)**, **Random Projections (RP)**, **Hierarchical Dirichlet Process (HDP)** or **word2vec deep learning**.
* **Distributed computing**: can run *Latent Semantic Analysis* and *Latent Dirichlet Allocation* on a cluster of computers, and *word2vec* on multiple cores.
* Extensive `HTML documentation and tutorials <http://radimrehurek.com/gensim/>`_.

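A minimal sketch of the "trivial streaming API" bullet above: any object whose `__iter__` yields sparse bag-of-words vectors can act as a corpus. The file name and tokenization are illustrative, not part of this commit::

    from gensim.corpora.dictionary import Dictionary

    class MyCorpus(object):
        """Stream documents one at a time; the collection never sits in RAM as a whole."""
        def __init__(self, fname, dictionary):
            self.fname = fname              # plain-text file, one document per line (assumed)
            self.dictionary = dictionary    # gensim Dictionary mapping words to integer ids

        def __iter__(self):
            for line in open(self.fname):
                yield self.dictionary.doc2bow(line.lower().split())

    # usage: dictionary = Dictionary(line.lower().split() for line in open('mycorpus.txt'))
    #        corpus = MyCorpus('mycorpus.txt', dictionary)
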
@@ -35,7 +41,7 @@ Installation
This software depends on `NumPy and Scipy <http://www.scipy.org/Download>`_, two Python packages for scientific computing.
You must have them installed prior to installing `gensim`.

It is also recommended you install a fast BLAS library prior to installing NumPy. This is optional, but using an optimized BLAS such as `ATLAS <http://math-atlas.sourceforge.net/>`_ or `OpenBLAS <http://xianyi.github.io/OpenBLAS/>`_ is known to improve performance by as much as an order of magnitude.
It is also recommended you install a fast BLAS library before installing NumPy. This is optional, but using an optimized BLAS such as `ATLAS <http://math-atlas.sourceforge.net/>`_ or `OpenBLAS <http://xianyi.github.io/OpenBLAS/>`_ is known to improve performance by as much as an order of magnitude. On OS X, NumPy picks up the BLAS that comes with it automatically, so you don't need to do anything special.

The simple way to install `gensim` is::

1 change: 1 addition & 0 deletions docs/src/apiref.rst
@@ -22,6 +22,7 @@ Modules:
corpora/ucicorpus
corpora/indexedcorpus
models/ldamodel
models/ldamallet
models/lsimodel
models/tfidfmodel
models/rpmodel
6 changes: 3 additions & 3 deletions docs/src/conf.py
@@ -45,16 +45,16 @@

# General information about the project.
project = u'gensim'
copyright = u'2009-2014, Radim Řehůřek <radimrehurek(at)seznam.cz>'
copyright = u'2009-now, Radim Řehůřek <radimrehurek(at)seznam.cz>'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '0.9.0'
version = '0.9.1'
# The full version, including alpha/beta/rc tags.
release = '0.9.0'
release = '0.9.1'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
2 changes: 1 addition & 1 deletion docs/src/gensim_theme/layout.html
@@ -135,7 +135,7 @@ <h3><a href="http://radimrehurek.com/">Get Expert Help</a></h3>
<a href="{{ pathto('index') }}"><img src="{{ pathto('_static/images/gensim-footer.png', 1) }}" alt="gensim footer image" title="Gensim home" /></a>

<div class="copyright">
&copy; Copyright 2009-2014, <a href="mailto:radimrehurek@seznam.cz" style="color:white"> Radim Řehůřek</a>
&copy; Copyright 2009-now, <a href="mailto:radimrehurek@seznam.cz" style="color:white"> Radim Řehůřek</a>
<br />
{%- if last_updated %}
{% trans last_updated=last_updated|e %}Last updated on {{ last_updated }}.{% endtrans %}
2 changes: 1 addition & 1 deletion docs/src/intro.rst
@@ -4,7 +4,7 @@
Introduction
============

Gensim is a :ref:`free <availability>` Python framework designed to automatically extract semantic
Gensim is a :ref:`free <availability>` Python library designed to automatically extract semantic
topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.


8 changes: 8 additions & 0 deletions docs/src/models/ldamallet.rst
@@ -0,0 +1,8 @@
:mod:`models.ldamallet` -- Latent Dirichlet Allocation via Mallet
=================================================================

.. automodule:: gensim.models.ldamallet
:synopsis: Latent Dirichlet Allocation via Mallet
:members:
:inherited-members:

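The wrapper shells out to a local Mallet installation; a usage sketch, where the Mallet path and the toy data are assumptions for illustration::

    from gensim.corpora.dictionary import Dictionary
    from gensim.models.ldamallet import LdaMallet

    texts = [["human", "computer", "interface"], ["graph", "trees", "minors"]]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    mallet_path = '/home/user/mallet-2.0.7/bin/mallet'   # adjust to your Mallet install
    lda = LdaMallet(mallet_path, corpus=corpus, num_topics=2, id2word=dictionary)
    print(lda.show_topics())                             # the new printing/showing support
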
4 changes: 2 additions & 2 deletions gensim/corpora/bleicorpus.py
@@ -109,7 +109,7 @@ def save_corpus(fname, corpus, id2word=None, metadata=False):
num_terms = 1 + max([-1] + id2word.keys())

logger.info("storing corpus in Blei's LDA-C format into %s" % fname)
with utils.smart_open(fname, 'wb') as fout:
with utils.smart_open(fname, 'w') as fout:
offsets = []
for doc in corpus:
doc = list(doc)
@@ -119,7 +119,7 @@ def save_corpus(fname, corpus, id2word=None, metadata=False):
# write out vocabulary, in a format compatible with Blei's topics.py script
fname_vocab = fname + '.vocab'
logger.info("saving vocabulary of %i words to %s" % (num_terms, fname_vocab))
with open(fname_vocab, 'w') as fout:
with open(fname_vocab, 'wb') as fout:
for featureid in xrange(num_terms):
fout.write("%s\n" % utils.to_utf8(id2word.get(featureid, '---')))

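For context, the usual entry point to this format is the `serialize` classmethod; a small sketch (file names are illustrative)::

    from gensim.corpora.bleicorpus import BleiCorpus

    corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 3)]]       # bag-of-words documents
    BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)    # also writes /tmp/corpus.lda-c.vocab

    for doc in BleiCorpus('/tmp/corpus.lda-c'):          # stream the documents back
        print(doc)
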
6 changes: 4 additions & 2 deletions gensim/corpora/dictionary.py
@@ -19,6 +19,7 @@

import logging
import itertools
import UserDict

from gensim import utils
from gensim._six import iteritems, iterkeys, itervalues, string_types
@@ -29,7 +30,7 @@
logger = logging.getLogger('gensim.corpora.dictionary')


class Dictionary(utils.SaveLoad, dict):
class Dictionary(utils.SaveLoad, UserDict.DictMixin):
"""
Dictionary encapsulates the mapping between normalized words and their integer ids.
@@ -238,7 +239,7 @@ def save_as_text(self, fname):
Note: use `save`/`load` to store in binary format instead (pickle).
"""
logger.info("saving dictionary mapping to %s" % fname)
with utils.smart_open(fname, 'wb') as fout:
with utils.smart_open(fname, 'w') as fout:
for token, tokenid in sorted(iteritems(self.token2id)):
fout.write("%i\t%s\t%i\n" % (tokenid, token, self.dfs.get(tokenid, 0)))

@@ -336,6 +337,7 @@ def from_corpus(corpus):
# now make sure length(result) == get_max_id(corpus) + 1
for i in xrange(max_id + 1):
result.token2id[str(i)] = i
result.dfs[i] = result.dfs.get(i, 0)

logger.info("built %s from %i documents (total %i corpus positions)" %
(result, result.num_docs, result.num_pos))
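
A short sketch of the `from_corpus` path touched above: after this change, every feature id also gets a document-frequency entry (the toy corpus is illustrative)::

    from gensim.corpora.dictionary import Dictionary

    corpus = [[(0, 1), (1, 1)], [(1, 2), (2, 1)]]   # bag-of-words docs over ids 0..2
    d = Dictionary.from_corpus(corpus)

    print(d.dfs)    # document frequency now defined for every id, e.g. {0: 1, 1: 2, 2: 1}
    print(len(d))   # 3 pseudo-tokens: '0', '1', '2'
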
4 changes: 2 additions & 2 deletions gensim/corpora/hashdictionary.py
@@ -87,7 +87,7 @@ def restricted_hash(self, token):
Calculate id of the given token. Also keep track of what words were mapped
to what ids, for debugging reasons.
"""
h = self.myhash(token) % self.id_range
h = self.myhash(utils.to_utf8(token)) % self.id_range
if self.debug:
self.token2id[token] = h
self.id2token.setdefault(h, set()).add(token)
@@ -222,7 +222,7 @@ def save_as_text(self, fname):
Note: use `save`/`load` to store in binary format instead (pickle).
"""
logger.info("saving HashDictionary mapping to %s" % fname)
with utils.smart_open(fname, 'wb') as fout:
with utils.smart_open(fname, 'w') as fout:
for tokenid in self.keys():
words = sorted(self[tokenid])
if words:
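
The hashing change above routes tokens through `utils.to_utf8`, so unicode and byte strings map to the same id. A usage sketch (the toy text is illustrative)::

    from gensim.corpora.hashdictionary import HashDictionary

    texts = [["human", "interface", "computer"]]
    hd = HashDictionary(texts, id_range=32000, debug=True)

    # ids come from hashing the UTF-8 form of each token; no vocabulary pass is needed
    print(hd.doc2bow(["human", "computer", "computer"]))
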
4 changes: 2 additions & 2 deletions gensim/corpora/indexedcorpus.py
@@ -101,8 +101,8 @@ def serialize(serializer, fname, corpus, id2word=None, index_fname=None, progres

def __len__(self):
"""
Return cached corpus length if the corpus is indexed. Otherwise delegate
`len()` call to base class.
Return the index length if the corpus is indexed. Otherwise, make a pass
over self to calculate the corpus length and cache this number.
"""
if self.index is not None:
return len(self.index)
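
In practice the index is written by `serialize()`, after which `len()` is an O(1) lookup; a sketch (the file name is illustrative)::

    from gensim.corpora.mmcorpus import MmCorpus

    corpus = [[(0, 1.0)], [(0, 2.0), (1, 1.0)]]
    MmCorpus.serialize('/tmp/corpus.mm', corpus)   # writes the .mm file plus a .mm.index file

    mm = MmCorpus('/tmp/corpus.mm')
    print(len(mm))                                 # 2, read from the cached index
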
7 changes: 4 additions & 3 deletions gensim/corpora/lowcorpus.py
@@ -116,11 +116,12 @@ def line2doc(self, line):
use_words.append(word)
marker.add(word)
# construct a list of (wordIndex, wordFrequency) 2-tuples
doc = zip(map(self.word2id.get, use_words), map(words.count, use_words)) # using list.count is suboptimal but speed of this whole function is irrelevant
doc = list(zip(map(self.word2id.get, use_words),
map(words.count, use_words)))
else:
uniq_words = set(words)
# construct a list of (word, wordFrequency) 2-tuples
doc = zip(uniq_words, map(words.count, uniq_words)) # using list.count is suboptimal but that's irrelevant at this point
doc = list(zip(uniq_words, map(words.count, uniq_words)))

# return the document, then forget it and move on to the next one
# note that this way, only one doc is stored in memory at a time, not the whole corpus
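
The `list(zip(...))` wrappers above are py3k fixes: under Python 3, `zip` returns a lazy, single-use iterator, so the document must be materialized before it can be reused. A standalone illustration (not gensim-specific)::

    words = ["foo", "bar", "foo"]
    uniq_words = set(words)

    pairs = zip(uniq_words, map(words.count, uniq_words))
    doc = list(pairs)   # materialize; under Python 3, `pairs` can be consumed only once
    print(doc)          # e.g. [('foo', 2), ('bar', 1)] (set order is arbitrary)
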
@@ -152,7 +153,7 @@ def save_corpus(fname, corpus, id2word=None, metadata=False):
logger.info("storing corpus in List-Of-Words format into %s" % fname)
truncated = 0
offsets = []
with utils.smart_open(fname, 'wb') as fout:
with utils.smart_open(fname, 'w') as fout:
fout.write('%i\n' % len(corpus))
for doc in corpus:
words = []
2 changes: 1 addition & 1 deletion gensim/corpora/malletcorpus.py
@@ -89,7 +89,7 @@ def save_corpus(fname, corpus, id2word=None, metadata=False):

truncated = 0
offsets = []
with utils.smart_open(fname, 'wb') as fout:
with utils.smart_open(fname, 'w') as fout:
for doc_id, doc in enumerate(corpus):
if metadata:
doc_id, doc_lang = doc[1]
2 changes: 1 addition & 1 deletion gensim/corpora/svmlightcorpus.py
@@ -95,7 +95,7 @@ def save_corpus(fname, corpus, id2word=None, labels=False, metadata=False):
logger.info("converting corpus to SVMlight format: %s" % fname)

offsets = []
with utils.smart_open(fname, 'wb') as fout:
with utils.smart_open(fname, 'w') as fout:
for docno, doc in enumerate(corpus):
label = labels[docno] if labels else 0 # target class is 0 by default
offsets.append(fout.tell())
2 changes: 1 addition & 1 deletion gensim/corpora/textcorpus.py
@@ -33,7 +33,7 @@

from gensim import interfaces, utils
from gensim._six import string_types
from dictionary import Dictionary
from gensim.corpora.dictionary import Dictionary

logger = logging.getLogger('gensim.corpora.textcorpus')

2 changes: 1 addition & 1 deletion gensim/corpora/ucicorpus.py
@@ -211,7 +211,7 @@ def save_corpus(fname, corpus, id2word=None, progress_cnt=10000, metadata=False)
# write out vocabulary
fname_vocab = fname + '.vocab'
logger.info("saving vocabulary of %i words to %s" % (num_terms, fname_vocab))
with open(fname_vocab, 'w') as fout:
with open(fname_vocab, 'wb') as fout:
for featureid in xrange(num_terms):
fout.write("%s\n" % utils.to_utf8(id2word.get(featureid, '---')))

6 changes: 6 additions & 0 deletions gensim/interfaces.py
@@ -55,6 +55,12 @@ def __iter__(self):
raise NotImplementedError('cannot instantiate abstract base class')


def save(self, *args, **kwargs):
import warnings
warnings.warn("corpus.save() stores only the (tiny) iteration object; "
"to serialize the actual corpus content, use e.g. MmCorpus.serialize(corpus)")
super(CorpusABC, self).save(*args, **kwargs)

def __len__(self):
"""
Return the number of documents in the corpus.
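
The new warning steers users toward explicit serialization; a sketch of the recommended pattern (the file name is illustrative)::

    from gensim.corpora.mmcorpus import MmCorpus

    corpus = [[(0, 1.0)], [(1, 2.0)]]

    # for a streamed CorpusABC subclass, corpus.save() stores only the tiny iteration object;
    # to persist the documents themselves, serialize them explicitly:
    MmCorpus.serialize('/tmp/corpus.mm', corpus)
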
15 changes: 10 additions & 5 deletions gensim/matutils.py
@@ -189,7 +189,8 @@ def sparse2full(doc, length):
"""
result = numpy.zeros(length, dtype=numpy.float32) # fill with zeroes (default value)
doc = dict(doc)
result[doc.keys()] = doc.values() # overwrite some of the zeroes with explicit values
# overwrite some of the zeroes with explicit values
result[list(doc)] = list(itervalues(doc))
return result


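Round-tripping between gensim's sparse vectors and dense numpy arrays, using the functions changed above (values are illustrative)::

    from gensim import matutils

    doc = [(0, 0.5), (3, 2.0)]                   # sparse vector: (feature_id, weight) pairs
    dense = matutils.sparse2full(doc, length=5)  # numpy array [0.5, 0., 0., 2., 0.]
    sparse = matutils.full2sparse(dense)         # back to roughly [(0, 0.5), (3, 2.0)]
    print(dense, sparse)
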
@@ -201,7 +202,7 @@ def full2sparse(vec, eps=1e-9):
"""
vec = numpy.asarray(vec, dtype=float)
nnz = numpy.nonzero(abs(vec) > eps)[0]
return zip(nnz, vec.take(nnz))
return list(zip(nnz, vec.take(nnz)))

dense2vec = full2sparse

@@ -217,7 +218,7 @@ def full2sparse_clipped(vec, topn, eps=1e-9):
vec = numpy.asarray(vec, dtype=float)
nnz = numpy.nonzero(abs(vec) > eps)[0]
biggest = nnz.take(argsort(vec.take(nnz), topn))
return zip(biggest, vec.take(biggest))
return list(zip(biggest, vec.take(biggest)))


def corpus2dense(corpus, num_terms, num_docs=None, dtype=numpy.float32):
Expand All @@ -244,7 +245,7 @@ def corpus2dense(corpus, num_terms, num_docs=None, dtype=numpy.float32):

class Dense2Corpus(object):
"""
Treat dense numpy array as a sparse gensim corpus.
Treat dense numpy array as a sparse, streamed gensim corpus.
No data copy is made (changes to the underlying matrix imply changes in the
corpus).
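
A quick sketch of that wrapper (the matrix values are illustrative)::

    import numpy
    from gensim import matutils

    dense = numpy.array([[0.0, 1.0],
                         [2.0, 0.0],
                         [0.5, 0.5]])        # terms x documents
    corpus = matutils.Dense2Corpus(dense)    # columns are streamed as documents, no copy made
    for doc in corpus:
        print(doc)                           # [(1, 2.0), (2, 0.5)], then [(0, 1.0), (2, 0.5)]
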
@@ -336,6 +337,10 @@ def unitvec(vec):


def cossim(vec1, vec2):
"""
Return cosine similarity between two sparse vectors.
The similarity is a number between <-1.0, 1.0>, higher is more similar.
"""
vec1, vec2 = dict(vec1), dict(vec2)
if not vec1 or not vec2:
return 0.0
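
An example of the documented range (the vectors are illustrative)::

    from gensim import matutils

    vec1 = [(0, 1.0), (2, 1.0)]
    vec2 = [(0, 1.0), (1, 1.0)]
    print(matutils.cossim(vec1, vec2))   # 0.5: one shared dimension, each vector has norm sqrt(2)
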
@@ -398,7 +403,7 @@ class MmWriter(object):

def __init__(self, fname):
self.fname = fname
self.fout = open(self.fname, 'w+') # open for both reading and writing
self.fout = open(self.fname, 'wb+') # open for both reading and writing
self.headers_written = False

