Release 4.0.0beta #2993

Merged 230 commits on Nov 1, 2020
Changes from 1 commit. Commits:
f89808d
Merge branch 'master' into develop
mpenkov Sep 23, 2019
2fac325
added release/check_wheels.py (#2610)
mpenkov Sep 29, 2019
26f1e81
Add hacktoberfest-related documentation (#2616)
mpenkov Oct 2, 2019
25f8a42
Fixed #2554 (#2619)
SanthoshBala18 Oct 4, 2019
2131e3a
Properly install Pattern library for documentation build (#2626)
Hiyorimi Oct 8, 2019
a7713aa
Disable Py2.7 builds under Travis, CircleCI and AppVeyor (#2601)
mpenkov Oct 10, 2019
289a6ca
Handling for iterables without 0-th element, fixes #2556 (#2629)
Hiyorimi Oct 10, 2019
3e027c2
Move Py2 deprecation warning to top of changelog (#2627)
mpenkov Oct 11, 2019
e102574
Change find_interlinks return type to list of tuples (#2636)
napsternxg Oct 19, 2019
bcee414
Improve gensim documentation (numfocus) (#2591)
mpenkov Oct 21, 2019
86ed0d8
fix setup.py to get documentation to build under CircleCI (#2650)
mpenkov Oct 24, 2019
e228a93
Fix links to documentation in README.md (#2646)
mpenkov Oct 24, 2019
1894339
Delete requirements.txt (#2648)
mpenkov Oct 24, 2019
e859c11
Remove native Python implementations of Cython extensions (#2630)
mpenkov Oct 25, 2019
34ee98b
replacing deleted notebooks with placeholders (#2654)
mpenkov Oct 29, 2019
ee61691
Document accessing model's vocabulary (#2661)
mpenkov Nov 1, 2019
44ea793
Improve explanation of top_chain_var parameter in Dynamic Topic Model…
joelowj Nov 3, 2019
3d65961
Comment out Hacktober Fest from README (#2677)
piskvorky Nov 11, 2019
f72a55d
Update word2vec2tensor.py (#2678)
kaddynator Nov 18, 2019
1052b9b
Speed up word2vec model loading (#2671)
lopusz Nov 18, 2019
e7c9f0e
Fix local import degrading the performance of word2vec model loading …
lopusz Nov 21, 2019
e391f0c
[Issue-2670] Bug fix: Initialize doc_no2 because it is not set when c…
paulrigor Nov 23, 2019
de0dcc3
Warn when BM25.average_idf < 0 (#2687)
Witiko Dec 2, 2019
36ae46f
Rerun Soft Cosine Measure tutorial notebook (#2691)
Witiko Dec 21, 2019
cc8188c
Fix simple typo: voacab -> vocab (#2719)
timgates42 Jan 1, 2020
12897cb
Fix appveyor builds (#2706)
mpenkov Jan 1, 2020
74a375d
Change similarity strategy when finding n best (#2720)
svinkapeppa Jan 5, 2020
f022028
Initialize self.cfs in Dictionary.compatify method (#2618)
SanthoshBala18 Jan 5, 2020
3d129de
Fix ValueError when instantiating SparseTermSimilarityMatrix (#2689)
ptorrestr Jan 6, 2020
3abcb9f
Refactor bm25 to include model parametrization (cont.) (#2722)
Witiko Jan 8, 2020
fbc7d09
Fix overflow error for `*Vec` corpusfile-based training (#2700)
persiyanov Jan 8, 2020
4d22327
Implement saving to Facebook format (#2712)
lopusz Jan 23, 2020
4710308
Use time.time instead of time.clock in gensim/models/hdpmodel.py (#2730)
tarohi24 Jan 23, 2020
d05259a
better replacement of deprecated .clock()
gojomo Jan 27, 2020
b8346c1
drop py35, add py38 (travis), update explicit dependency versions
gojomo Jan 27, 2020
f5e05d0
better CI logs w/ gdb after core dump
gojomo Jan 27, 2020
0e624c1
improved comments via piskvorky review
gojomo Jan 27, 2020
9352dad
Merge pull request #2715 from gojomo/py38-plus-build-tuning
gojomo Jan 28, 2020
47a0675
rm autogenerated *.cpp files that shouldn't be in source control
gojomo Jan 29, 2020
8d79794
Fix TypeError when using the -m flag (#2734)
Tenoke Jan 30, 2020
b92e087
del cython.sh
gojomo Jan 31, 2020
68ec5b8
Merge pull request #2739 from gojomo/rm-cpp-files
gojomo Feb 24, 2020
0d75f2d
Improve documentation in run_similarity_queries example (#2770)
MartinoMensio Mar 21, 2020
cb3d87c
Fix fastText word_vec() for OOV words with use_norm=True (#2764)
avidale Mar 21, 2020
493e52f
remove mention of py27 (#2751)
mattf Mar 21, 2020
30ca5b3
Fix KeyedVectors.add matrix type (#2761)
menshikh-iv Mar 21, 2020
f767e1e
use collections.abc for Mapping (#2750)
mattf Mar 21, 2020
1b3ad81
Fix out of range issue in gensim.summarization.keywords (#2738)
carterols Mar 21, 2020
a811a23
fixed get_keras_embedding, now accepts word mapping (#2676)
Hamekoded Mar 21, 2020
8a2e2a7
Add downloads badge to README
piskvorky Mar 22, 2020
de0ef26
Get rid of "wheels" badge
piskvorky Mar 22, 2020
a4894bb
link downloads badge to pepy instead of pypi
piskvorky Mar 23, 2020
d952a51
fix broken english in tests (#2773)
piskvorky Mar 23, 2020
ec222e8
fix build, use KeyedVectors class (#2774)
mpenkov Mar 24, 2020
a6247af
cElementTree has been deprecated since Python 3.3 and removed in Pyth…
tirkarthi Mar 30, 2020
a2ec4c3
Fix FastText RAM usage in tests (+ fixes for wheel building) (#2791)
menshikh-iv Apr 13, 2020
10cec93
Fix typo in comments: the rows of the corpus are actually documents, …
Chenxin-Guo96 Apr 17, 2020
5b5b545
Add osx+py38 case for avoid multiprocessing issue (#2800)
menshikh-iv Apr 20, 2020
7f194c9
Use nicer twitter badge
piskvorky Apr 22, 2020
db11c14
Use downloads badge from shields.io
piskvorky Apr 22, 2020
188a590
Use blue in badges
piskvorky Apr 22, 2020
63dc990
Remove conda-forge badge
piskvorky Apr 22, 2020
8791bb7
Make twitter badge blue, too
piskvorky Apr 22, 2020
2a04825
Merge branch 'develop' into piskvorky-patch-1
piskvorky Apr 22, 2020
ca726c6
Merge pull request #2772 from RaRe-Technologies/piskvorky-patch-1
piskvorky Apr 22, 2020
68bd860
Cache badges
piskvorky Apr 23, 2020
fd3537a
Use HTML comments instead of Markdown comment
piskvorky Apr 23, 2020
d70b129
Merge pull request #2806 from RaRe-Technologies/piskvorky-patch-1
piskvorky Apr 24, 2020
585b0c0
Merge branch 'develop' into fix-xml
piskvorky Apr 24, 2020
47357de
Merge pull request #2799 from Chenxin-Guo/develop
piskvorky Apr 24, 2020
996801b
Merge pull request #2777 from tirkarthi/fix-xml
piskvorky Apr 24, 2020
29d1092
[MRG] Update README instructions + clean up testing (#2814)
piskvorky May 1, 2020
ace6c34
Add basic yml file for setup pipeline (will fail)
menshikh-iv May 4, 2020
b3b844e
revert back travis
menshikh-iv May 4, 2020
93385d3
Replace AppVeyor by Azure Pipelines (#2824)
menshikh-iv May 6, 2020
d692b9d
Update CHANGELOG.md (#2829)
mpenkov May 7, 2020
0027fb5
Update CHANGELOG.md (#2831)
mpenkov May 9, 2020
ceecef3
Fix-2253: Remove docker folder since it fails to build (#2833)
FyzHsn May 14, 2020
69732eb
LdaModel documentation update -remove claim that it accepts CSC matri…
FyzHsn May 14, 2020
2360459
delete .gitattributes (#2836)
gojomo May 14, 2020
e75f6c8
Fix for Python 3.9/3.10: remove xml.etree.cElementTree (#2846)
hugovk May 24, 2020
8149035
Correct grammar in docs (#2573)
shivdhar Jun 10, 2020
374de28
Don't proxy-cache badges with Google Images (#2854)
piskvorky Jun 15, 2020
42be086
pin keras=2.3.1 because 2.4.3 causes KerasWord2VecWrappper test failu…
gojomo Jun 27, 2020
a74f8e3
Expose max_final_vocab parameter in FastText constructor (#2867)
mpenkov Jun 27, 2020
c888b7a
Replace numpy.random.RandomState with SFC64 - for speed (#2864)
zygm0nt Jun 29, 2020
fff82aa
Update CHANGELOG.md
mpenkov Jun 29, 2020
1228ebe
Clarify that license is LGPL-2.1 (#2871)
pombredanne Jul 18, 2020
78e48b7
Fix travis issues for latest keras versions. (#2869)
dsandeep0138 Jul 18, 2020
4cdf228
Put cell outputs back to the soft cosine measure benchmark notebook (…
Witiko Jul 18, 2020
c0e0169
KeyedVectors & *2Vec API streamlining, consistency (#2698)
gojomo Jul 19, 2020
30af573
Delete .gitattributes
gojomo Jul 21, 2020
5c08d3e
Merge remote-tracking branch 'upstream/develop' into develop
gojomo Jul 21, 2020
3f7047f
test showing FT failure as W2V
gojomo Jul 22, 2020
ac9126d
set .vectors even when ngrams off
gojomo Jul 22, 2020
0316084
use _save_specials/_load_specials per type
gojomo Jul 22, 2020
03c8bb9
Make docs clearer on `alpha` parameter in LDA model
xh2 Jul 24, 2020
7791b74
Merge pull request #1 from xh2/patch-1
xh2 Jul 24, 2020
4e1b09c
Update Hoffman paper link
xh2 Jul 24, 2020
25005c5
rm whitespace
gojomo Jul 26, 2020
f34956c
Update gensim/models/ldamodel.py
piskvorky Jul 26, 2020
7d0ef9e
Update gensim/models/ldamodel.py
piskvorky Jul 26, 2020
a662e8d
Merge pull request #2896 from xh2/bugfix/lda-doc-alpha
piskvorky Jul 26, 2020
78778a9
Update gensim/models/ldamodel.py
piskvorky Jul 26, 2020
344c4ab
Merge pull request #2897 from xh2/bugfix/hoffman-paper-link
piskvorky Jul 26, 2020
b70c826
re-applying changes from #2821
piskvorky Jul 26, 2020
a81e547
migrating + regenerating changed docs
piskvorky Jul 26, 2020
78fe1c4
fix forgotten iteritems
piskvorky Jul 26, 2020
a0e40ca
remove extra `model.wv`
piskvorky Jul 26, 2020
4cf4da0
split overlong doc line
piskvorky Jul 26, 2020
161ad55
get rid of six in doc2vec
piskvorky Jul 27, 2020
31d2b87
increase test timeout for Visdom server
piskvorky Jul 27, 2020
bc95bcb
add 32/64 bits report
gojomo Jul 29, 2020
c834e06
add deprecations for init_sims()
piskvorky Jul 30, 2020
172e37f
remove vectors_norm + add link to migration guide to deprecation warn…
piskvorky Jul 30, 2020
3919b68
rename vectors_norm everywhere, update tests, regen docs
piskvorky Jul 30, 2020
d40f685
put back no-op property setter of deprecated vectors_norm
piskvorky Jul 30, 2020
872c8ed
fix typo
piskvorky Jul 30, 2020
4c1b3f7
fix flake8
piskvorky Jul 30, 2020
b39eec2
disable Keras tests
piskvorky Jul 30, 2020
d5556ea
Merge pull request #2899 from RaRe-Technologies/pr2821
piskvorky Jul 30, 2020
f2fd045
test showing FT failure as W2V
gojomo Jul 22, 2020
7ab1501
set .vectors even when ngrams off
gojomo Jul 22, 2020
ce16168
Update gensim/test/test_fasttext.py
piskvorky Jul 26, 2020
779fe46
Update gensim/test/test_fasttext.py
piskvorky Jul 26, 2020
9289c3b
refresh docs for run_annoy tutorial
piskvorky Aug 3, 2020
4b7e372
Merge pull request #2910 from RaRe-Technologies/rerun_tutorial
piskvorky Aug 3, 2020
b308883
Reduce memory use of the term similarity matrix constructor, deprecat…
Witiko Aug 7, 2020
28a2110
Fix doc2vec crash for large sets of doc-vectors (#2907)
gojomo Aug 17, 2020
817cac9
Fix AttributeError in WikiCorpus (#2901)
jenishah Aug 17, 2020
e9bb3a7
Corrected info about elements of the job queue
lunastera Sep 2, 2020
320cacd
Add unused args of `_update_alpha`
lunastera Sep 2, 2020
fc4b97f
intensify cbow+hs tests; bulk testing method
gojomo Sep 2, 2020
030e650
use increment operator
gojomo Sep 2, 2020
6e0d00b
Change num_words to topn in dtm_coherence (#2926)
MeganStodel Sep 3, 2020
63f977a
Integrate what is essentially the same process
lunastera Sep 4, 2020
d524fa4
Merge branch 'develop' into 2vec_saveload_fixes
piskvorky Sep 7, 2020
49b35b7
docstring fixes
piskvorky Sep 7, 2020
9cd72f5
Merge pull request #2931 from lunastera/w2v_fix_jobqueue-info
piskvorky Sep 8, 2020
3f972a6
get rid of python2 constructs
piskvorky Sep 8, 2020
bb947b3
Remove Keras dependency (#2937)
piskvorky Sep 10, 2020
4331ccf
code style fixes while debugging pickle model sizes
piskvorky Sep 13, 2020
34e77dc
Merge branch 'pickle_perambulations' into 2vec_saveload_fixes
piskvorky Sep 13, 2020
012d598
py2 to 3: get rid of forgotten range
piskvorky Sep 13, 2020
eefe9ab
fix docs
piskvorky Sep 13, 2020
1a9b646
get rid of numpy.str_
piskvorky Sep 14, 2020
09b7e94
Fix deprecations in SoftCosineSimilarity (#2940)
Witiko Sep 16, 2020
cddf3c1
Fix "generator" language in word2vec docs (#2935)
polm Sep 16, 2020
08a61e5
Bump minimum Python version to 3.6 (#2947)
gojomo Sep 17, 2020
c14456d
Merge remote-tracking branch 'origin/develop' into 2vec_saveload_fixes
piskvorky Sep 19, 2020
06aef75
fix index2entity, fix docs, hard-fail deprecated properties
piskvorky Sep 19, 2020
5e21560
fix typos + more doc fixes + fix failing tests
piskvorky Sep 19, 2020
51cae68
more index2word => index_to_key fixes
piskvorky Sep 19, 2020
17da21e
finish method renaming
piskvorky Sep 19, 2020
f0cade1
Update gensim/models/word2vec.py
piskvorky Sep 19, 2020
6fa5a1b
a few more style fixes
piskvorky Sep 19, 2020
e95ac0a
fix nonsensical word2vec path examples
piskvorky Sep 20, 2020
dc9c3fc
more doc fixes
piskvorky Sep 20, 2020
da8847a
`it` => `itertools`, + code style fixes
piskvorky Sep 24, 2020
c6c24ea
Merge pull request #2939 from RaRe-Technologies/2vec_saveload_fixes
piskvorky Sep 24, 2020
e210f73
Refactor ldamulticore to serialize less data (#2300)
horpto Sep 26, 2020
f0788ad
new docs theme
dvorakvaclav Sep 23, 2020
0f64151
redo copy on web index page
piskvorky Sep 26, 2020
9ddf9a2
fix docs in KeyedVectors
piskvorky Sep 27, 2020
de66bb1
clean up docs structure
piskvorky Sep 28, 2020
65294ec
homepage header update, social panel and new favicon
dvorakvaclav Sep 29, 2020
469abd7
fix flake8
piskvorky Sep 29, 2020
17f884d
reduce space under code section
piskvorky Sep 29, 2020
e2727c6
Merge pull request #2954 from friendlystudio/new_docs_theme
piskvorky Sep 30, 2020
156c5c0
fix images in core tutorials
piskvorky Sep 30, 2020
0c0f358
Merge remote-tracking branch 'origin/develop' into migrate_tutorials
piskvorky Sep 30, 2020
502b654
WIP: migrating tutorials to 4.0
piskvorky Sep 30, 2020
fd6b408
fix doc2vec tutorial FIXMEs
piskvorky Sep 30, 2020
70d4338
add autogenerated docs
piskvorky Oct 1, 2020
0936e45
fixing flake8 errors
piskvorky Oct 1, 2020
683cebe
Merge pull request #2968 from RaRe-Technologies/migrate_tutorials
piskvorky Oct 1, 2020
2dcaaf8
remove gensim.summarization subpackage, docs and test data (#2958)
mpenkov Oct 3, 2020
8874de1
reuse from test.utils
gojomo Sep 12, 2020
baee8e7
test re-saving-native-FT after update-vocab (#2853)
gojomo Sep 12, 2020
4ca5b78
avoid buggy shared list use (#2943)
gojomo Sep 12, 2020
eab3302
pre-assert save_facebook_model anomaly
gojomo Sep 13, 2020
eba73da
unittest.skipIf instead of pytest.skipIf
gojomo Sep 13, 2020
8e9d202
refactor init/update vectors/vectors_vocab; bulk randomization
gojomo Sep 13, 2020
81b9d14
unify/correct Word2Vec & FastText corpus/train parameter checking
gojomo Sep 14, 2020
bcf4f1e
suggestions from code review
gojomo Sep 15, 2020
a51818b
improve train() corpus_iterable parameter doc-comment
gojomo Sep 16, 2020
8687e7f
disable pytest-rerunfailures due to https://github.com/pytest-dev/pyt…
gojomo Sep 28, 2020
dda970e
comment clarity from review
gojomo Oct 6, 2020
e090400
specify dtype to avoid interim float64
gojomo Oct 6, 2020
1edbb4c
use inefficient-but-all-tests-pass 'uniform' for now, w/ big FIXME co…
gojomo Oct 6, 2020
b40c601
refactor phrases
piskvorky Oct 8, 2020
02354cd
float32 random; diversified dv seed; disable bad test
gojomo Oct 8, 2020
b2a5a0d
double-backticks
gojomo Oct 10, 2020
1c59aad
inline seed diversifier; unittest.skip
gojomo Oct 10, 2020
6f4053b
fix phrases tests
piskvorky Oct 10, 2020
092b512
clean up rendered docs for phrases
piskvorky Oct 10, 2020
1acb47c
fix sklearn_api.phrases tests + docs
piskvorky Oct 10, 2020
aaa79dd
fix flake8 warnings in docstrings
piskvorky Oct 10, 2020
0596dbd
Merge pull request #2976 from RaRe-Technologies/fix_phrases
piskvorky Oct 10, 2020
8166081
rename export_phrases to find_phrases + add actual export_phrases
piskvorky Oct 10, 2020
4879e52
skip common english words by default in phrases
piskvorky Oct 10, 2020
9e503a4
sphinx doesn't allow custom section titles :(
piskvorky Oct 10, 2020
9cd75c3
use FIXME for comments/doc-comments/names that must change pre-4.0.0
gojomo Oct 10, 2020
2784599
ignore conjunctions in phrases
piskvorky Oct 11, 2020
6baaa74
make ENGLISH_COMMON_TERMS optional
piskvorky Oct 12, 2020
b00b393
fix typo
piskvorky Oct 12, 2020
7c17577
docs: use full version as the "short version"
piskvorky Oct 12, 2020
75caa93
phrases: rename common_terms => connector_words
piskvorky Oct 12, 2020
22f4bc2
fix typo
piskvorky Oct 12, 2020
15d8261
ReST does not support nested markup
piskvorky Oct 12, 2020
c3b7f97
make flake8 shut up
piskvorky Oct 12, 2020
d6fc1b1
improve HTML doc formatting for consecutive paragraphs
piskvorky Oct 14, 2020
1d0a8bd
Merge pull request #2979 from RaRe-Technologies/phrases_common_words
piskvorky Oct 14, 2020
ea87470
Merge pull request #2944 from gojomo/ft_save_after_update_vocab
piskvorky Oct 15, 2020
4a2548a
fix typos
piskvorky Oct 18, 2020
c8d59b0
add benchmark script
piskvorky Oct 18, 2020
8d7dde2
silence flake8
piskvorky Oct 18, 2020
87ad617
Merge pull request #2982 from RaRe-Technologies/fix_2887
piskvorky Oct 18, 2020
839b1d3
remove dependency on `six`
piskvorky Oct 18, 2020
86fe8ef
regen tutorials
piskvorky Oct 18, 2020
94a227b
Merge pull request #2984 from RaRe-Technologies/remove_six
piskvorky Oct 19, 2020
0d1d054
Notification at the top of page in documentation
dvorakvaclav Oct 26, 2020
b0b2e38
Update notification.html
piskvorky Oct 26, 2020
60a8f7f
Merge pull request #2992 from friendlystudio/docs_notification
piskvorky Oct 26, 2020
e4199cb
Update changelog for 4.0.0 release (#2981)
mpenkov Oct 28, 2020
329adf2
bumped version to 4.0.0beta
mpenkov Oct 30, 2020
371e2c5
remove reference to cython.sh
mpenkov Oct 30, 2020
4895c64
update link in readme
mpenkov Oct 30, 2020
18d519c
Merge branch 'master' into release-4.0.0beta
mpenkov Oct 31, 2020
3918704
clean up merge artifact
mpenkov Oct 31, 2020
phrases: rename common_terms => connector_words
piskvorky committed Oct 12, 2020
commit 75caa931444f386b9ae00ff7974bd8d1cc277034
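The commit below is a pure rename of the `common_terms` parameter and attribute to `connector_words`. As a hedged illustration of how calling code could bridge the two spellings, here is a hypothetical helper (not part of gensim) that maps the pre-4.0 keyword to the 4.0 name:

```python
# Hypothetical compatibility helper (not part of gensim): maps the pre-4.0
# `common_terms` keyword argument to the 4.0 name `connector_words`.
def normalize_phrases_kwargs(kwargs):
    if "common_terms" in kwargs and "connector_words" not in kwargs:
        kwargs = dict(kwargs)  # copy so the caller's dict is not mutated
        kwargs["connector_words"] = kwargs.pop("common_terms")
    return kwargs
```

Code written against the old API could pass its keyword arguments through such a helper before constructing `Phrases`, rather than editing every call site at once.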
112 changes: 45 additions & 67 deletions gensim/models/phrases.py
@@ -21,7 +21,7 @@

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, ENGLISH_COMMON_TERMS
>>> from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
>>>
>>> # Create training corpus. Must be a sequence of sentences (e.g. an iterable or a generator).
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
@@ -31,7 +31,7 @@
['computer', 'human', 'interface', 'computer', 'response', 'survey', 'system', 'time', 'user', 'interface']
>>>
>>> # Train a toy phrase model on our training corpus.
>>> phrase_model = Phrases(sentences, min_count=1, threshold=1, common_terms=ENGLISH_COMMON_TERMS)
>>> phrase_model = Phrases(sentences, min_count=1, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)
>>>
>>> # Apply the trained phrases model to a new, unseen sentence.
>>> new_sentence = ['trees', 'graph', 'minors']
@@ -75,10 +75,10 @@

NEGATIVE_INFINITY = float('-inf')

# Set of common English words. Tokens from this set are "ignored" during phrase detection:
# 1) Phrases may not start with these words.
# Words from this set are "ignored" during phrase detection:
# 1) Phrases may not start nor end with these words.
# 2) Phrases may include any number of these words inside.
ENGLISH_COMMON_TERMS = frozenset(
ENGLISH_CONNECTOR_WORDS = frozenset(
" a an the " # articles; we never care about these in MWEs
" for of with without at from to in on by " # prepositions; incomplete on purpose, to minimize FNs
" and or " # conjunctions; incomplete on purpose, to minimize FNs
@@ -209,8 +209,8 @@ class _PhrasesTransformation(interfaces.TransformationABC):
:class:`~gensim.models.phrases.FrozenPhrases`.

"""
def __init__(self, common_terms):
self.common_terms = frozenset(common_terms)
def __init__(self, connector_words):
self.connector_words = frozenset(connector_words)

def score_candidate(self, word_a, word_b, in_between):
"""Score a single phrase candidate.
@@ -241,8 +241,8 @@ def analyze_sentence(self, sentence):
"""
start_token, in_between = None, []
for word in sentence:
if word not in self.common_terms:
# The current word is a normal token, not a stop word, which means it's a potential
if word not in self.connector_words:
# The current word is a normal token, not a connector word, which means it's a potential
# beginning (or end) of a phrase.
if start_token:
# We're inside a potential phrase, of which this word is the end.
@@ -258,14 +258,14 @@
yield w, None
start_token, in_between = word, [] # new potential phrase starts here
else:
# Not inside a potential bigram yet; start a new potential bigram here.
# Not inside a phrase yet; start a new phrase candidate here.
start_token, in_between = word, []
else: # We're a stop word.
else: # We're a connector word.
if start_token:
# We're inside a potential bigram: add the stopword and keep growing the phrase.
# We're inside a potential phrase: add the connector word and keep growing the phrase.
in_between.append(word)
else:
# Not inside a bigram: emit the stopword and move on. Phrases never begin with a stopword.
# Not inside a phrase: emit the connector word and move on.
yield word, None
# Emit any non-phrase tokens at the end.
if start_token:
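The control flow in the hunk above is a small state machine: a non-connector word either opens a phrase candidate, closes one, or restarts one; a connector word either grows a candidate or is emitted alone. A self-contained sketch of that logic (simplified, not gensim's implementation; the `is_phrase` callback stands in for the real scoring function):

```python
# Simplified sketch of the connector-word state machine from the diff above
# (not gensim's actual code; `is_phrase` stands in for the scoring logic).
def analyze_sentence(sentence, connector_words, is_phrase):
    start_token, in_between = None, []
    for word in sentence:
        if word not in connector_words:
            if start_token:
                # Inside a potential phrase; this word may close it.
                if is_phrase(start_token, word, in_between):
                    yield "_".join([start_token, *in_between, word])
                    start_token, in_between = None, []
                else:
                    # Candidate failed: emit its tokens, restart a phrase here.
                    for w in [start_token, *in_between]:
                        yield w
                    start_token, in_between = word, []
            else:
                # Not inside a phrase yet; this word may open one.
                start_token, in_between = word, []
        elif start_token:
            # Connector word inside a potential phrase: keep growing it.
            in_between.append(word)
        else:
            # Connector word outside any phrase: emit it unchanged.
            yield word
    # Flush any trailing, unfinished candidate.
    if start_token:
        for w in [start_token, *in_between]:
            yield w
```

Note how a leading connector word is emitted immediately and a trailing one is flushed unjoined, which is exactly why phrases can neither start nor end with a connector word.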
@@ -320,10 +320,10 @@ def find_phrases(self, sentences):

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, ENGLISH_COMMON_TERMS
>>> from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, min_count=1, threshold=0.1, common_terms=ENGLISH_COMMON_TERMS)
>>> phrases = Phrases(sentences, min_count=1, threshold=0.1, connector_words=ENGLISH_CONNECTOR_WORDS)
>>>
>>> for phrase, score in phrases.find_phrases(sentences).items():
... print(phrase, score)
@@ -389,13 +389,17 @@ def load(cls, *args, **kwargs):
model.scoring = npmi_scorer
else:
raise ValueError(f'failed to load {cls.__name__} model, unknown scoring "{model.scoring}"')
# Initialize new attributes to default values.
if not hasattr(model, "common_terms"):

# common_terms didn't exist pre-3.?, and was renamed to connector_words in 4.0.0.
if hasattr(model, "common_terms"):
model.connector_words = model.common_terms
del model.common_terms
else:
logger.warning(
'older version of %s loaded without common_terms attribute, setting it to empty set',
'older version of %s loaded without common_terms attribute, setting connector_words to an empty set',
cls.__name__,
)
model.common_terms = frozenset()
model.connector_words = frozenset()

if not hasattr(model, 'corpus_word_count'):
logger.warning('older version of %s loaded without corpus_word_count', cls.__name__)
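The backward-compatibility shim in `load()` above follows a common unpickling pattern: detect the old attribute, move its value to the new name, and only fall back to a default when neither attribute exists. A minimal standalone sketch (hypothetical `OldModel` stand-in, not gensim classes):

```python
class OldModel:
    """Stand-in for an object unpickled from a pre-4.0 save (hypothetical)."""


def migrate_connector_words(model):
    if hasattr(model, "common_terms"):
        # Old attribute present: rename it, preserving the stored value.
        model.connector_words = model.common_terms
        del model.common_terms
    else:
        # Very old save with neither attribute: default to an empty set.
        model.connector_words = frozenset()
    return model


old = OldModel()
old.common_terms = frozenset({"of", "the"})
migrated = migrate_connector_words(old)
```

Deleting the old attribute (rather than keeping both) ensures re-saving the model writes only the new name, so the shim is needed at most once per model.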
@@ -423,7 +427,7 @@ class Phrases(_PhrasesTransformation):
def __init__(
self, sentences=None, min_count=5, threshold=10.0,
max_vocab_size=40000000, delimiter='_', progress_per=10000,
scoring='default', common_terms=frozenset(),
scoring='default', connector_words=frozenset(),
):
"""

@@ -453,27 +457,29 @@ def __init__(

#. "default" - :func:`~gensim.models.phrases.original_scorer`.
#. "npmi" - :func:`~gensim.models.phrases.npmi_scorer`.
common_terms : set of str, optional
Set of "stop words" that may be included within a phrase, without affecting its scoring.
connector_words : set of str, optional
Set of words that may be included within a phrase, without affecting its scoring.
No phrase can start nor end with a connector word; a phrase may contain any number of
connector words in the middle.

**If your texts are in English, set ``common_terms=phrases.ENGLISH_COMMON_TERMS``.**
This will cause phrases to include common English articles and prepositions, such
as `bank_of_america` or `eye_of_the_beholder`.
**If your texts are in English, set ``connector_words=phrases.ENGLISH_CONNECTOR_WORDS``.**
This will cause phrases to include common English articles, prepositions and
conjunctions, such as `bank_of_america` or `eye_of_the_beholder`.

For other languages or specific applications domains, use custom ``common_terms``
that make sense there: ``common_terms=frozenset("der die das".split())`` etc.
For other languages or specific application domains, use custom ``connector_words``
that make sense there: ``connector_words=frozenset("der die das".split())`` etc.

Examples
--------
.. sourcecode:: pycon

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, ENGLISH_COMMON_TERMS
>>> from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
>>>
>>> # Load corpus and train a model.
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, min_count=1, threshold=1, common_terms=ENGLISH_COMMON_TERMS)
>>> phrases = Phrases(sentences, min_count=1, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)
>>>
>>> # Use the model to detect phrases in a new sentence.
>>> sent = [u'trees', u'graph', u'minors']
@@ -514,7 +520,7 @@ def __init__(
The scoring function **must be pickleable**.

"""
super().__init__(common_terms=common_terms)
super().__init__(connector_words=connector_words)
if min_count <= 0:
raise ValueError("min_count should be at least 1")

@@ -569,36 +575,8 @@ def __str__(self):
)

@staticmethod
def _learn_vocab(
sentences, max_vocab_size, delimiter='_', common_terms=frozenset(), progress_per=10000,
):
"""Collect unigram and bigram counts from the `sentences` iterable.

Parameters
----------
sentences : iterable of list of str
The `sentences` iterable can be simply a list, but for larger corpora, consider a generator that streams
the sentences directly from disk/network, See :class:`~gensim.models.word2vec.BrownCorpus`,
:class:`~gensim.models.word2vec.Text8Corpus` or :class:`~gensim.models.word2vec.LineSentence`
for such examples.
max_vocab_size : int
Maximum size (number of tokens) of the vocabulary. Used to control pruning of less common words,
to keep memory under control. 40M needs about 3.6GB of RAM. Increase/decrease
`max_vocab_size` depending on how much available memory you have.
delimiter : str, optional
Glue character used to join collocation tokens.
common_terms : set of str, optional
List of "stop words" that won't affect frequency count of phrases containing them.
Allow to detect phrases like "bank_of_america" or "eye_of_the_beholder".
progress_per : int
Log progress once every `progress_per` sentences.

Return
------
(int, dict of (str, int), int)
Number of pruned words, counters for each word/bi-gram, and total number of words.

"""
def _learn_vocab(sentences, max_vocab_size, delimiter, connector_words, progress_per):
"""Collect unigram and bigram counts from the `sentences` iterable."""
sentence_no, total_words, min_reduce = -1, 0, 1
vocab = defaultdict(int)
logger.info("collecting all words and their counts")
@@ -610,7 +588,7 @@ def _learn_vocab(
)
start_token, in_between = None, []
for word in sentence:
if word not in common_terms:
if word not in connector_words:
vocab[word] += 1
if start_token is not None:
phrase_tokens = itertools.chain([start_token], in_between, [word])
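The counting loop above can be condensed into a runnable sketch (simplified: no vocabulary pruning, no progress logging, and not the real `_learn_vocab`): each non-connector word is counted as a unigram, and each pair of consecutive non-connector words, together with any connector words between them, is counted as a joined phrase candidate.

```python
# Simplified sketch of the vocab-counting loop above (no pruning or logging;
# not gensim's actual `_learn_vocab`).
import itertools
from collections import defaultdict


def learn_vocab(sentences, connector_words, delimiter="_"):
    vocab = defaultdict(int)
    total_words = 0
    for sentence in sentences:
        start_token, in_between = None, []
        for word in sentence:
            if word not in connector_words:
                total_words += 1
                vocab[word] += 1
                if start_token is not None:
                    # Candidate phrase: previous anchor + connectors + this word.
                    phrase_tokens = itertools.chain([start_token], in_between, [word])
                    vocab[delimiter.join(phrase_tokens)] += 1
                start_token, in_between = word, []
            elif start_token is not None:
                in_between.append(word)  # connector word inside a candidate
    return vocab, total_words
```

The scorer later compares each candidate's count against the counts of its anchor words, which is why unigrams and candidates share one `vocab` mapping.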
@@ -644,11 +622,11 @@ def add_vocab(self, sentences):

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, ENGLISH_COMMON_TERMS
>>> from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
>>>
>>> # Train a phrase detector from a text corpus.
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, common_terms=ENGLISH_COMMON_TERMS) # train model
>>> phrases = Phrases(sentences, connector_words=ENGLISH_CONNECTOR_WORDS) # train model
>>> assert len(phrases.vocab) == 37
>>>
>>> more_sentences = [
@@ -667,7 +645,7 @@ def add_vocab(self, sentences):
# counts collected in previous learn_vocab runs.
min_reduce, vocab, total_words = self._learn_vocab(
sentences, max_vocab_size=self.max_vocab_size, delimiter=self.delimiter,
progress_per=self.progress_per, common_terms=self.common_terms,
progress_per=self.progress_per, connector_words=self.connector_words,
)

self.corpus_word_count += total_words
@@ -775,11 +753,11 @@ def __init__(self, phrases_model):

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, ENGLISH_COMMON_TERMS
>>> from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
>>>
>>> # Load corpus and train a model.
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, min_count=1, threshold=1, common_terms=ENGLISH_COMMON_TERMS)
>>> phrases = Phrases(sentences, min_count=1, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)
>>>
>>> # Export a FrozenPhrases object that is more efficient but doesn't allow further training.
>>> frozen_phrases = phrases.freeze()
Expand All @@ -791,7 +769,7 @@ def __init__(self, phrases_model):
self.min_count = phrases_model.min_count
self.delimiter = phrases_model.delimiter
self.scoring = phrases_model.scoring
self.common_terms = phrases_model.common_terms
self.connector_words = phrases_model.connector_words
logger.info('exporting phrases from %s', phrases_model)
self.phrasegrams = phrases_model.export_phrases()
logger.info('exported %s', self)
26 changes: 17 additions & 9 deletions gensim/sklearn_api/phrases.py
Original file line number Diff line number Diff line change
@@ -32,7 +32,7 @@
from sklearn.exceptions import NotFittedError

from gensim import models
from gensim.models.phrases import FrozenPhrases
from gensim.models.phrases import FrozenPhrases, ENGLISH_CONNECTOR_WORDS


class PhrasesTransformer(TransformerMixin, BaseEstimator):
@@ -46,7 +46,7 @@ class PhrasesTransformer(TransformerMixin, BaseEstimator):
"""
def __init__(
self, min_count=5, threshold=10.0, max_vocab_size=40000000,
delimiter='_', progress_per=10000, scoring='default', common_terms=frozenset(),
delimiter='_', progress_per=10000, scoring='default', connector_words=frozenset(),
):
"""

@@ -90,9 +90,17 @@ def __init__(

A scoring function without any of these parameters (even if the parameters are not used) will
raise a ValueError on initialization of the Phrases class. The scoring function must be pickleable.
common_terms : set of str, optional
List of "stop words" that won't affect frequency count of expressions containing them.
Allow to detect expressions like "bank_of_america" or "eye_of_the_beholder".
connector_words : set of str, optional
Set of words that may be included within a phrase, without affecting its scoring.
No phrase can start nor end with a connector word; a phrase may contain any number of
connector words in the middle.

**If your texts are in English, set ``connector_words=phrases.ENGLISH_CONNECTOR_WORDS``.**
This will cause phrases to include common English articles, prepositions and
conjunctions, such as `bank_of_america` or `eye_of_the_beholder`.

For other languages or specific application domains, use custom ``connector_words``
that make sense there: ``connector_words=frozenset("der die das".split())`` etc.

"""
self.gensim_model = None
Expand All @@ -103,11 +111,11 @@ def __init__(
self.delimiter = delimiter
self.progress_per = progress_per
self.scoring = scoring
self.common_terms = common_terms
self.connector_words = connector_words

def __setstate__(self, state):
self.__dict__ = state
self.common_terms = frozenset()
self.connector_words = frozenset()
self.phraser = None

def fit(self, X, y=None):
@@ -127,7 +135,7 @@ def fit(self, X, y=None):
self.gensim_model = models.Phrases(
sentences=X, min_count=self.min_count, threshold=self.threshold,
max_vocab_size=self.max_vocab_size, delimiter=self.delimiter,
progress_per=self.progress_per, scoring=self.scoring, common_terms=self.common_terms
progress_per=self.progress_per, scoring=self.scoring, connector_words=self.connector_words,
)
self.phraser = FrozenPhrases(self.gensim_model)
return self
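The wrapper above follows the standard scikit-learn estimator contract: `fit` trains an inner model and returns `self` (so calls can be chained), while `transform` applies the trained model. A dependency-free sketch of that pattern, using a toy adjacent-pair "phrase model" rather than the gensim one:

```python
# Dependency-free sketch of the fit/transform wrapper pattern (toy phrase
# detection by memorized adjacent pairs; not the gensim Phrases model).
class PhraseTransformerSketch:
    def __init__(self, connector_words=frozenset()):
        self.connector_words = set(connector_words)
        self.bigrams = None

    def fit(self, X):
        # Toy "training": remember every adjacent non-connector word pair.
        self.bigrams = {
            (a, b)
            for sent in X
            for a, b in zip(sent, sent[1:])
            if a not in self.connector_words and b not in self.connector_words
        }
        return self  # scikit-learn convention: fit returns self

    def transform(self, X):
        out = []
        for sent in X:
            joined, i = [], 0
            while i < len(sent):
                if i + 1 < len(sent) and (sent[i], sent[i + 1]) in self.bigrams:
                    joined.append(sent[i] + "_" + sent[i + 1])
                    i += 2
                else:
                    joined.append(sent[i])
                    i += 1
            out.append(joined)
        return out
```

Returning `self` from `fit` is what lets the real wrapper participate in scikit-learn pipelines and one-liners like `PhrasesTransformer(...).fit(corpus).transform(corpus)`.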
@@ -184,7 +192,7 @@ def partial_fit(self, X):
self.gensim_model = models.Phrases(
sentences=X, min_count=self.min_count, threshold=self.threshold,
max_vocab_size=self.max_vocab_size, delimiter=self.delimiter,
progress_per=self.progress_per, scoring=self.scoring, common_terms=self.common_terms
progress_per=self.progress_per, scoring=self.scoring, connector_words=self.connector_words,
)

self.gensim_model.add_vocab(X)