Skip to content

Conversation

gojomo
Copy link
Collaborator

@gojomo gojomo commented Aug 25, 2016

  • avoids an unnecessary duplicate giant dictionary in common case of a single add_vocab() call
  • adds new Phraser helper class & supporting options on existing methods

Phraser takes a Phrases and does a single (time-consuming) pass to discover all the Phrases that it would want to create, saving those into a much much smaller (and somewhat faster) helper object. This can be saved & used separately.

Needs more testing.

@piskvorky
Copy link
Owner

Note to self: use the faster detection loop in Python.

@lev good student project: implement the inner loops as C (via Cython) optional extension (optional ala word2vec extension).

@piskvorky
Copy link
Owner

@gojomo @lev ready to merge? What's missing?

@tmylk tmylk mentioned this pull request Oct 4, 2016
@tmylk
Copy link
Contributor

tmylk commented Oct 4, 2016

Created #918 asking for volunteers to create tests

@tmylk tmylk added the difficulty easy Easy issue: required small fix label Oct 4, 2016
@tmylk
Copy link
Contributor

tmylk commented Oct 16, 2016

Merged in #954

@tmylk tmylk closed this Oct 16, 2016
@tmylk tmylk deleted the Phrases__opt branch January 13, 2017 02:23
@tmylk
Copy link
Contributor

tmylk commented Jan 13, 2017

@gojomo Should Phrases be deprecated? The existing warning about its slowness doesn't seem to deter users.

@gojomo
Copy link
Collaborator Author

gojomo commented Jan 14, 2017

@tmylk It's necessary to use a Phrases to construct a Phraser. Phraser takes a bunch of extra time to create, but then is slightly faster but much more compact. (The main benefit is memory, not speed, and certainly not speed-to-first-use, which is worse.)

Also, a Phraser essentially locks in the effective min_count and threshold values at the time of its creation – so if you want to try other values, you need to go back to a Phrases with the full frequency data.

So Phrases can't quite be deprecated or hidden... perhaps some renaming/refactoring and updated docs/examples would guide people to the best options for them. There's also the potential for using more memory-efficient approximate counts, to use less memory during initial Phrases analysis. I believe one or two folks have started down that road before, but their PRs never reached mergeable state. If there's someone with the interest/skills to tackle that, they might want to, at the same time, try to rationalize the interfaces/names for better clarity.

@piskvorky
Copy link
Owner

piskvorky commented Jan 15, 2017

@tmylk a fast C/Cython implementation of Phrases, ideally multicore, would be a great student project.

The entire "training" is essentially incrementing a counter, no reason it shouldn't be as fast as your input iterator provides.

Plus, phrases (collocations) are not going anywhere, so it's a stable module to invest more time into.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty easy Easy issue: required small fix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants