Skip to content

Commit

Permalink
added set_expansion.rst
Browse files Browse the repository at this point in the history
  • Loading branch information
shira-g committed Jul 25, 2018
1 parent 8cbcfdb commit f2e9ef8
Show file tree
Hide file tree
Showing 3 changed files with 124 additions and 0 deletions.
Binary file added doc/source/assets/expansion_demo.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
7 changes: 7 additions & 0 deletions doc/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -124,6 +124,13 @@ on this project, please see the :doc:`developer guide <developer_guide>`.
memn2n.rst
kvmemn2n.rst

.. toctree::
:hidden:
:maxdepth: 1
:caption: Solutions

set_expansion.rst


.. toctree::
:hidden:
Expand Down
117 changes: 117 additions & 0 deletions doc/source/set_expansion.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,117 @@
.. ---------------------------------------------------------------------------
.. Copyright 2016-2018 Intel Corporation
..
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
..
.. http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.
.. ---------------------------------------------------------------------------
Set Expansion Solution
######################

Overview
========
Term set expansion is the task of expanding a given partial set of terms into
a more complete set of terms that belong to the same semantic class. This
solution demonstrates the capability of a corpus-based set expansion system
in a simple web application.

.. image :: assets/expansion_demo.png
Algorithm Overview
==================
Our approach is based on representing any term of a training corpus using word embeddings in order
to estimate the similarity between the seed terms and any candidate term. Noun phrases provide
good approximation for candidate terms and are extracted in our system using a noun phrase chunker.
At expansion time, given a seed of terms, the most similar terms are returned.


Training
========

The first step in training is to prepare the data for generating a word embedding model. We
provide a subset of English Wikipedia at datasets/wikipedia as a sample corpus under the
`Creative Commons Attribution-Share-Alike 3.0 License <https://creativecommons.org/licenses/by-sa/3.0/>`__.
This is done by running:

.. code:: python
python solutions/set_expansion/prepare_data.py --corpus TRAINING_CORPUS --marked_corpus MARKED_TRAINING_CORPUS
The next step is to train the model using `NLP Architect np2vec module <http://nlp_architect.nervanasys.com/np2vec.html>`__.
For set expansion, we recommend the following values 100, 10, 10, 0 for respectively,
size, min_count, window and hs hyperparameters.

.. code:: python
python examples/np2vec/train.py --size 100 --min_count 10 --window 10 --hs 0 --corpus MARKED_TRAINING_CORPUS --np2vec_model_file MODEL_PATH --corpus_format txt
A pretrained model on English Wikipedia dump (enwiki-20171201-pages-articles-multistream.xml
.bz2) is provided at ... under Apache2.0 license. It has been trained with hyperparameters values
recommended above.


Inference
=========

It consists in expanding the seed terms. It can be done in two ways:

1. Running a python script:

.. code:: python
python solutions/set_expansion/set_expand.py --np2vec_model_file MODEL_PATH --topn TOPN
2. Web application

A. Loading the expand server with the trained model:

.. code:: python
python expand_server.py [--host HOST] [--port PORT] model_path
The expand server gets requests containing seed terms, and expands them
based on the given word embedding model. You can use the model you trained
yourself in the previous step, or to provide a pre-trained model you own.
**Important note**: default server
will listen on localhost:1234. If you set the host/port you should also
set it in the ui/settings.py file.


B. Run the UI application:

.. code:: python
bokeh serve --show ui
The UI is a simple web based application for performing expansion.
The application communicates with the server by sending expand
requests, present the results in a simple table and export them to a csv
file. It allows you to either directly type the terms to expand or to
select terms from the model vocabulary list. After you get some expand
results you can perform re-expansion by selecting terms from the results (hold the Ctrl key for
multiple selection). **Important note**: If you set the host/port of the expand server you
should also set it in the ui/settings.py file. You can also load the ui
application as a server using the bokeh options --address and --port, for example:

.. code:: python
bokeh serve ui --address=12.13.14.15 --port=1010 --allow-websocket-origin=12.13.14.15:1010
Citation
========

`Term Set Expansion based on Multi-Context Term Embeddings: an End-to-end Workflow <https://drive.google.com/open?id=164MvUGo0-iPeuGM1b8XrH2ysZZFrzomF>`__, Jonathan Mamou,
Oren Pereg, Moshe Wasserblat, Ido Dagan, Yoav Goldberg, Alon Eirew, Yael Green, Shira Guskin,
Peter Izsak, Daniel Korat, COLING 2018.

0 comments on commit f2e9ef8

Please sign in to comment.