diff --git a/docs/api/python/index.md b/docs/api/python/index.md index 75ff186fd81d..7a3ad7c03c64 100644 --- a/docs/api/python/index.md +++ b/docs/api/python/index.md @@ -98,6 +98,15 @@ imported by running: io/io.md ``` +## Text API + +```eval_rst +.. toctree:: + :maxdepth: 1 + + text/text.md +``` + ## Image API ```eval_rst diff --git a/docs/api/python/text/text.md b/docs/api/python/text/text.md new file mode 100644 index 000000000000..3b70b76d94d3 --- /dev/null +++ b/docs/api/python/text/text.md @@ -0,0 +1,455 @@ +# Text API + +## Overview + +The mxnet.contrib.text APIs refer to classes and functions related to text data +processing, such as bulding indices and loading pre-trained embedding vectors +for text tokens and storing them in the `mxnet.ndarray.NDArray` format. + +```eval_rst +.. warning:: This package contains experimental APIs and may change in the near future. +``` + +This document lists the text APIs in mxnet: + +```eval_rst +.. autosummary:: + :nosignatures: + + mxnet.contrib.text.glossary + mxnet.contrib.text.embedding + mxnet.contrib.text.indexer + mxnet.contrib.text.utils +``` + +All the code demonstrated in this document assumes that the following modules +or packages are imported. + +```python +>>> from mxnet import gluon +>>> from mxnet import nd +>>> from mxnet.contrib import text +>>> import collections + +``` + +### Look up pre-trained word embeddings for indexed words + +As a common use case, let us look up pre-trained word embedding vectors for +indexed words in just a few lines of code. To begin with, we can create a +fastText word embedding object by specifying the embedding name `fasttext` and +the pre-trained file `wiki.simple.vec`. + +```python +>>> fasttext_simple = text.embedding.TokenEmbedding.create('fasttext', +... pretrained_file_name='wiki.simple.vec') + +``` + +Suppose that we have a simple text data set in the string format. We can count +word frequency in the data set. + +```python +>>> text_data = " hello world \n hello nice world \n hi world \n" +>>> counter = text.utils.count_tokens_from_str(text_data) + +``` + +The obtained `counter` has key-value pairs whose keys are words and values are +word frequencies. Suppose that we want to build indices for all the keys in +`counter` and load the defined fastText word embedding for all such indexed +words. First, we need a TokenIndexer object with `counter` as its argument + +```python +>>> token_indexer = text.indexer.TokenIndexer(counter) + +``` + +Then, we can create a Glossary object by specifying `token_indexer` and `fasttext_simple` as its +arguments. + +```python +>>> glossary = text.glossary.Glossary(token_indexer, fasttext_simple) + +``` + +Now we are ready to look up the fastText word embedding vectors for indexed +words. + +```python +>>> glossary.get_vecs_by_tokens(['hello', 'world']) + +[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01 + ... + -7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02] + [ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01 + ... + -3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]] + + +``` + +### Use `glossary` in `gluon` + +To demonstrate how to use a glossary with the loaded word embedding in the +`gluon` package, let us first obtain indices of the words 'hello' and 'world'. 
+ +```python +>>> glossary.to_indices(['hello', 'world']) +[2, 1] + +``` + +We can obtain the vector representation for the words 'hello' and 'world' +by specifying their indices (2 and 1) and the `glossary.idx_to_vec` in +`mxnet.gluon.nn.Embedding`. + +```python +>>> layer = gluon.nn.Embedding(len(glossary), glossary.vec_len) +>>> layer.initialize() +>>> layer.weight.set_data(glossary.idx_to_vec) +>>> layer(nd.array([2, 1])) + +[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01 + ... + -7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02] + [ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01 + ... + -3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]] + + +``` + + +## Glossary + +The glossary provides indexing and embedding for text tokens in a glossary. For +each indexed token in a glossary, an embedding vector will be associated with +it. Such embedding vectors can be loaded from externally hosted or custom +pre-trained token embedding files, such as via instances of +[`TokenEmbedding`](#mxnet.contrib.text.embedding.TokenEmbedding). +The input counter whose keys are +candidate indices may be obtained via +[`count_tokens_from_str`](#mxnet.contrib.text.utils.count_tokens_from_str). + +```eval_rst +.. currentmodule:: mxnet.contrib.text.glossary +.. autosummary:: + :nosignatures: + + Glossary +``` + +To get all the valid names for pre-trained embeddings and files, we can use +[`TokenEmbedding.get_embedding_and_pretrained_file_names`](#mxnet.contrib.text.embedding.TokenEmbedding.get_embedding_and_pretrained_file_names). + +```python +>>> text.embedding.TokenEmbedding.get_embedding_and_pretrained_file_names() +{'glove': ['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', +'glove.6B.200d.txt', 'glove.6B.300d.txt', 'glove.840B.300d.txt', +'glove.twitter.27B.25d.txt', 'glove.twitter.27B.50d.txt', +'glove.twitter.27B.100d.txt', 'glove.twitter.27B.200d.txt'], +'fasttext': ['wiki.en.vec', 'wiki.simple.vec', 'wiki.zh.vec']} + +``` + +To begin with, we can create a fastText word embedding object by specifying the +embedding name `fasttext` and the pre-trained file `wiki.simple.vec`. + +```python +>>> fasttext_simple = text.embedding.TokenEmbedding.create('fasttext', +... pretrained_file_name='wiki.simple.vec') + +``` + +Suppose that we have a simple text data set in the string format. We can count +word frequency in the data set. + +```python +>>> text_data = " hello world \n hello nice world \n hi world \n" +>>> counter = text.utils.count_tokens_from_str(text_data) + +``` + +The obtained `counter` has key-value pairs whose keys are words and values are +word frequencies. Suppose that we want to build indices for the most frequent 2 +keys in `counter` and load the defined fastText word embedding for all these +2 words. + +```python +>>> token_indexer = text.indexer.TokenIndexer(counter, most_freq_count=2) +>>> glossary = text.glossary.Glossary(token_indexer, fasttext_simple) + +``` + +Now we are ready to look up the fastText word embedding vectors for indexed +words. + +```python +>>> glossary.get_vecs_by_tokens(['hello', 'world']) + +[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01 + ... + -7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02] + [ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01 + ... 
+ -3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]] + + +``` + +We can also access properties such as `token_to_idx` (mapping tokens to +indices), `idx_to_token` (mapping indices to tokens), and `vec_len` +(length of each embedding vector). + +```python +>>> glossary.token_to_idx +{'': 0, 'world': 1, 'hello': 2, 'hi': 3, 'nice': 4} +>>> glossary.idx_to_token +['', 'world', 'hello', 'hi', 'nice'] +>>> len(glossary) +5 +>>> glossary.vec_len +300 + +``` + +If a token is unknown to `glossary`, its embedding vector is initialized +according to the default specification in `fasttext_simple` (all elements are +0). + +```python + +>>> glossary.get_vecs_by_tokens('unknownT0kEN') + +[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. + ... + 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] + + +``` + +## Text token embedding + +The text token embedding builds indices for text tokens. Such indexed tokens can +be used by instances of [`TokenEmbedding`](#mxnet.contrib.text.embedding.TokenEmbedding) +and [`Glossary`](#mxnet.contrib.text.glossary.Glossary). + +To load token embeddings from an externally hosted pre-trained token embedding +file, such as those of GloVe and FastText, use +[`TokenEmbedding.create(embedding_name, pretrained_file_name)`](#mxnet.contrib.text.embedding.TokenEmbedding.create). +To get all the available `embedding_name` and `pretrained_file_name`, use +[`TokenEmbedding.get_embedding_and_pretrained_file_names()`](#mxnet.contrib.text.embedding.TokenEmbedding.get_embedding_and_pretrained_file_names). + +Alternatively, to load embedding vectors from a custom pre-trained text token +embedding file, use [`CustomEmbedding`](#mxnet.contrib.text.embedding.CustomEmbedding). + + +```eval_rst +.. currentmodule:: mxnet.contrib.text.embedding +.. autosummary:: + :nosignatures: + + TokenEmbedding + GloVe + FastText + CustomEmbedding +``` + +To get all the valid names for pre-trained embeddings and files, we can use +[`TokenEmbedding.get_embedding_and_pretrained_file_names`](#mxnet.contrib.text.embedding.TokenEmbedding.get_embedding_and_pretrained_file_names). + +```python +>>> text.embedding.TokenEmbedding.get_embedding_and_pretrained_file_names() +{'glove': ['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', +'glove.6B.200d.txt', 'glove.6B.300d.txt', 'glove.840B.300d.txt', +'glove.twitter.27B.25d.txt', 'glove.twitter.27B.50d.txt', +'glove.twitter.27B.100d.txt', 'glove.twitter.27B.200d.txt'], +'fasttext': ['wiki.en.vec', 'wiki.simple.vec', 'wiki.zh.vec']} + +``` + +To begin with, we can create a GloVe word embedding object by specifying the +embedding name `glove` and the pre-trained file `glove.6B.50d.txt`. The +argument `init_unknown_vec` specifies default vector representation for any +unknown token. + +```python +>>> glove_6b_50d = text.embedding.TokenEmbedding.create('glove', +... pretrained_file_name='glove.6B.50d.txt', init_unknown_vec=nd.zeros) + +``` + +We can access properties such as `token_to_idx` (mapping tokens to indices), +`idx_to_token` (mapping indices to tokens), `vec_len` (length of each embedding +vector), and `unknown_token` (representation of any unknown token, default +value is ''). 
+ +```python +>>> glove_6b_50d.token_to_idx['hi'] +11084 +>>> glove_6b_50d.idx_to_token[11084] +'hi' +>>> glove_6b_50d.vec_len +50 +>>> glove_6b_50d.unknown_token +'' + +``` + +For every unknown token, if its representation '' is encountered in the +pre-trained token embedding file, index 0 of property `idx_to_vec` maps to the +pre-trained token embedding vector loaded from the file; otherwise, index 0 of +property `idx_to_vec` maps to the default token embedding vector specified via +`init_unknown_vec` (set to nd.zeros here). Since the pre-trained file +does not have a vector for the token '', index 0 has to map to an +additional token '' and the number of tokens in the embedding is 400,001. + + +```python +>>> len(glove_6b_50d) +400001 +>>> glove_6b_50d.idx_to_vec[0] + +[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. + ... + 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] + +>>> glove_6b_50d.get_vecs_by_tokens('unknownT0kEN') + +[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. + ... + 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] + +>>> glove_6b_50d.get_vecs_by_tokens(['unknownT0kEN', 'unknownT0kEN']) + +[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. + ... + 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] + [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. + ... + 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]] + + +``` + + +### Implement a new text token embedding + +For ``optimizer``, create a subclass of +[`TokenEmbedding`](#mxnet.contrib.text.embedding.TokenEmbedding). +Also add ``@TokenEmbedding.register`` before this class. See +[`embedding.py`](https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/text/embedding.py) +for examples. + + +## Text token indexer + +The text token indexer builds indices for text tokens. Such indexed tokens can +be used by instances of [`TokenEmbedding`](#mxnet.contrib.text.embedding.TokenEmbedding) +and [`Glossary`](#mxnet.contrib.text.glossary.Glossary). The input +counter whose keys are candidate indices may be obtained via +[`count_tokens_from_str`](#mxnet.contrib.text.utils.count_tokens_from_str). + + +```eval_rst +.. currentmodule:: mxnet.contrib.text.indexer +.. autosummary:: + :nosignatures: + + TokenIndexer +``` + +Suppose that we have a simple text data set in the string format. We can count +word frequency in the data set. + +```python +>>> text_data = " hello world \n hello nice world \n hi world \n" +>>> counter = text.utils.count_tokens_from_str(text_data) + +``` + +The obtained `counter` has key-value pairs whose keys are words and values are +word frequencies. Suppose that we want to build indices for the 2 most frequent +keys in `counter` with the unknown token representation '' and a reserved +token ''. + +```python +>>> token_indexer = text.indexer.TokenIndexer(counter, most_freq_count=2, +... unknown_token='', reserved_tokens=['']) + +``` + +We can access properties such as `token_to_idx` (mapping tokens to indices), +`idx_to_token` (mapping indices to tokens), `vec_len` (length of each embedding +vector), and `unknown_token` (representation of any unknown token) and +`reserved_tokens`. + +```python +>>> token_indexer = text.indexer.TokenIndexer(counter, most_freq_count=2, +... 
unknown_token='', reserved_tokens=['']) + +``` + +```python +>>> token_indexer.token_to_idx +{'': 0, '': 1, 'world': 2, 'hello': 3} +>>> token_indexer.idx_to_token +['', '', 'world', 'hello'] +>>> token_indexer.unknown_token +'' +>>> token_indexer.reserved_tokens +[''] +>>> len(token_indexer) +4 +``` + +Besides the specified unknown token '' and reserved_token '' are +indexed, the 2 most frequent words 'world' and 'hello' are also indexed. + + + +## Text utilities + +The following functions provide utilities for text data processing. + +```eval_rst +.. currentmodule:: mxnet.contrib.text.utils +.. autosummary:: + :nosignatures: + + count_tokens_from_str +``` + + + + +## API Reference + + + +```eval_rst + +.. automodule:: mxnet.contrib.text.glossary +.. autoclass:: mxnet.contrib.text.glossary.Glossary + :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens + +.. automodule:: mxnet.contrib.text.embedding +.. autoclass:: mxnet.contrib.text.embedding.TokenEmbedding + :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens, register, create, get_embedding_and_pretrained_file_names +.. autoclass:: mxnet.contrib.text.embedding.GloVe + :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens +.. autoclass:: mxnet.contrib.text.embedding.FastText + :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens +.. autoclass:: mxnet.contrib.text.embedding.CustomEmbedding + :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens + +.. automodule:: mxnet.contrib.text.indexer +.. autoclass:: mxnet.contrib.text.indexer.TokenIndexer + :members: to_indices, to_tokens + +.. automodule:: mxnet.contrib.text.utils + :members: count_tokens_from_str + +``` + \ No newline at end of file diff --git a/python/mxnet/contrib/text/embedding.py b/python/mxnet/contrib/text/embedding.py index adba86722390..54635f1e8cf9 100644 --- a/python/mxnet/contrib/text/embedding.py +++ b/python/mxnet/contrib/text/embedding.py @@ -45,7 +45,7 @@ class TokenEmbedding(indexer.TokenIndexer): `TokenEmbedding.get_embedding_and_pretrained_file_names()`. Alternatively, to load embedding vectors from a custom pre-trained token embedding file, use - :class:`~mxnet.text.embedding.CustomEmbedding`. + :class:`~mxnet.contrib.text.embedding.CustomEmbedding`. For every unknown token, if its representation `self.unknown_token` is encountered in the pre-trained token embedding file, index 0 of `self.idx_to_vec` maps to the pre-trained token @@ -56,7 +56,7 @@ class TokenEmbedding(indexer.TokenIndexer): first-encountered token embedding vector will be loaded and the rest will be skipped. For the same token, its index and embedding vector may vary across different instances of - :class:`~mxnet.text.embedding.TokenEmbedding`. + :class:`~mxnet.contrib.text.embedding.TokenEmbedding`. Properties @@ -298,16 +298,16 @@ def register(embedding_cls): Once an embedding is registered, we can create an instance of this embedding with - :func:`~mxnet.text.embedding.TokenEmbedding.create`. + :func:`~mxnet.contrib.text.embedding.TokenEmbedding.create`. Examples -------- - >>> @mxnet.text.embedding.TokenEmbedding.register - ... class MyTextEmbed(mxnet.text.embedding.TokenEmbedding): + >>> @mxnet.contrib.text.embedding.TokenEmbedding.register + ... class MyTextEmbed(mxnet.contrib.text.embedding.TokenEmbedding): ... def __init__(self, pretrained_file_name='my_pretrain_file'): ... 
pass - >>> embed = mxnet.text.embedding.TokenEmbedding.create('MyTokenEmbed') + >>> embed = mxnet.contrib.text.embedding.TokenEmbedding.create('MyTokenEmbed') >>> print(type(embed)) """ @@ -317,13 +317,13 @@ def register(embedding_cls): @staticmethod def create(embedding_name, **kwargs): - """Creates an instance of :class:`~mxnet.text.embedding.TokenEmbedding`. + """Creates an instance of :class:`~mxnet.contrib.text.embedding.TokenEmbedding`. Creates a token embedding instance by loading embedding vectors from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText. To get all the valid `embedding_name` and `pretrained_file_name`, use - `mxnet.text.embedding.TokenEmbedding.get_embedding_and_pretrained_file_names()`. + `mxnet.contrib.text.embedding.TokenEmbedding.get_embedding_and_pretrained_file_names()`. Parameters @@ -334,7 +334,7 @@ def create(embedding_name, **kwargs): Returns ------- - :class:`~mxnet.text.glossary.TokenEmbedding`: + :class:`~mxnet.contrib.text.glossary.TokenEmbedding`: A token embedding instance that loads embedding vectors from an externally hosted pre-trained token embedding file. """ @@ -367,8 +367,8 @@ def get_embedding_and_pretrained_file_names(embedding_name=None): To load token embedding vectors from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText, one should use - `mxnet.text.embedding.TokenEmbedding.create(embedding_name, pretrained_file_name)`. This - method returns all the valid names of `pretrained_file_name` for the specified + `mxnet.contrib.text.embedding.TokenEmbedding.create(embedding_name, pretrained_file_name)`. + This method returns all the valid names of `pretrained_file_name` for the specified `embedding_name`. If `embedding_name` is set to None, this method returns all the valid names of `embedding_name` with associated `pretrained_file_name`. @@ -386,7 +386,8 @@ def get_embedding_and_pretrained_file_names(embedding_name=None): for the specified token embedding name (`embedding_name`). If the text embeding name is set to None, returns a dict mapping each valid token embedding name to a list of valid pre-trained files (`pretrained_file_name`). They can be plugged into - `mxnet.text.embedding.TokenEmbedding.create(embedding_name, pretrained_file_name)`. + `mxnet.contrib.text.embedding.TokenEmbedding.create(embedding_name, + pretrained_file_name)`. """ text_embedding_reg = registry.get_registry(TokenEmbedding) diff --git a/python/mxnet/contrib/text/glossary.py b/python/mxnet/contrib/text/glossary.py index 2fd46a39241e..40f325830174 100644 --- a/python/mxnet/contrib/text/glossary.py +++ b/python/mxnet/contrib/text/glossary.py @@ -16,12 +16,14 @@ # under the License. # coding: utf-8 +# pylint: disable=super-init-not-called """Index text tokens and load their embeddings.""" from __future__ import absolute_import from __future__ import print_function from . import embedding +from . import indexer from ... import ndarray as nd @@ -31,35 +33,16 @@ class Glossary(embedding.TokenEmbedding): For each indexed token in a glossary, an embedding vector will be associated with it. Such embedding vectors can be loaded from externally hosted or custom pre-trained token embedding - files, such as via instances of :class:`~mxnet.text.embedding.TokenEmbedding`. + files, such as via instances of :class:`~mxnet.contrib.text.embedding.TokenEmbedding`. Parameters ---------- - counter : collections.Counter or None, default None - Counts text token frequencies in the text data. 
Its keys will be indexed according to - frequency thresholds such as `most_freq_count` and `min_freq`. Keys of `counter`, - `unknown_token`, and values of `reserved_tokens` must be of the same hashable type. - Examples: str, int, and tuple. + token_indexer : :class:`~mxnet.contrib.text.indexer.TokenIndexer` + It contains the indexed tokens to load, where each token is associated with an index. token_embeddings : instance or list of :class:`~TokenEmbedding` One or multiple pre-trained token embeddings to load. If it is a list of multiple embeddings, these embedding vectors will be concatenated for each token. - most_freq_count : None or int, default None - The maximum possible number of the most frequent tokens in the keys of `counter` that can be - indexed. Note that this argument does not count any token from `reserved_tokens`. If this - argument is None or larger than its largest possible value restricted by `counter` and - `reserved_tokens`, this argument becomes positive infinity. - min_freq : int, default 1 - The minimum frequency required for a token in the keys of `counter` to be indexed. - unknown_token : hashable object, default '' - The representation for any unknown token. In other words, any unknown token will be indexed - as the same representation. Keys of `counter`, `unknown_token`, and values of - `reserved_tokens` must be of the same hashable type. Examples: str, int, and tuple. - reserved_tokens : list of hashable objects or None, default None - A list of reserved tokens that will always be indexed, such as special symbols representing - padding, beginning of sentence, and end of sentence. It cannot contain `unknown_token`, or - duplicate reserved tokens. Keys of `counter`, `unknown_token`, and values of - `reserved_tokens` must be of the same hashable type. Examples: str, int, and tuple. Properties @@ -80,23 +63,30 @@ class Glossary(embedding.TokenEmbedding): embedding vector. The largest valid index maps to the initialized embedding vector for every reserved token, such as an unknown_token token and a padding token. """ - def __init__(self, counter, token_embeddings, most_freq_count=None, min_freq=1, - unknown_token='', reserved_tokens=None): + def __init__(self, token_indexer, token_embeddings): + + # Sanity checks. + assert isinstance(token_indexer, indexer.TokenIndexer), \ + 'The argument `token_indexer` must be an instance of ' \ + 'mxnet.contrib.text.indexer.TokenIndexer.' if not isinstance(token_embeddings, list): token_embeddings = [token_embeddings] - # Sanity checks. for embed in token_embeddings: assert isinstance(embed, embedding.TokenEmbedding), \ - 'The parameter `token_embeddings` must be an instance or a list of instances ' \ - 'of `mxnet.text.embedding.TextEmbed` whose embedding vectors will be loaded or ' \ - 'concatenated-then-loaded to map to the indexed tokens.' - - # Index tokens from keys of `counter` and reserved tokens. - super(Glossary, self).__init__(counter=counter, most_freq_count=most_freq_count, - min_freq=min_freq, unknown_token=unknown_token, - reserved_tokens=reserved_tokens) + 'The argument `token_embeddings` must be an instance or a list of instances ' \ + 'of `mxnet.contrib.text.embedding.TextEmbedding` whose embedding vectors will be' \ + 'loaded or concatenated-then-loaded to map to the indexed tokens.' + + # Index tokens. 
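+        # Copy the vocabulary built by `token_indexer`: the token-to-index and
+        # index-to-token mappings, the unknown token, and any reserved tokens,
+        # so that this Glossary shares the indexer's vocabulary.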
+ self._token_to_idx = token_indexer.token_to_idx.copy() \ + if token_indexer.token_to_idx is not None else None + self._idx_to_token = token_indexer.idx_to_token[:] \ + if token_indexer.idx_to_token is not None else None + self._unknown_token = token_indexer.unknown_token + self._reserved_tokens = token_indexer.reserved_tokens[:] \ + if token_indexer.reserved_tokens is not None else None # Set _idx_to_vec so that indices of tokens from keys of `counter` are # associated with token embedding vectors from `token_embeddings`. @@ -109,7 +99,7 @@ def _set_idx_to_vec_by_embeds(self, token_embeddings): Parameters ---------- token_embeddings : an instance or a list of instances of - :class:`~mxnet.text.embedding.TokenEmbedding` + :class:`~mxnet.contrib.text.embedding.TokenEmbedding` One or multiple pre-trained token embeddings to load. If it is a list of multiple embeddings, these embedding vectors will be concatenated for each token. """ diff --git a/python/mxnet/contrib/text/indexer.py b/python/mxnet/contrib/text/indexer.py index 409dfb0bb229..1add7cf26719 100644 --- a/python/mxnet/contrib/text/indexer.py +++ b/python/mxnet/contrib/text/indexer.py @@ -32,8 +32,8 @@ class TokenIndexer(object): Build indices for the unknown token, reserved tokens, and input counter keys. Indexed tokens can - be used by instances of :class:`~mxnet.text.embedding.TokenEmbedding`, such as instances of - :class:`~mxnet.text.glossary.Glossary`. + be used by instances of :class:`~mxnet.contrib.text.embedding.TokenEmbedding`, such as instances + of :class:`~mxnet.contrib.text.glossary.Glossary`. Parameters diff --git a/tests/python/unittest/test_contrib_text.py b/tests/python/unittest/test_contrib_text.py index 99423aa7d547..dc0e7bc06c57 100644 --- a/tests/python/unittest/test_contrib_text.py +++ b/tests/python/unittest/test_contrib_text.py @@ -422,8 +422,9 @@ def test_glossary_with_one_embed(): counter = Counter(['a', 'b', 'b', 'c', 'c', 'c', 'some_word$']) - g1 = text.glossary.Glossary(counter, my_embed, most_freq_count=None, min_freq=1, - unknown_token='', reserved_tokens=['']) + i1 = text.indexer.TokenIndexer(counter, most_freq_count=None, min_freq=1, unknown_token='', + reserved_tokens=['']) + g1 = text.glossary.Glossary(i1, my_embed) assert g1.token_to_idx == {'': 0, '': 1, 'c': 2, 'b': 3, 'a': 4, 'some_word$': 5} assert g1.idx_to_token == ['', '', 'c', 'b', 'a', 'some_word$'] @@ -546,8 +547,9 @@ def test_glossary_with_two_embeds(): counter = Counter(['a', 'b', 'b', 'c', 'c', 'c', 'some_word$']) - g1 = text.glossary.Glossary(counter, [my_embed1, my_embed2], most_freq_count=None, min_freq=1, - unknown_token='', reserved_tokens=None) + i1 = text.indexer.TokenIndexer(counter, most_freq_count=None, min_freq=1, unknown_token='', + reserved_tokens=None) + g1 = text.glossary.Glossary(i1, [my_embed1, my_embed2]) assert g1.token_to_idx == {'': 0, 'c': 1, 'b': 2, 'a': 3, 'some_word$': 4} assert g1.idx_to_token == ['', 'c', 'b', 'a', 'some_word$'] @@ -599,8 +601,9 @@ def test_glossary_with_two_embeds(): my_embed4 = text.embedding.CustomEmbedding(pretrain_file_path4, elem_delim, unknown_token='') - g2 = text.glossary.Glossary(counter, [my_embed3, my_embed4], most_freq_count=None, min_freq=1, - unknown_token='', reserved_tokens=None) + i2 = text.indexer.TokenIndexer(counter, most_freq_count=None, min_freq=1, unknown_token='', + reserved_tokens=None) + g2 = text.glossary.Glossary(i2, [my_embed3, my_embed4]) assert_almost_equal(g2.idx_to_vec.asnumpy(), np.array([[1.1, 1.2, 1.3, 1.4, 1.5, 0.11, 0.12, 0.13, 0.14, 0.15], @@ -614,8 
+617,9 @@ def test_glossary_with_two_embeds(): 0.11, 0.12, 0.13, 0.14, 0.15]]) ) - g3 = text.glossary.Glossary(counter, [my_embed3, my_embed4], most_freq_count=None, min_freq=1, - unknown_token='', reserved_tokens=None) + i3 = text.indexer.TokenIndexer(counter, most_freq_count=None, min_freq=1, + unknown_token='', reserved_tokens=None) + g3 = text.glossary.Glossary(i3, [my_embed3, my_embed4]) assert_almost_equal(g3.idx_to_vec.asnumpy(), np.array([[1.1, 1.2, 1.3, 1.4, 1.5, 0.11, 0.12, 0.13, 0.14, 0.15], @@ -629,8 +633,9 @@ def test_glossary_with_two_embeds(): 0.11, 0.12, 0.13, 0.14, 0.15]]) ) - g4 = text.glossary.Glossary(counter, [my_embed3, my_embed4],most_freq_count=None, min_freq=1, - unknown_token='', reserved_tokens=None) + i4 = text.indexer.TokenIndexer(counter, most_freq_count=None, min_freq=1, + unknown_token='', reserved_tokens=None) + g4 = text.glossary.Glossary(i4, [my_embed3, my_embed4]) assert_almost_equal(g4.idx_to_vec.asnumpy(), np.array([[1.1, 1.2, 1.3, 1.4, 1.5, 0.11, 0.12, 0.13, 0.14, 0.15], @@ -646,8 +651,9 @@ def test_glossary_with_two_embeds(): counter2 = Counter(['b', 'b', 'c', 'c', 'c', 'some_word$']) - g5 = text.glossary.Glossary(counter2, [my_embed3, my_embed4], most_freq_count=None, min_freq=1, - unknown_token='a', reserved_tokens=None) + i5 = text.indexer.TokenIndexer(counter2, most_freq_count=None, min_freq=1, unknown_token='a', + reserved_tokens=None) + g5 = text.glossary.Glossary(i5, [my_embed3, my_embed4]) assert g5.token_to_idx == {'a': 0, 'c': 1, 'b': 2, 'some_word$': 3} assert g5.idx_to_token == ['a', 'c', 'b', 'some_word$'] assert_almost_equal(g5.idx_to_vec.asnumpy(),