Off-line encoding #37

Open
wants to merge 5 commits into
base: main

Contributor

@cifkao cifkao commented Jun 20, 2018

This PR adds 2 scripts:

  • dump_sentences.py dumps all SentEval sentences to stdout.
  • eval_saved.py loads the saved sentences and the corresponding embeddings and runs SentEval on them.

This removes the need to run the encoding inside the batcher API, so encoding and evaluation can be done separately. The reasons for doing this are:

  • It's tricky to run the embedding model in the same process as SentEval, especially if the model uses a different framework (e.g. TensorFlow) or if the machine has only one GPU. It's easier to do the encoding off-line (separately from the evaluation).
  • You can run encoding on a large GPU and evaluation on a small GPU (possibly on a different machine) so that you don't waste resources.
  • You can save time by only encoding each sentence once. (SentEval has a lot of duplicate sentences.)
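To illustrate the last point, here is a minimal sketch of how deduplicated sentences could be mapped back to embedding rows at evaluation time. It is not the code from this PR: `build_index` and `rows_for_batch` are hypothetical helpers, and the sketch assumes each sentence was dumped as its tokens joined by single spaces and that row i of the embeddings belongs to line i of the sentence file.

```python
def build_index(sentences):
    """Map each unique sentence to its row number in the embedding matrix.

    Assumption: `sentences` is the deduplicated list, in the same order
    as the rows of the saved embeddings.
    """
    return {sentence: row for row, sentence in enumerate(sentences)}


def rows_for_batch(index, batch):
    """Return the embedding row for every sentence in a batch.

    `batch` is a list of token lists, which is how SentEval passes
    sentences to a batcher. Assumption: the dumped form of a sentence
    is its tokens joined by a single space.
    """
    return [index[" ".join(tokens)] for tokens in batch]


# Duplicates in a batch resolve to the same row, so each sentence
# only has to be encoded once.
index = build_index(["a cat sat", "the dog ran"])
rows = rows_for_batch(index, [["the", "dog", "ran"], ["a", "cat", "sat"], ["the", "dog", "ran"]])
```

A batcher built on this would then return `embeddings[rows]` from the loaded array instead of calling a model.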

Example usage (perhaps this should be described in the README):

python examples/dump_sentences.py | sort -u >senteval.txt
...  # run your model to get the embeddings for senteval.txt and save them to emb.npy
python examples/eval_saved.py senteval.txt emb.npy
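The elided middle step could look roughly like the sketch below. It assumes that emb.npy should hold a 2-D array with one row per line of senteval.txt, in the same order; the format eval_saved.py actually expects is not stated here, so check the script. `encode` is a hypothetical placeholder that returns random vectors and must be replaced with your own model.

```python
import numpy as np


def encode(sentences, dim=8):
    """Hypothetical stand-in for a real sentence encoder.

    Returns one random vector per sentence; replace with your model.
    """
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(sentences), dim)).astype(np.float32)


def embed_file(sentences_path, out_path):
    """Read one sentence per line, encode them all, and save the matrix."""
    with open(sentences_path, encoding="utf-8") as f:
        sentences = [line.rstrip("\n") for line in f]
    embeddings = encode(sentences)
    # Assumed layout: row i of the array corresponds to line i of the file.
    assert embeddings.shape[0] == len(sentences)
    np.save(out_path, embeddings)
    return embeddings.shape
```

For example, `embed_file("senteval.txt", "emb.npy")` would write the array that the last command above reads.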

@aconneau
Contributor

Hi,
thanks for the PR, that's indeed an interesting feature to have as an example.
I will look at the code soon.
Thanks,
Alexis

@aconneau aconneau closed this Jun 27, 2018
@aconneau aconneau reopened this Jun 27, 2018
@aconneau
Contributor

Oops, I closed this PR by accident. Just re-opened it.

3 participants