Conversation

Contributor

@pkuyym pkuyym commented Sep 19, 2017

fix #297

@pkuyym pkuyym requested a review from kuke September 19, 2017 14:04
Collaborator

@kuke kuke left a comment

Almost LGTM

* English punctuations and chinese punctuations are removed.
* Insert a whitespace character between two tokens.

Please notice that the released language model only contains chinese simplified characters. When preprocessing done we can begin to train the language model. The key training parameters are '-o 5 --prune 0 1 2 4 4'. Please refer above section for the meaning of each parameter. We also convert the arpa file to binary file using default settings.
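
As a rough sketch of the training step quoted above, and assuming the KenLM toolkit (`lmplz` and `build_binary`) introduced earlier in the document with its binaries on the PATH, the commands could be driven as follows; the corpus and output file names are placeholders, not the ones used for the released model:

```python
import subprocess

# Train a 5-gram ARPA model; --prune 0 1 2 4 4 drops n-grams whose counts
# fall at or below the per-order thresholds (no unigram pruning).
with open("corpus.txt", "rb") as text, open("zh_lm.arpa", "wb") as arpa:
    subprocess.run(
        ["lmplz", "-o", "5", "--prune", "0", "1", "2", "4", "4"],
        stdin=text, stdout=arpa, check=True)

# Convert the ARPA file to KenLM's binary format with default settings.
subprocess.run(["build_binary", "zh_lm.arpa", "zh_lm.klm"], check=True)
```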
Collaborator

chinese-->Chinese
When --> After
parameters/parameters-->arguments/argument

Contributor Author

Done.


* The beginning and trailing whitespace characters are removed.
* English punctuations and chinese punctuations are removed.
* Insert a whitespace character between two tokens.
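
The preprocessing steps listed above could look roughly like the following in Python; this is only an illustration of the stated rules (the exact punctuation set and script used for the released model may differ), treating each Chinese character as one token:

```python
import re

def preprocess_line(line):
    """Illustrative version of the three preprocessing steps listed above."""
    line = line.strip()                  # drop beginning and trailing whitespace
    line = re.sub(r"[^\w\s]", "", line)  # remove English and Chinese punctuation
    tokens = [ch for ch in line if not ch.isspace()]
    return " ".join(tokens)              # one whitespace character between tokens

print(preprocess_line("  你好，世界。  "))  # -> 你 好 世 界
```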
Collaborator

Insert a whitespace character between two tokens. --> A whitespace character between two tokens is inserted. (for consistency)

Contributor Author

Done.

TODO: any other requirements or tips to add?
#### Mandarin LM

Different from word-based language model, mandarin language model is character-based where each token is a chinese character. We use an internal corpus to train the released mandarin language model. This corpus contains billions of tokens. The preprocessing has small difference from english language model and all steps are:
Collaborator

Different from word-based language model-->Different from English language model
english-->English
chinese-->Chinese
small-->tiny
all steps are-->main steps include

Contributor Author

Done.

Collaborator

@kuke kuke left a comment

LGTM

@pkuyym pkuyym merged commit 88edc4c into PaddlePaddle:develop Sep 21, 2017
