
Ability to specify custom tokenizer #131

@ibnesayeed

Description


Currently, the following code is used to split a document into tokens/words for training and classification.

str.gsub(/[^\p{WORD}\s]/, '').downcase.split
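For reference, here is what that splitter produces on a sample string (the input is illustrative only):

"New York, to be or not to be!".gsub(/[^\p{WORD}\s]/, '').downcase.split
# => ["new", "york", "to", "be", "or", "not", "to", "be"]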

This covers the general case, but there are situations where the user might want to customize the way a document is split into words. For example, tokenizing Japanese text is a whole different thing. Another situation where a custom tokenizer is needed is when the user wants to train the model on N-grams (for example, bi-grams such as New York). Splitting New York into New and York would mean New gets removed if it is present in the stopwords. Similarly, to be or not to be is another popular example of a significant phrase made up entirely of common stopwords. N-grams often play a significant role in contextualizing a document and can improve the accuracy of the model in special situations. In many languages (Arabic, Persian, Urdu, etc., to name a few) two or more words are combined (still separated by spaces, just placed together) to form various linguistic constructs. This can be important if one wants to identify the author of a relatively small piece of text, such as posts on forums.
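As a rough illustration (not existing code; the bigram_tokenize name and behaviour are only a sketch), an N-gram-aware tokenizer might emit unigrams plus adjacent word pairs, so that phrases like "new york" survive stopword filtering as single tokens:

# Sketch of a bigram-aware tokenizer: returns unigrams plus adjacent word pairs.
def bigram_tokenize(str)
  words = str.gsub(/[^\p{WORD}\s]/, '').downcase.split
  bigrams = words.each_cons(2).map { |a, b| "#{a} #{b}" }
  words + bigrams
end

bigram_tokenize("New York is big")
# => ["new", "york", "is", "big", "new york", "york is", "is big"]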

It would be nice if we could pass a lambda as the tokenizer at the time of classifier initialization, or provide some other more expressive means of telling the system how to split the text.
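For instance, something along these lines (purely hypothetical interface; the tokenizer: keyword argument is an assumption, not part of the current API):

# Hypothetical usage: the lambda receives the raw string and returns an array of tokens.
tokenizer = lambda do |str|
  str.gsub(/[^\p{WORD}\s]/, '').downcase.split
end

classifier = ClassifierReborn::Bayes.new('Interesting', 'Uninteresting', tokenizer: tokenizer)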
