Add optional encoding and errors parameters to LanguageModel constructor #26

Currently, the `LanguageModel` constructor in `wordninja.py` opens the word file with `gzip.open()` and offers no way to specify the file's encoding. As a result, users whose word files are not UTF-8 encoded may hit decoding errors when using the `wordninja` package.

To address this issue, I propose modifying the `__init__` function in `wordninja.py` to accept an optional `encoding` parameter that specifies the encoding of the word file. Additionally, I suggest adding an optional `errors` parameter so users can customize how decoding errors are handled.

Here's an example of what the modified function could look like:

```python
import gzip
from math import log

class LanguageModel(object):
    def __init__(self, word_file, encoding='utf-8', errors='strict'):
        # Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
        with gzip.open(word_file) as f:
            # Decode with the caller-supplied encoding and error handler.
            words = f.read().decode(encoding=encoding, errors=errors).split()
        self._wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
        self._maxword = max(len(x) for x in words)
```

With these optional parameters, users can specify the word file's encoding and how decoding errors are handled when they create a `LanguageModel` instance, so files in other encodings work without modifying the source code. The defaults (`'utf-8'` and `'strict'`) mean callers that pass only `word_file` do not need to change.
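To illustrate how this might be used, here is a minimal, self-contained sketch. The class below is a stand-in that contains only the proposed constructor, not the full `wordninja` class, and the file name and word list are made up for the example. It writes a small Latin-1 word file, then loads it three ways.

```python
import gzip
import os
import tempfile
from math import log


class LanguageModel(object):
    """Stand-in with only the proposed constructor (not the full wordninja class)."""

    def __init__(self, word_file, encoding='utf-8', errors='strict'):
        with gzip.open(word_file) as f:
            words = f.read().decode(encoding=encoding, errors=errors).split()
        self._wordcost = dict((k, log((i + 1) * log(len(words)))) for i, k in enumerate(words))
        self._maxword = max(len(x) for x in words)


# Hypothetical word list, written as Latin-1 so that it is not valid UTF-8.
path = os.path.join(tempfile.mkdtemp(), 'words_latin1.txt.gz')
with gzip.open(path, 'wb') as f:
    f.write(u'caf\xe9\nna\xefve\nthe\n'.encode('latin-1'))

# Default arguments (strict UTF-8 decoding) fail on this file.
try:
    LanguageModel(path)
    strict_utf8_failed = False
except UnicodeDecodeError:
    strict_utf8_failed = True

# Naming the actual encoding loads the words intact.
lm = LanguageModel(path, encoding='latin-1')

# Alternatively, keep UTF-8 but replace undecodable bytes with U+FFFD.
lm_lossy = LanguageModel(path, errors='replace')
```

Passing the correct `encoding` is the lossless option. `errors='replace'` or `errors='ignore'` lets the file load but changes the affected words, so those words will no longer match input text.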

I plan to submit a pull request with these changes. Please let me know if there are any concerns or suggestions for improvement.

Thank you.
