Cannot load movie_lines.txt - 'utf-8' codec can't decode byte 0xad in position 3767: invalid start byte

Dear Luka,

Thanks for this repository. I am currently learning from it, and I hit the following error right at the start, when loading the dataset:

```
sentences = {}
with open('cornell movie-dialogs corpus/movie_lines.txt', 'r') as f:
    for line in f.readlines():
        sentences[line.split(' +++$+++ ')[0]] = line.split(' +++$+++ ')[-1].replace('\n', "")
```
It raises this error:

```
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-35-66409f9e14a9> in <module>()
      1 sentences = {}
      2 with open('cornell movie-dialogs corpus/movie_lines.txt', 'r') as f:
----> 3     for line in f.readlines():
      4         sentences[line.split(' +++$+++ ')[0]] = line.split(' +++$+++ ')[-1].replace('\n', "")

//anaconda/envs/tensorflow/lib/python3.5/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3767: invalid start byte
```

Even if I download the text files `movie_answers_2.txt` and `movie_questions_2.txt` directly from your repo and pass `encoding='utf-8'` explicitly, I get the same error:

```
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-38-ae6b005fad2b> in <module>()
      4 with open('movie_questions_2.txt', 'r', encoding='utf-8') as f:
      5 
----> 6     lines = f.readlines()
      7 
      8     for text in lines:

//anaconda/envs/tensorflow/lib/python3.5/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 1085: invalid start byte
```
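
One possible cause, which I have not confirmed: byte 0xad can never start a UTF-8 sequence, so the files may be stored in a single-byte encoding such as ISO-8859-1 (Latin-1). Below is a minimal sketch of a workaround under that assumption. The `load_sentences` helper and the one-line sample file are hypothetical, made up for illustration; only the ` +++$+++ ` separator comes from the loading code above.

```python
import os
import tempfile

SEP = ' +++$+++ '  # field separator used in movie_lines.txt


def load_sentences(path, encoding='iso-8859-1'):
    """Map line ID -> utterance text, decoding with an explicit encoding.

    ASSUMPTION: the corpus is ISO-8859-1 encoded; this is a guess based on
    the 0xad byte in the traceback, not something the repo states.
    """
    sentences = {}
    with open(path, 'r', encoding=encoding) as f:
        for line in f:
            parts = line.rstrip('\n').split(SEP)
            sentences[parts[0]] = parts[-1]
    return sentences


# Hypothetical stand-in for one corpus line, containing a raw 0xad byte.
sample = b'L1 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do\xadnot!\n'
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(sample)
    sample_path = tmp.name

# Decoding as UTF-8 fails in the same way as in the tracebacks above.
utf8_error = None
try:
    with open(sample_path, 'r', encoding='utf-8') as f:
        f.read()
except UnicodeDecodeError as exc:
    utf8_error = exc

# Decoding as ISO-8859-1 succeeds; 0xad becomes U+00AD (soft hyphen).
sentences = load_sentences(sample_path)
os.remove(sample_path)
```

If the real encoding turns out to be something else, `open(path, 'r', encoding='utf-8', errors='replace')` would at least let the file load, at the cost of replacing each undecodable byte with U+FFFD.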

Can you please tell me what happened and how to fix this? 

Thank you very much.
