Cannot load movie_lines.txt - 'utf-8' codec can't decode byte 0xad in position 3767: invalid start byte #3

@alucard001

Dear Luka

Thanks for this repository. I am currently learning from it, and I ran into the following error at the very first step, loading the dataset:

sentences = {}
with open('cornell movie-dialogs corpus/movie_lines.txt', 'r') as f:
    for line in f.readlines():
        sentences[line.split(' +++$+++ ')[0]] = line.split(' +++$+++ ')[-1].replace('\n', "")

And the error is this:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-35-66409f9e14a9> in <module>()
      1 sentences = {}
      2 with open('cornell movie-dialogs corpus/movie_lines.txt', 'r') as f:
----> 3     for line in f.readlines():
      4         sentences[line.split(' +++$+++ ')[0]] = line.split(' +++$+++ ')[-1].replace('\n', "")

//anaconda/envs/tensorflow/lib/python3.5/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3767: invalid start byte

Even if I download the text files movie_answers_2.txt and movie_questions_2.txt directly from your repo, I get the same error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-38-ae6b005fad2b> in <module>()
      4 with open('movie_questions_2.txt', 'r', encoding='utf-8') as f:
      5 
----> 6     lines = f.readlines()
      7 
      8     for text in lines:

//anaconda/envs/tensorflow/lib/python3.5/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 1085: invalid start byte

Can you please tell me what happened and how to fix this?

Thank you very much.
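
For anyone hitting the same traceback, here is a possible workaround as a minimal sketch. It assumes the corpus files are stored in a single-byte encoding such as ISO-8859-1 rather than UTF-8, which has not been confirmed here; byte 0xad is not a valid UTF-8 start byte, but ISO-8859-1 maps every byte value to a character, so decoding cannot fail. The function name `parse_movie_lines` and its `encoding` default are hypothetical and not part of the repository.

```python
def parse_movie_lines(path, encoding="iso-8859-1"):
    """Map line ID -> utterance text from a ' +++$+++ '-delimited file.

    Hypothetical helper: the default encoding is an assumption, not a
    confirmed property of the corpus files.
    """
    sentences = {}
    # Passing the encoding explicitly avoids relying on the platform default
    # (UTF-8 in the traceback above), which rejects byte 0xad.
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            fields = line.rstrip("\n").split(" +++$+++ ")
            # First field is the line ID, last field is the spoken text.
            sentences[fields[0]] = fields[-1]
    return sentences
```

Calling `parse_movie_lines('cornell movie-dialogs corpus/movie_lines.txt')` would replace the loop above. If the assumed encoding turns out to be wrong, some characters may come out garbled even though no exception is raised; `open(..., encoding='utf-8', errors='replace')` is another option that keeps UTF-8 and substitutes U+FFFD for undecodable bytes.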
