Open
Description
Dear Luka
Thanks for this repository. I am currently learning from it, and I ran into the following error at the very first step, loading the dataset:
sentences = {}
with open('cornell movie-dialogs corpus/movie_lines.txt', 'r') as f:
    for line in f.readlines():
        sentences[line.split(' +++$+++ ')[0]] = line.split(' +++$+++ ')[-1].replace('\n', "")
And the error is this:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-35-66409f9e14a9> in <module>()
1 sentences = {}
2 with open('cornell movie-dialogs corpus/movie_lines.txt', 'r') as f:
----> 3 for line in f.readlines():
4 sentences[line.split(' +++$+++ ')[0]] = line.split(' +++$+++ ')[-1].replace('\n', "")
//anaconda/envs/tensorflow/lib/python3.5/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3767: invalid start byte
Even when I download the text files movie_answers_2.txt and movie_questions_2.txt directly from your repo, I get the same error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-38-ae6b005fad2b> in <module>()
4 with open('movie_questions_2.txt', 'r', encoding='utf-8') as f:
5
----> 6 lines = f.readlines()
7
8 for text in lines:
//anaconda/envs/tensorflow/lib/python3.5/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 1085: invalid start byte
Can you please tell me what happened and how to fix this?
Thank you very much.
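A note on the error itself, with a possible workaround. Byte 0xad is 10101101 in binary, which UTF-8 reserves for continuation bytes, so it can never begin a character; that is what "invalid start byte" means. It suggests the files are not UTF-8 at all. The sketch below assumes they use a single-byte encoding such as ISO-8859-1 (an assumption, not something confirmed in this thread). The helper name load_lines and the one-line sample file are made up for illustration and are not taken from the repo or the corpus.

```python
import os
import tempfile

SEP = ' +++$+++ '


def load_lines(path, encoding='iso-8859-1'):
    """Map each line ID to its utterance text.

    The default encoding is an assumption: ISO-8859-1 maps every byte
    to a character, so decoding cannot raise UnicodeDecodeError.
    """
    sentences = {}
    with open(path, 'r', encoding=encoding) as f:
        for line in f:
            parts = line.rstrip('\n').split(SEP)
            sentences[parts[0]] = parts[-1]
    return sentences


# Hypothetical one-line stand-in for movie_lines.txt that contains
# the problematic byte 0xad (made-up content, not real corpus data).
sample = b'L1 +++$+++ u0 +++$+++ m0 +++$+++ SPEAKER +++$+++ hello\xadworld\n'
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(sample)
    sample_path = tmp.name

try:
    # Strict UTF-8 decoding reproduces the error from the traceback.
    try:
        with open(sample_path, 'r', encoding='utf-8') as f:
            f.readlines()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Decoding with the assumed single-byte encoding succeeds.
    sentences = load_lines(sample_path)
finally:
    os.remove(sample_path)

print(utf8_failed)
print(sentences)
```

With this approach 0xad decodes to U+00AD (soft hyphen), which may need stripping before tokenization. If the real encoding turns out to be something else, passing errors='replace' to open() is another way to get past the bad bytes, at the cost of losing those characters.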