Description
Describe the bug
A user tried to create a corpus from a .txt file, but PCT raised an error even before parsing. The text file has one column with the transcription, without punctuation or special characters.
The user's file was originally encoded as UTF-16 LE with a BOM; after I converted it to UTF-8, PCT loaded it without problems. Additionally, PCT doesn't import a one-column file, so I had to create a second column by copying the existing one.
- Is there a way to detect the encoding (or at least let the user select their encoding) and parse the file accordingly?
- It would be great if PCT could automatically create another column if the text file only has a single column of transcriptions.
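For the first point, here is a minimal sketch of BOM-based detection using only the standard library. The helper name `detect_encoding` and the `utf-8-sig` fallback are assumptions for illustration, not existing PCT code. Files without a BOM can't be identified this way, so a user-selectable encoding would still be needed for those.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def detect_encoding(path, default="utf-8-sig"):
    """Hypothetical helper: guess a codec name from the file's BOM.

    Returns `default` when no BOM is present.
    """
    with open(path, "rb") as f:
        start = f.read(4)
    for bom, name in _BOMS:
        if start.startswith(bom):
            return name
    return default
```

The returned name could then be passed to `open(path, encoding=...)`. The `utf-16` and `utf-32` codecs consume the BOM and infer the byte order from it, so the first header is read without a stray BOM character.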
Traceback (most recent call last):
  File "D:\PycharmProjects\CorpusTools\corpustools\decorators.py", line 12, in do_check
    function(*args,**kwargs)
  File "D:\PycharmProjects\CorpusTools\corpustools\gui\iogui.py", line 757, in inspect
    atts, coldelim = inspect_csv(self.pathWidget.value())
  File "D:\PycharmProjects\CorpusTools\corpustools\corpus\io\csv.py", line 49, in inspect_csv
    head = f.readline().strip()
  File "C:\Users\Stanley\anaconda3\envs\PCT\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "C:\Users\Stanley\anaconda3\envs\PCT\lib\encodings\utf_8_sig.py", line 69, in _buffer_decode
    return codecs.utf_8_decode(input, errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
The identical UnicodeDecodeError has been reported before as #726
Sample corpus file
The file can be found at Phonological_CorpusTools_Public/from_users/dict_sharanahua_fixed_HEAD WORDS ONLY.txt
To Reproduce
Steps to reproduce the behavior:
- Go to 'Load corpus'
- Go to 'Create corpus from file'
- Click on 'Choose file...'
- Select the .txt file
- See the error
Additional context
My text editor reports that the encoding of the .txt file is UTF-16 LE BOM. When changed to UTF-8, PCT could load it.
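The failure and the workaround can be reproduced without PCT. The snippet below is a small sketch with made-up sample data: it builds the bytes of a UTF-16 LE file with a BOM, shows that the `utf-8-sig` codec (the one that appears in the traceback) rejects the leading 0xff byte, and then performs the same conversion to UTF-8 that my text editor did.

```python
import codecs

# "ab\n" as it would be stored in a UTF-16 LE file with a BOM.
# The first two bytes are FF FE. (Sample content is invented.)
data = codecs.BOM_UTF16_LE + "ab\n".encode("utf-16-le")

# Decoding with utf-8-sig fails on the very first byte.
try:
    data.decode("utf-8-sig")
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

# Decoding as UTF-16 consumes the BOM and infers the byte order.
text = data.decode("utf-16")

# Re-encoding as UTF-8 gives the content of the converted file.
utf8_bytes = text.encode("utf-8")
```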