[BUG] Issue from a user (creating corpus from .txt) #799

@stannam

Describe the bug
A user tried to create a corpus from a .txt file, but PCT raised an error while inspecting the file, before any parsing began. The text file has a single column of transcriptions, with no punctuation or special characters.

The user's file was encoded as UTF-16 LE with a BOM; after I converted it to UTF-8, PCT loaded it without problems. Separately, PCT doesn't import a one-column file, so I had to create a second column by copying the existing one.

  1. Is there a way to detect the encoding (or at least let the user select their encoding) and parse the file accordingly?
  2. It would be great if PCT could automatically create another column if the text file only has a single column of transcriptions.
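
For the first question, one possible approach is to look at the byte-order mark before opening the file as text. The sketch below is only an illustration, not PCT code: `guess_encoding` is a hypothetical helper, and the `utf-8-sig` fallback is an assumption based on the `utf_8_sig.py` frame in the traceback below. Files without a BOM would still need a user-selectable encoding.

```python
import codecs

# BOM prefixes mapped to Python codec names. UTF-32 is checked before
# UTF-16 because the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(path, default="utf-8-sig"):
    """Hypothetical helper: pick a codec from the file's byte-order mark."""
    with open(path, "rb") as f:
        prefix = f.read(4)  # the longest BOM is 4 bytes
    for bom, encoding in _BOMS:
        if prefix.startswith(bom):
            return encoding
    return default  # no BOM found; fall back to the assumed default
```

The returned name could then be passed as `encoding=` to `open()`. The `utf-16` and `utf-32` codecs consume the BOM and use it to choose the byte order, so the BOM does not end up in the first value read.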

Traceback (most recent call last):
  File "D:\PycharmProjects\CorpusTools\corpustools\decorators.py", line 12, in do_check
    function(*args,**kwargs)
  File "D:\PycharmProjects\CorpusTools\corpustools\gui\iogui.py", line 757, in inspect
    atts, coldelim = inspect_csv(self.pathWidget.value())
  File "D:\PycharmProjects\CorpusTools\corpustools\corpus\io\csv.py", line 49, in inspect_csv
    head = f.readline().strip()
  File "C:\Users\Stanley\anaconda3\envs\PCT\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "C:\Users\Stanley\anaconda3\envs\PCT\lib\encodings\utf_8_sig.py", line 69, in _buffer_decode
    return codecs.utf_8_decode(input, errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

The same UnicodeDecodeError was reported before in #726.

Sample corpus file
The file can be found at Phonological_CorpusTools_Public/from_users/dict_sharanahua_fixed_HEAD WORDS ONLY.txt

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'Load corpus'
  2. Go to 'Create corpus from file'
  3. Click on 'Choose file...'
  4. Select the .txt file
  5. See the error
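
The failure does not seem to depend on the GUI. As a minimal sketch, assuming PCT opens the file with the `utf-8-sig` codec (which is what the `utf_8_sig.py` frame in the traceback suggests), the following writes a small UTF-16 LE file with a BOM and reads its first line the same way. The helper name and the sample contents are made up for illustration.

```python
import codecs
import os
import tempfile


def first_line_as_utf8_sig(path):
    """Hypothetical stand-in for the read in inspect_csv (assumed codec)."""
    with open(path, encoding="utf-8-sig") as f:
        return f.readline().strip()


# Build a one-column file encoded as UTF-16 LE with a BOM (made-up contents).
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + "pata\ntika\n".encode("utf-16-le"))

try:
    first_line_as_utf8_sig(sample_path)
    error = None
except UnicodeDecodeError as exc:
    error = exc  # the BOM's first byte, 0xff, is never valid in UTF-8
finally:
    os.remove(sample_path)

print(error)
```

The UTF-16 LE BOM is the byte pair `FF FE`, which is why the decoder stops at byte `0xff` in position 0.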

Additional context
My text editor reports that the .txt file is encoded as UTF-16 LE with a BOM. Once the file was converted to UTF-8, PCT could load it.
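
Until PCT handles this itself, the manual workaround (re-encode to UTF-8 and copy the single column into a second one) could be scripted roughly as follows. This is a sketch under assumptions: `convert_single_column` is a hypothetical helper, the tab delimiter is a guess, and any header line is duplicated like every other line, which may or may not match what PCT expects.

```python
def convert_single_column(src, dst, src_encoding="utf-16", delimiter="\t"):
    """Hypothetical helper: re-encode to UTF-8 and duplicate the single column."""
    with open(src, encoding=src_encoding) as fin, \
         open(dst, "w", encoding="utf-8", newline="\n") as fout:
        for line in fin:
            value = line.strip()
            if not value:
                continue  # skip blank lines
            # Write the same value twice so the output has two columns.
            fout.write(f"{value}{delimiter}{value}\n")
```

For example, `convert_single_column("words_utf16.txt", "words_utf8.txt")` (both file names are placeholders) would turn each line `X` into `X<TAB>X` in the output file.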
