[BUG] Issue from a user (creating corpus from .txt)

**Describe the bug**
A user tried to create a corpus from a .txt file, but PCT raised an error even before parsing. The text file has one column with the transcription, without punctuation or special characters.

The user's file was originally encoded as UTF-16 LE with a BOM; after I converted it to UTF-8, PCT loaded it without problems. Separately, PCT doesn't import a one-column file, so I had to create a second column by copying the existing one.

1. Is there a way to detect the encoding (or at least let the user select their encoding) and parse the file accordingly?
2. It would be great if PCT could automatically create another column if the text file only has a single column of transcriptions.
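
For point 1, a minimal sketch of BOM-based detection, assuming it would be acceptable to read the first few bytes of the file before opening it as text. The names `sniff_encoding` and `read_first_line` are hypothetical and not part of PCT; the `utf-8-sig` default is taken from the codec that appears in the traceback. A file without a BOM cannot be identified this way, so an encoding selector in the dialog would probably still be needed as a fallback.

```python
import codecs

# BOMs are checked longest-first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(path, default="utf-8-sig"):
    """Hypothetical helper: guess a text file's encoding from its BOM."""
    with open(path, "rb") as f:
        start = f.read(4)
    for bom, name in _BOMS:
        if start.startswith(bom):
            return name
    # No BOM found: fall back to the default (an assumption, not a detection).
    return default


def read_first_line(path):
    """Read the header line using the sniffed encoding."""
    with open(path, encoding=sniff_encoding(path)) as f:
        return f.readline().strip()
```

The generic `utf-16` and `utf-32` codecs consume the BOM and pick the byte order from it, so the decoded header line does not start with a stray `\ufeff`.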

    Traceback (most recent call last):
      File "D:\PycharmProjects\CorpusTools\corpustools\decorators.py", line 12, in do_check
        function(*args,**kwargs)
      File "D:\PycharmProjects\CorpusTools\corpustools\gui\iogui.py", line 757, in inspect
        atts, coldelim = inspect_csv(self.pathWidget.value())
      File "D:\PycharmProjects\CorpusTools\corpustools\corpus\io\csv.py", line 49, in inspect_csv
        head = f.readline().strip()
      File "C:\Users\Stanley\anaconda3\envs\PCT\lib\codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
      File "C:\Users\Stanley\anaconda3\envs\PCT\lib\encodings\utf_8_sig.py", line 69, in _buffer_decode
        return codecs.utf_8_decode(input, errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

The same UnicodeDecodeError was reported previously in #726.

**Sample corpus file**
The file can be found at `Phonological_CorpusTools_Public/from_users/dict_sharanahua_fixed_HEAD WORDS ONLY.txt`.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to 'Load corpus'
2. Go to 'Create corpus from file'
3. Click on 'Choose file...'
4. Select the .txt file
5. See the error

**Additional context**
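My text editor reports that the .txt file is encoded as UTF-16 LE with a BOM. This matches the traceback: 0xff is the first byte of the UTF-16 LE BOM (FF FE) and is not a valid UTF-8 start byte. After the file was converted to UTF-8, PCT could load it.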
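
Until PCT handles this itself, the manual workaround (re-encode to UTF-8 and duplicate the single column) can be scripted. This is only a sketch: `convert_to_two_column_utf8` is a hypothetical helper, and the tab delimiter and the `utf-16` source encoding are assumptions that may need adjusting for other files.

```python
def convert_to_two_column_utf8(src, dst, src_encoding="utf-16", delimiter="\t"):
    """Hypothetical workaround: re-encode a one-column file as UTF-8
    and write every entry twice, separated by `delimiter`."""
    with open(src, encoding=src_encoding) as fin, \
         open(dst, "w", encoding="utf-8", newline="\n") as fout:
        for line in fin:
            word = line.strip()
            if not word:
                continue  # skip blank lines
            fout.write(f"{word}{delimiter}{word}\n")
```

If the first line of the source file is a header, both output columns will carry the same name, so the header may need to be renamed by hand afterwards.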



[BUG] Issue from a user (creating corpus from .txt) #799
