
Segmentation fault when tokenizer.tokenize() is used repetitively #16

@trister95

Description


I am trying to tokenize a number of .txt files and store the results as FoLiA XML files.

The first file works fine, but after that the kernel crashes.

A little bit more info:

  • I'm working with the latest ucto version (0.6.4);
  • I've tried this in both VS Code and Colab. In both cases it crashes;
  • I've tried it with Python 3.11.3 and 3.8.10. In both cases it crashes;
  • It doesn't seem to be related to the input files: even when the txt-files are exactly identical, the first file works and the second one crashes;
  • When running the code from a notebook I get this error: Canceled future for execute_request message before replies were done
    The Kernel crashed while executing code in the current cell or a previous cell;
  • When running the code from the command line I get this error: Segmentation fault.
import ucto

configurationfile_ucto = "tokconfig-nld-historical"
tokenizer = ucto.Tokenizer(configurationfile_ucto, foliaoutput=True)

for f in list_with_paths_to_exact_same_files:
    tokenizer.tokenize(f, output_path)

Am I doing something wrong, or is there a bug here?
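Since the crash only appears from the second tokenize() call onward, one workaround worth trying (an untested sketch, not a confirmed fix) is to construct a fresh Tokenizer for each file instead of reusing a single instance. The names `tokenize_each`, `input_paths`, and `output_dir` below are hypothetical, and the import is guarded so the sketch stays importable where python-ucto is not installed:

```python
# Possible workaround (untested sketch): build a new Tokenizer per file
# rather than reusing one instance across repeated tokenize() calls.
import os

try:
    import ucto
except ImportError:  # python-ucto not installed; keep the sketch importable
    ucto = None


def tokenize_each(input_paths, output_dir,
                  config="tokconfig-nld-historical"):
    """Tokenize every input file with its own Tokenizer instance."""
    if ucto is None:
        raise RuntimeError("python-ucto is not installed")
    for path in input_paths:
        # Fresh instance per file, so no tokenizer state carries over
        # between calls; slower, but avoids reuse of one instance.
        tokenizer = ucto.Tokenizer(config, foliaoutput=True)
        out = os.path.join(output_dir,
                           os.path.basename(path) + ".folia.xml")
        tokenizer.tokenize(path, out)
```

If this loop runs without a segfault while the original one crashes, that would narrow the bug down to state kept inside a reused Tokenizer instance.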
