Unexpected error thrown on tokenize #1298

Closed
@yakivy

Description:
edu.stanford.nlp.pipeline.StanfordCoreNLP throws an error when it tokenizes a string containing every Char value from Char.MinValue to Char.MaxValue, separated by spaces ("... a b c d ..."). It is probably also worth mentioning that the same characters without spaces between them ("...abcd...") are tokenized successfully.

Prerequisites:

  • java openjdk 17.0.2 2022-01-18
  • scala 2.13.8
  • lib ivy"edu.stanford.nlp:stanford-corenlp:4.5.0"

Minimal example:

import edu.stanford.nlp.pipeline.StanfordCoreNLP
import java.util.Properties

// Pipeline with only the tokenizer enabled
val pipeline = {
    val props = new Properties()
    props.setProperty("annotators", "tokenize")
    new StanfordCoreNLP(props)
}
// Every Char value from U+0000 to U+FFFF, separated by single spaces
val text = (Char.MinValue to Char.MaxValue).mkString(" ")
pipeline.processToCoreDocument(text) // throws java.lang.Error

Error:

java.lang.Error: Error: could not match input
  at edu.stanford.nlp.process.PTBLexer.zzScanError(PTBLexer.java:61605)
  at edu.stanford.nlp.process.PTBLexer.next(PTBLexer.java:63479)
  at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:301)
  at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:185)
  at edu.stanford.nlp.process.AbstractTokenizer.hasNext(AbstractTokenizer.java:69)
  at edu.stanford.nlp.process.AbstractTokenizer.tokenize(AbstractTokenizer.java:111)
  at edu.stanford.nlp.pipeline.TokenizerAnnotator.annotate(TokenizerAnnotator.java:420)
  at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
  at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:744)
  at edu.stanford.nlp.pipeline.StanfordCoreNLP.process(StanfordCoreNLP.java:793)
  ...
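
To narrow the report down, one option is to probe each character on its own. The sketch below is only an illustration: failingChars is a hypothetical helper, not part of CoreNLP, and it assumes that the error can be triggered by a single character placed between two spaces. If the failure depends on a longer sequence of characters, this probe will not find it.

```scala
// Hypothetical helper (not a CoreNLP API): returns every Char for which
// `tokenize` throws java.lang.Error when the character sits between two spaces.
def failingChars(tokenize: String => Unit): Seq[Char] =
  (Char.MinValue to Char.MaxValue).filter { c =>
    try {
      tokenize(s" $c ")
      false
    } catch {
      // The stack trace above shows the lexer failure surfacing as java.lang.Error.
      // Note that this also catches unrelated Errors such as OutOfMemoryError.
      case _: java.lang.Error => true
    }
  }
```

With the pipeline from the minimal example, it could be called as failingChars(t => { pipeline.processToCoreDocument(t); () }), and the result printed as code points with .map(c => f"U+${c.toInt}%04X").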
