Closed
Description:
edu.stanford.nlp.pipeline.StanfordCoreNLP throws an error when it tokenizes a string that contains every possible character, separated by spaces ("... a b c d ..."). It is probably also worth mentioning that the same string without spaces between the characters ("...abcd...") is tokenized successfully.
Prerequisites:
- java: openjdk 17.0.2 2022-01-18
- scala: 2.13.8
- lib: ivy"edu.stanford.nlp:stanford-corenlp:4.5.0"
Minimal example:
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import java.util.Properties
val pipeline = {
  val props = new Properties()
  props.setProperty("annotators", "tokenize")
  new StanfordCoreNLP(props)
}
val text = (Char.MinValue to Char.MaxValue).mkString(" ")
pipeline.processToCoreDocument(text)
Error:
java.lang.Error: Error: could not match input
at edu.stanford.nlp.process.PTBLexer.zzScanError(PTBLexer.java:61605)
at edu.stanford.nlp.process.PTBLexer.next(PTBLexer.java:63479)
at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:301)
at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:185)
at edu.stanford.nlp.process.AbstractTokenizer.hasNext(AbstractTokenizer.java:69)
at edu.stanford.nlp.process.AbstractTokenizer.tokenize(AbstractTokenizer.java:111)
at edu.stanford.nlp.pipeline.TokenizerAnnotator.annotate(TokenizerAnnotator.java:420)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:744)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.process(StanfordCoreNLP.java:793)
...
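To narrow down which characters trigger the error, a scan along the following lines might help. This is only a sketch: `failingChars` is a hypothetical helper, not part of CoreNLP, and the probe string `"a <char> b"` is an assumption about what reproduces the problem. The failure may depend on a combination of characters rather than on a single one, so an empty result would not rule the bug out.

```scala
// Hypothetical helper (not part of CoreNLP): returns the characters for which
// `tokenize` throws when the character is surrounded by spaces.
def failingChars(
    tokenize: String => Unit,
    candidates: Seq[Char] = Char.MinValue to Char.MaxValue
): Seq[Char] =
  candidates.filter { c =>
    try {
      tokenize(s"a $c b") // assumed probe: one character between two spaces
      false
    } catch {
      // The lexer raises java.lang.Error, which `case e: Exception` would miss.
      case _: java.lang.Error => true
    }
  }
```

With the `pipeline` from the minimal example it could be called as `failingChars(t => pipeline.processToCoreDocument(t))`, and the result printed as code points with `.map(c => f"U+${c.toInt}%04X")`.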