Tokenizer splitHyphenated regression #1289

Open
@gangeli

The following snippet splits "year-end" on the hyphen in CoreNLP 3.9.2, but no longer does so in 4.4.0. Is this expected behavior?

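import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class SplitHyphenatedRepro {  // class name added only so the snippet compiles
    public static void main(String[] args) {
        String text = "year-end";
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit");
        props.setProperty("tokenize.language", "en");
        props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation ann = new Annotation(text);
        pipeline.annotate(ann);
        List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);  // tokens produced by the tokenize annotator
        System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
    }
}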

Old output: [year, -, end]
New output: [year-end]
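
For reference, the old output on this input can be approximated outside the tokenizer. The following is only a minimal sketch, assuming plain strings rather than CoreLabel objects and only the ASCII hyphen; `HyphenSplit` and its `splitHyphenated` method are hypothetical names for illustration, not part of CoreNLP, and this is not how the tokenizer itself implements the option.

```java
import java.util.ArrayList;
import java.util.List;

public class HyphenSplit {
    // Hypothetical helper, not CoreNLP API: splits a token on ASCII hyphens
    // and keeps each hyphen as its own token.
    static List<String> splitHyphenated(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (c == '-') {
                // Flush the text collected so far, then emit the hyphen itself.
                if (current.length() > 0) {
                    parts.add(current.toString());
                    current.setLength(0);
                }
                parts.add("-");
            } else {
                current.append(c);
            }
        }
        // Flush any trailing text after the last hyphen.
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitHyphenated("year-end")); // prints [year, -, end]
    }
}
```

Applied to each token's text, this yields the same three pieces as the old output above, though it does not preserve character offsets or other annotations the way the tokenizer would.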
