Improve tokenizer #23

inukshuk · 2014-07-22T08:58:56Z

User request:

When a colon is not followed by a space (which sometimes happens by mistake, e.g. “London:Routledge”), the words before and after it are treated as a single unit and cannot be assigned different labels (so the unit also cannot be trained as incorrect). I think that behaviour is generally undesirable; maybe the parser could be altered to break up X:Y pairs by default?

a-fent · 2017-11-10T16:00:42Z

A related issue: in many citation styles I'm familiar with, journal articles cite the volume with issue in parentheses directly after without a space: "J. Dubious Science, 15(4), pp.45-90".

At the moment, examples of that in training.txt are marked up with the whole token belonging to "volume", but volume and issue are conceptually separate fields. However, dealing with this would mean tweaking the tokenizer further.

For example: https://github.com/inukshuk/anystyle-parser/blob/master/resources/train.txt#L178

inukshuk · 2017-11-10T17:05:04Z

Yes, for this reason we currently extract issue (and in some cases page numbers) from the volume field in the normalizer.
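To illustrate the kind of post-processing described here, the following is a minimal sketch in Python of splitting a combined token such as "15(4)," into volume and issue. It is not the project's actual normalizer code; the pattern `VOLUME_ISSUE` and the function `split_volume` are hypothetical names chosen for this example.

```python
import re

# Hypothetical pattern: leading digits are the volume, the parenthesised
# part is the issue; one trailing punctuation mark is tolerated.
VOLUME_ISSUE = re.compile(r"^(?P<volume>\d+)\s*\((?P<issue>[^)]+)\)[,.;:]?$")


def split_volume(field):
    """Return (volume, issue); issue is None if no parenthesised part is found."""
    text = field.strip()
    match = VOLUME_ISSUE.match(text)
    if match is None:
        # No issue present: only strip trailing punctuation from the volume.
        return text.rstrip(",.;:"), None
    return match.group("volume"), match.group("issue")
```

With the citation from the comment above, `split_volume("15(4),")` yields the volume "15" and the issue "4", while a plain "15," yields the volume "15" and no issue.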

But this is a good time to make adjustments to the tokenizer, and using colons as separators will probably work. I've been holding off on this because it might invalidate existing training data. We'll have to ensure that a reference can be reconstructed from its tagged form; i.e., there has to be a way to tell whether the two tokens a: and b were a: b or a:b in the original reference.
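One way to meet the reconstruction requirement is to store, with each token, the whitespace that followed it in the original reference. The sketch below is in Python and only illustrates the idea; the names `TOKEN`, `tokenize`, and `reconstruct` are assumptions made for this example and do not reflect how the tokenizer was eventually changed.

```python
import re

# A token is a run of non-space, non-colon characters with an optional
# trailing colon, or a lone colon. The second group captures the whitespace
# that followed the token in the original string (possibly empty).
TOKEN = re.compile(r"([^\s:]+:?|:)(\s*)")


def tokenize(reference):
    """Split after colons and keep the trailing whitespace of every token."""
    return [(m.group(1), m.group(2)) for m in TOKEN.finditer(reference)]


def reconstruct(tokens):
    """Rebuild the original string from (text, whitespace) pairs."""
    return "".join(text + glue for text, glue in tokens)
```

Here "London:Routledge" becomes the pairs ("London:", "") and ("Routledge", ""), whereas "London: Routledge" becomes ("London:", " ") and ("Routledge", ""), so the two spellings stay distinguishable after tagging. Leading whitespace is not preserved by this sketch, so the round trip assumes the reference has been stripped first.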

inukshuk closed this as completed May 29, 2018

EmmanuelCharpentier mentioned this issue May 17, 2023

Tokenizer doesn't parse Volume/issue typeset with no space. #212

Open
