Improve tokenizer #23
Comments
A related issue: in many citation styles I'm familiar with, journal articles give the issue in parentheses directly after the volume, without a space: "J. Dubious Science, 15(4), pp.45-90". At the moment, such examples in training.txt (for example https://github.com/inukshuk/anystyle-parser/blob/master/resources/train.txt#L178) are marked up with the whole token belonging to "volume", although volume and issue are conceptually separate fields. Dealing with this, however, would mean tweaking the tokenizer further.
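To make the volume/issue case concrete, here is a minimal sketch in Python of how a token such as "15(4)," could be split into its parts. This is not anystyle-parser's code; the pattern `VOLUME_ISSUE` and the function `split_volume_issue` are hypothetical names used only for illustration, and the pattern assumes purely numeric volumes and issues.

```python
import re

# Hypothetical pattern (an assumption, not the project's tokenizer):
# a numeric volume immediately followed by a numeric issue in parentheses,
# optionally followed by one trailing punctuation mark.
VOLUME_ISSUE = re.compile(r"^(\d+)\((\d+(?:[-/]\d+)?)\)([.,;:]?)$")


def split_volume_issue(token):
    """Return (volume, issue, trailing_punctuation), or None if the token
    does not look like a fused volume(issue) pair."""
    match = VOLUME_ISSUE.match(token)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


if __name__ == "__main__":
    print(split_volume_issue("15(4),"))    # fused volume and issue
    print(split_volume_issue("pp.45-90"))  # not a volume(issue) token
```

A tokenizer that applied a split like this would let "15" and "(4)" carry separate labels, at the cost of changing how existing training data is tokenized.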
Yes, for this reason we currently extract issue (and in some cases page numbers) from the volume field in the normalizer. But this is a good time to make adjustments to the tokenizer, and using colons as separators will probably work. I've been holding off on this because it might invalidate existing training data. We'll have to ensure that a reference can be reconstructed from its tagged form; i.e., there has to be a way to tell whether the two tokens
User request:
When a colon is not followed by a blank space (this sometimes happens erroneously, e.g. “London:Routledge”), the word before the colon and the word after it are treated as a single unit and cannot be assigned different labels (so the pair also cannot be marked as incorrect in training). I think that behaviour is generally not desirable; maybe the parser could be altered to break up X:Y pairs by default?
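The requested behaviour could be sketched roughly as follows, in Python rather than the project's own language. This is only an illustration under stated assumptions, not the parser's actual tokenizer: `tokenize` and `detokenize` are hypothetical names, the colon stays attached to the left-hand token, and a colon followed by "/" is left alone so that URLs are not broken up. Each token records whether a space followed it, so the original spacing of an X:Y pair can be restored from the token list. The zero-width split requires Python 3.7 or later.

```python
import re

# Zero-width split point: directly after a colon, when the next character
# is neither whitespace nor "/" (assumption: "://" in URLs stays intact).
COLON_BREAK = re.compile(r"(?<=:)(?=[^\s/])")


def tokenize(text):
    """Split on whitespace, then break X:Y pairs after the colon.

    Returns a list of (token, space_after) pairs; space_after is False
    for a token that was glued to the next one in the input."""
    tokens = []
    for chunk in text.split():
        parts = COLON_BREAK.split(chunk)
        for i, part in enumerate(parts):
            tokens.append((part, i == len(parts) - 1))
    return tokens


def detokenize(tokens):
    """Rebuild the (whitespace-normalized) input from tagged tokens."""
    return "".join(tok + (" " if space else "") for tok, space in tokens).rstrip()


if __name__ == "__main__":
    toks = tokenize("London:Routledge, 1999")
    print(toks)
    print(detokenize(toks))
```

With this split, "London:" and "Routledge," become separate tokens that can receive different labels, while the `space_after` flag keeps the erroneous missing space recoverable.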