Improve tokenizer #23

Closed
inukshuk opened this issue Jul 22, 2014 · 2 comments

Comments

@inukshuk
Owner

User request:

When a colon is not followed by a blank space (which sometimes happens erroneously, e.g. “London:Routledge”), the words before and after the colon are treated as a single unit and cannot be assigned different labels (so it also cannot be trained as incorrect). I think that behaviour is generally undesirable; maybe the parser could be altered to break up X:Y pairs by default?
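To make the request concrete, here is a minimal Python sketch of a whitespace tokenizer that also breaks X:Y pairs. It is not AnyStyle's actual tokenizer (the project is written in Ruby); the function name `tokenize` and the rule itself are assumptions for illustration. The rule only splits when the colon follows a letter and is not followed by a slash, so that numeric forms such as "15:45-90" and URLs stay intact.

```python
import re

# Hypothetical rule: split right after a colon that follows a letter and is
# directly followed by a non-space, non-slash character.
_COLON_SPLIT = re.compile(r"(?<=[^\W\d_]:)(?=[^\s/])")


def tokenize(reference: str) -> list[str]:
    """Split on whitespace, then break up X:Y pairs inside each chunk."""
    tokens: list[str] = []
    for chunk in reference.split():
        # The pattern is zero-width, so the colon stays on the first token.
        tokens.extend(_COLON_SPLIT.split(chunk))
    return tokens


print(tokenize("London:Routledge, 1999"))
print(tokenize("London: Routledge, 1999"))
```

With this rule both spellings produce the same token sequence, so "London:" and "Routledge," can receive different labels either way.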

@a-fent
Collaborator

a-fent commented Nov 10, 2017

A related issue: in many citation styles I'm familiar with, journal articles cite the volume with issue in parentheses directly after without a space: "J. Dubious Science, 15(4), pp.45-90".

At the moment, examples of that in train.txt are marked up with the whole token belonging to "volume", but volume and issue are conceptually separate fields. However, dealing with this would mean tweaking the tokenizer further.

For example: https://github.com/inukshuk/anystyle-parser/blob/master/resources/train.txt#L178
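A tokenizer tweak of the kind described above might look like the following Python sketch. The helper name `split_volume_issue` and the pattern are hypothetical, not part of anystyle-parser; the sketch assumes the volume is a plain number and the issue is a number (or a short range) in parentheses.

```python
import re

# Assumed shape: digits, then a parenthesised issue, then optional trailing
# punctuation, e.g. "15(4)," or "12(3-4)."
_VOLUME_ISSUE = re.compile(r"^(\d+)(\(\d+(?:[-/]\d+)?\)\S*)$")


def split_volume_issue(token: str) -> list[str]:
    """Split a fused volume/issue token in two; leave other tokens alone."""
    match = _VOLUME_ISSUE.match(token)
    return [match.group(1), match.group(2)] if match else [token]


tokens = []
for chunk in "J. Dubious Science, 15(4), pp.45-90".split():
    tokens.extend(split_volume_issue(chunk))
print(tokens)
```

After the split, "15" can be labelled as volume and "(4)," as issue, instead of the whole token being labelled as volume.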

@inukshuk
Owner Author

Yes, for this reason we currently extract issue (and in some cases page numbers) from the volume field in the normalizer.
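As a rough illustration of that kind of post-processing, the Python sketch below pulls an issue number out of a volume field after labelling. It is not the project's normalizer; the function name `normalize_volume`, the dictionary layout, and the pattern are assumptions made for this example.

```python
import re

# Assumed shape of a fused field: volume digits followed by a parenthesised
# issue, e.g. "15(4)," as labelled by the parser.
_VOLUME_WITH_ISSUE = re.compile(r"^\s*(\d+)\s*\(([^()]+)\)")


def normalize_volume(fields: dict[str, str]) -> dict[str, str]:
    """Return a copy of the fields with the issue split out of the volume."""
    result = dict(fields)
    match = _VOLUME_WITH_ISSUE.match(result.get("volume", ""))
    # Do not overwrite an issue that was already labelled separately.
    if match and "issue" not in result:
        result["volume"] = match.group(1)
        result["issue"] = match.group(2).strip()
    return result


print(normalize_volume({"journal": "J. Dubious Science", "volume": "15(4),"}))
```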

But this is a good time to make adjustments to the tokenizer, and using colons as separators will probably work. I've been holding off on this because it might invalidate existing training data. We'll have to ensure that a reference can be reconstructed from its tagged form; i.e., there has to be a way to tell whether the two tokens a: and b were a: b or a:b in the original reference.
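One possible way to meet the reconstruction requirement is to record, for each token, whether whitespace followed it in the original reference. The Python sketch below assumes this approach; the `Token` type, the `space_after` flag, and both function names are hypothetical and say nothing about how AnyStyle stores its training data. Runs of whitespace are collapsed to a single space, so the round trip is exact only for references that use single spaces.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    space_after: bool  # True if whitespace followed the token in the source


# Zero-width split point right after a colon that is followed by a non-space.
_COLON_SPLIT = re.compile(r"(?<=:)(?=\S)")


def tokenize(reference: str) -> list[Token]:
    tokens: list[Token] = []
    for chunk in reference.split():
        parts = _COLON_SPLIT.split(chunk)
        # Every part except the last was glued to its successor.
        tokens.extend(Token(part, False) for part in parts[:-1])
        tokens.append(Token(parts[-1], True))
    return tokens


def reconstruct(tokens: list[Token]) -> str:
    joined = "".join(t.text + (" " if t.space_after else "") for t in tokens)
    return joined.rstrip()


for ref in ("London:Routledge", "London: Routledge"):
    toks = tokenize(ref)
    print([t.text for t in toks], reconstruct(toks) == ref)
```

Both inputs yield the same token texts ("London:" and "Routledge"), while the flag preserves the difference between a:b and a: b.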
