Training set creation using data from GIANT project? #198
Comments
That's interesting! We've also discussed using CSL to generate training data in the past; I'd be curious to know how a model trained on such data performs with real-world input. Obviously you would not want to train a model on 1 billion references, but with such a large resource you could just pick out samples (it would also be interesting to see whether a model still improves after the first couple of thousand references).
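For the sampling step, a minimal sketch in Python (the file name and the one-record-per-line layout are assumptions for illustration; adapt to however the GIANT dump is actually packaged). It uses reservoir sampling so the full billion-record file never needs to fit in memory:

```python
# Sketch only: draw a small random sample from a very large reference file
# without loading it into memory (reservoir sampling).
# "giant_references.txt" and the one-record-per-line layout are assumptions.
import random

def sample_references(path, k=5000, seed=42):
    rng = random.Random(seed)
    reservoir = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line.rstrip("\n"))
            else:
                j = rng.randint(0, i)  # inclusive upper bound
                if j < k:
                    reservoir[j] = line.rstrip("\n")
    return reservoir

if __name__ == "__main__":
    sample = sample_references("giant_references.txt", k=5000)
    print(len(sample), "references sampled")
```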
So the basic idea would be to take a random set of publications from that GIANT dataset and, for each publication, create many citations using a number of different CSL styles; only that instead of plain strings these citations would be converted to XML. So the most interesting question is how to generate the "annotated" (by way of XML elements) sequences for different CSL styles.
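As a rough illustration of the "many styles per publication" step, here is a hedged sketch using citeproc-py with a few locally downloaded CSL style files (the style file names and the sample record are placeholders):

```python
# Sketch only: render one CSL-JSON record through several CSL styles with
# citeproc-py. Style file names and the sample record are placeholders.
from citeproc import (CitationStylesStyle, CitationStylesBibliography,
                      Citation, CitationItem, formatter)
from citeproc.source.json import CiteProcJSON

# One publication in CSL-JSON; in practice this metadata would come from
# the records underlying the GIANT dataset.
records = [{
    "id": "doe2019",
    "type": "article-journal",
    "title": "An Example Title",
    "author": [{"family": "Doe", "given": "Jane"}],
    "issued": {"date-parts": [[2019]]},
    "container-title": "Journal of Examples",
    "volume": "12",
    "page": "34-56",
}]
source = CiteProcJSON(records)

# Hypothetical paths to locally downloaded CSL styles.
for style_path in ["apa.csl", "chicago-author-date.csl", "ieee.csl"]:
    style = CitationStylesStyle(style_path, validate=False)
    bib = CitationStylesBibliography(style, source, formatter.plain)
    bib.register(Citation([CitationItem("doe2019")]))
    for entry in bib.bibliography():
        print(style_path, "->", str(entry))
```

Note this only yields plain reference strings; keeping the field boundaries through the rendering step, rather than re-parsing the plain strings afterwards, is exactly the open question raised above.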
You can put any element into the sequence: each element will correspond to a label that is known to the model. From what I saw, it should be enough to wrap each generated XML reference in a sequence element.
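For illustration, a hedged sketch of that conversion step: it assumes the annotated references can be read as (label, text) spans (the source-side label names and the mapping below are invented for the example) and writes them in the dataset/sequence XML layout used by AnyStyle's training data, where each child element name is a label the model learns:

```python
# Sketch only: write annotated references in the <dataset>/<sequence> XML
# layout that AnyStyle's training data uses, where each child element name
# is a label the model learns. The source-side label names and the mapping
# below are invented for illustration.
from xml.sax.saxutils import escape

GIANT_TO_ANYSTYLE = {          # hypothetical mapping, one entry per label
    "author": "author",
    "title": "title",
    "container-title": "journal",
    "issued": "date",
    "volume": "volume",
    "page": "pages",
}

def to_sequence(spans):
    """spans: list of (source_label, text) pairs for one reference."""
    lines = ["  <sequence>"]
    for label, text in spans:
        tag = GIANT_TO_ANYSTYLE.get(label, "note")  # fall back to 'note'
        lines.append(f"    <{tag}>{escape(text)}</{tag}>")
    lines.append("  </sequence>")
    return "\n".join(lines)

def write_dataset(references, path="giant-sample.xml"):
    with open(path, "w", encoding="utf-8") as out:
        out.write("<dataset>\n")
        for spans in references:
            out.write(to_sequence(spans) + "\n")
        out.write("</dataset>\n")

write_dataset([[
    ("author", "Doe, J."),
    ("issued", "(2019)."),
    ("title", "An Example Title."),
    ("container-title", "Journal of Examples, 12, 34-56."),
]])
```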
This isn't exactly an issue but a question: would you consider it feasible and worthwhile to adopt the data generated here:
GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing
as training input for AnyStyle?
Just curious if you see enough potential there.