training finder with xml documents #215
Comments
You can train the parser module with XML files (see the samples in res/parser). However, the finder module is trained using TTX files (see res/finder). TTX is a custom format that makes annotating long text documents relatively easy. Using XML for this would be cleaner but would require a lot of extra tooling to make it feasible (for training documents you need to label every line of a document, after all). Please note that the documents in res/finder are only one part of the sources used to train the default finder module. We can't publish the other documents for copyright reasons. |
Thank you. So must the XML have that specific structure, or can I use my own annotation? |
The XML used for the parser must use the structure as in the sample files, i.e. one The parse command takes text input (one reference per line). But of course you can use the finder and parser modules in combination, for example with the CLI tool. The finder module would extract the references from a PDF or text document and pass them on to the parser module, which would then segment and label each reference individually. |
Basically, the finder module takes entire documents; it splits the document into lines and operates on each line: every line is assigned a label; multiple lines with the same label are grouped together; reference groups are extracted; a heuristic based on regular expressions is applied to try and separate individual references. The parser module takes one or more lines as input; each line is interpreted as a single reference; the line is split into word-tokens and each word is labeled; successive words with the same label are grouped together; normalizer routines are applied for specific labels. |
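The grouping step that both modules share ("multiple lines with the same label are grouped together", "successive words with the same label are grouped together") can be sketched in a few lines. This is a Python illustration only, not AnyStyle's own Ruby code; the function name `group_by_label`, the labels, and the sample data are all made up for the example, and the labeling itself (which the trained model does) is simply assumed as given input.

```python
from itertools import groupby

def group_by_label(labeled_items):
    """Merge runs of consecutive (label, text) items that share a label."""
    return [
        (label, [text for _, text in run])
        for label, run in groupby(labeled_items, key=lambda item: item[0])
    ]

# Hypothetical finder-style input: one (label, line) pair per document line.
lines = [
    ("text", "as discussed in the previous section."),
    ("ref", "Doe, J. (2001). A title."),
    ("ref", "Journal of Examples, 1(2), 3-4."),
    ("text", "Appendix A"),
]
line_groups = group_by_label(lines)

# Keep only the reference groups; each group is one block of reference text.
reference_blocks = [" ".join(texts) for label, texts in line_groups if label == "ref"]

# Hypothetical parser-style input: one (label, word) pair per token.
tokens = [
    ("author", "Doe,"), ("author", "J."),
    ("date", "(2001)."),
    ("title", "A"), ("title", "title."),
]
segments = group_by_label(tokens)
```

The same helper serves both cases because only the unit differs: lines for the finder, word-tokens for the parser. The regex heuristic that splits a reference block into individual references and the normalizer routines are not shown here.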
Thank you again. Does AnyStyle provide a converter for creating TTX files? |
Yes, you can save documents as TTX. TTX is just plain text but with a certain prefix on each line; it was built for manual annotation using diff and simple text editors like nvim. You can also find more background info in some issue threads here. |
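To make "plain text but with a certain prefix on each line" concrete, here is a minimal Python sketch of reading such a file. The exact layout is an assumption, not something stated in this thread: the sketch assumes each line is a label column, then `| `, then the original text, and that an empty label column means "same label as the previous line". Check the files in res/finder before relying on it. The function name `parse_ttx` and the sample text are hypothetical.

```python
def parse_ttx(text):
    """Return (label, content) pairs from label-prefixed lines.

    Assumed layout (not confirmed here): "<label> | <content>", where an
    empty label column repeats the label of the previous line.
    """
    pairs = []
    current = ""
    for line in text.splitlines():
        prefix, sep, content = line.partition("|")
        if not sep:
            continue  # no separator found: not an annotated line, skip it
        label = prefix.strip()
        if label:
            current = label
        # Drop the single space that follows the separator, if present.
        if content.startswith(" "):
            content = content[1:]
        pairs.append((current, content))
    return pairs

# Made-up sample in the assumed layout.
sample = (
    "title     | A Sample Document\n"
    "blank     |\n"
    "text      | Some running text\n"
    "          | that continues here.\n"
    "ref       | Doe, J. (2001). A title.\n"
)
pairs = parse_ttx(sample)
```

Because every annotation lives in a fixed-position prefix, changing a label touches only the start of a line, which is what makes plain diff and a simple text editor workable for annotation.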
So if I understood correctly: |
When using the CLI tool you can pass the model file as an argument from the command line. If you use the Ruby Gem you can set |
While trying to train the parser I got this error: |
I think this is probably a cryptic error message due to invalid training data. It's usually something like a blank segment (i.e., something like |
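One way to look for this kind of problem before training is to scan the XML for elements without any text. The Python sketch below is only a rough check under stated assumptions: it assumes the training file wraps each reference in a `sequence` element whose children are the labeled segments (compare with the samples in res/parser), and it treats "blank" as "empty or whitespace-only text". The function name `find_blank_segments` and the sample data are made up.

```python
import xml.etree.ElementTree as ET

def find_blank_segments(xml_text):
    """List (sequence number, tag) for segments with no non-whitespace text.

    Assumes each reference is a <sequence> element with one child element
    per labeled segment; adjust the element name if your files differ.
    """
    root = ET.fromstring(xml_text)
    problems = []
    for index, sequence in enumerate(root.iter("sequence"), start=1):
        for segment in sequence:
            if not (segment.text or "").strip():
                problems.append((index, segment.tag))
    return problems

# Made-up training data with two blank segments in the second reference.
sample = """<dataset>
  <sequence>
    <author>Doe, J.</author>
    <date>(2001).</date>
    <title>A title.</title>
  </sequence>
  <sequence>
    <author>Roe, R.</author>
    <title> </title>
    <journal/>
  </sequence>
</dataset>"""

problems = find_blank_segments(sample)
for index, tag in problems:
    print(f"sequence {index}: blank <{tag}> segment")
```

If the scan reports nothing, the cause of the error lies elsewhere; this only covers the blank-segment case mentioned above.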
I am trying to train AnyStyle with a set of XML documents. I cannot find much information on how to do that, and I have a couple of questions:
Thank you in advance.