Support for legal citations/references #195
At what point are you running into problems specifically? I'm guessing you have a setup of marked-up references with the fields "authority" and "reference" marked. Are the equivalent fields in the test dataset still not getting recognised? I did some work with AnyStyle to recognise similar kinds of non-science documents (regulations, standards). It worked pretty well. Some things you can look at doing are making your own type normaliser (adding rules to recognise, for example, legislation) and/or adding features. For example, you might want to recognise commonly used abbreviations in the relevant field, similar to how the "journal name" feature recognises many common journals. For example, if working with German legal texts, you might want a feature that recognises "BVerfG" for decisions of the Bundesverfassungsgericht.
@a-fent Thanks! Would you have some code examples of how you did it? My problem is always that I have to learn Ruby as I go, and even though I have a general grasp of how AnyStyle/Wapiti works, implementing a specific solution is still a challenge for me. Some snippets (or links to published code) on how to implement and use my own features & normalizers would be incredibly helpful.
I would also like to do this for signal phrases such as "see also", "cf.", "on XXXX, see", "for ..., see" etc., which need to be discarded for reference parsing but could also contain useful information (agreement/disagreement) for later analysis.
To understand the code better: how/where in the code are features and labels connected? How would I set up a Feature and tell Wapiti that if I encounter "BVerfG" there is a high chance that this is an "authority"?
I guess what I want to say is that it would be great to have hands-on documentation or a tutorial on how to extend the current feature -> label -> normalizer workflow...
I'll have a go at explaining; if it helps, we could perhaps turn it into a FAQ. It's kind of advanced usage, tbh, partly because it's not very well documented in the original wapiti code that ruby-wapiti is based on.

**Background: Features and Labels**

AnyStyle labels each token (a word separated by whitespace) based on its features. A feature is something like "how is this word capitalised?" or "is this word part of a known journal name?". Position in the whole string and the labelling of surrounding tokens are also taken into account. Each feature of each token is observed - that is, it is assigned a particular value. An easy example is observing the capitalisation: https://github.com/inukshuk/anystyle/blob/master/lib/anystyle/feature/caps.rb . This feature of a token can be, for example, all-caps, initial capital, all lowercase, or something else.
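The idea behind the linked caps feature can be sketched in plain Ruby like this (a standalone approximation; the real class subclasses AnyStyle::Feature, and its exact value names may differ):

```ruby
# Standalone sketch of a capitalisation observation, loosely modelled on
# AnyStyle's Feature::Caps; the value names are illustrative, not
# necessarily the ones the real feature emits.
def observe_caps(token)
  case token
  when /\A[[:upper:]]+\z/         then :caps    # e.g. "BGH"
  when /\A[[:upper:]][[:lower:]]/ then :initial # e.g. "Court"
  when /\A[[:lower:]]+\z/         then :lower   # e.g. "see"
  else                                 :other   # e.g. "BVerfG", "1920"
  end
end
```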
**Adding an authority**

A basic "court authority" feature might be something like:
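The original code block here was lost in formatting; as a rough reconstruction of the idea (a dictionary lookup), here is a standalone sketch. In a real setup this logic would live in the #observe method of an AnyStyle::Feature subclass; the method name, court list, and value names below are invented:

```ruby
# Hypothetical "court authority" observation as a standalone sketch.
COURTS = %w[BVerfG BGH BVerwG BAG BSG EuGH].freeze

def observe_court(token)
  # citations often attach punctuation, e.g. "BVerfG,"
  stripped = token.sub(/[[:punct:]]+\z/, '')
  COURTS.include?(stripped) ? :court : :none
end
```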
**Adding the feature to the pattern**

You also need to tell Wapiti about this feature so that it's included in its model estimation when training. This is the bit that's not well documented - refer to AnyStyle's default pattern: https://github.com/inukshuk/anystyle/blob/master/lib/anystyle/support/parser.txt . I think of this (but I could be wrong) as reserving spaces for the feature values and setting how they are used in the CRF estimation. So for the IsCourt feature you might add lines at the end for the two possible values, :court and something else.
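The pattern lines themselves were lost here; following the `%x[offset,column]` style of the entries in parser.txt, they might look roughly like this (the column index 29 is a made-up placeholder that has to match the position of the new feature's column in the prepared input):

```
# hypothetical additions at the end of a copy of parser.txt
u:court-0=%x[ 0,29]
u:court-1=%x[-1,29]
```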
Save this in a new file, e.g. "my_pattern.txt". I'm afraid if I ever knew exactly how this worked, I've forgotten.

**Putting it together**

You need a training file with examples showing the token "BVerfG" linked to the label "authority". When the model is trained, this will link the feature :court to the label authority. Note there is no particular reason you have to restrict yourself to the fields/labels used in CSL, Bibtex or whatever, other than if you want to use your parsed data in a particular way later. You lastly need your own parser class. Some of this is just boiler-plate.
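The boiler-plate was stripped from the comment above; a rough, untested sketch of what a custom parser subclass could look like follows. The option keys (:pattern, :model) and the feature class name (IsCourt) are assumptions that should be checked against the AnyStyle::Parser source for the version you use:

```ruby
require 'anystyle'

# Sketch of a custom parser; names are assumptions, not verified
# against a specific AnyStyle release.
class LegalParser < AnyStyle::Parser
  @defaults = AnyStyle::Parser.defaults.merge(
    pattern: File.expand_path('my_pattern.txt', __dir__),
    model:   File.expand_path('legal.mod', __dir__)
  )

  def initialize(options = {})
    super
    # register the extra feature so its observations fill the new column
    features << IsCourt.new
  end
end

# parser = LegalParser.new
# parser.train 'legal-training.xml'
# parser.parse 'BVerfG, Urteil vom 15.01.1958 - 1 BvR 400/51'
```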
You can then use this parser as you would AnyStyle.parser (call #train, #parse etc).

**Using your own type classifier**

The assignment of a citation to a particular type of document (e.g. journal article, book, PhD thesis) is done after labelling. It is done heuristically, based on the presence or absence of particular fields (e.g. a journal name, a publisher) and on the values of fields. You might, for example, want to recognise additional types, such as "statute" or "case" (those look to be relevant types in Zotero). You might say that anything with a "court" label is a case and anything with a "statute name" field is a law. Your custom type normaliser would then have lines like:
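The example lines were lost above; the gist, as a standalone sketch (AnyStyle's real type normalizer works on labelled items, and the label names and return values here are illustrative):

```ruby
# Hypothetical classification rules in the spirit of AnyStyle's type
# normalizer: decide a document type from which fields are present.
def classify_type(item)
  return :legal_case  if item.key?(:authority)
  return :legislation if item.key?(:'statute-name')
  :document
end
```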
You'll need to tell your customised Parser class to use your own type normaliser instead of AnyStyle's default - see above.

**Random suggestions**
There is some information about patterns in the original wapiti codebase here: https://github.com/Jekub/Wapiti/blob/master/src/pattern.c . This may or may not be illuminating for you. It also bears mentioning that the fact that such customisations are possible and relatively easy with AnyStyle is down to the clever and elegant design of the software by @inukshuk
I totally agree that AnyStyle has an excellent design - I come from another library which was really hard to work with, and I really appreciate it! So the only missing piece of the puzzle is how to translate what I want to add in Ruby into the Wapiti pattern, since from looking at parser.txt it isn't clear to me at all how the feature selection maps onto this pattern. Maybe Sylvester can tell us more about the details. Does the order or the naming of the pattern entries matter? What do the different parts of a pattern mean?
As they say: if everything else fails, read the manual. Here's something from the Wapiti docs (somewhat reformatted): Pattern files are almost compatible with CRF++ templates. Empty lines, as well as all characters appearing after a `#`, are discarded. The first char of a pattern must be either `u`, `b`, or `*`, depending on whether the observation is unigram, bigram, or both. The remaining part of the pattern is used to build an observation string. Each marker of the kind `%x[off,col]` is replaced by the token found in column `col` of the data, at the current position plus the offset `off`.
Note that sequences are implicitly padded with special tokens such as `_x-1` or `_x+1` when an offset points before the start or past the end of the sequence. Wapiti also supports a simple kind of regexp matching, which can be useful, for example, in natural language processing applications. This is done using two other commands of the form `%t[off,col,"regexp"]` and `%m[off,col,"regexp"]`. The regular expression implemented is just a subset of the classical regular expressions found on unix systems, but is generally enough for most tasks. The recognized subset is quite simple. First, for matching characters:
And the constructs:
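To make the `%x[offset,column]` extraction markers that Wapiti patterns use more concrete, here is a toy Ruby simulation of the extraction semantics. This is not Wapiti code, and the exact spelling of the padding tokens is an assumption:

```ruby
# Simulate how a %x[offset,column] marker picks an observation out of
# Wapiti's tabular input. Out-of-range positions yield padding tokens.
def extract(rows, pos, offset, col)
  i = pos + offset
  return format('_x%+d', offset) if i.negative? || i >= rows.length
  rows[i][col]
end

rows = [
  %w[BVerfG other court],   # token, caps feature, dictionary feature
  %w[Urteil initial none],
  %w[vom lower none]
]

extract(rows, 1, -1, 0)  # token one position back from "Urteil"
extract(rows, 0, -1, 2)  # padded: before the start of the sequence
```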
So I get that AnyStyle only seems to use plain token extraction (the `%x` markers), not the test/match commands.
Thanks @a-fent for the write-up above! If I remember correctly, when we re-designed the parser the last time we tried to make it possible to use it with different patterns/features even without sub-classing -- though I'm not completely sure we succeeded in this? In any case, I think you could even use the parser as is and remove, add, or manipulate the default features and normalizers. At least that was my intention, if you want to make some small adjustments. For bigger changes, sub-classing is obviously still the best option.

I think there is a lot that can be done to still improve individual normalizers; it's also fairly trivial to add more normalizers to AnyStyle -- either to the default configuration or as optional ones. Adding more labels or even features is more problematic, because some care must be taken that doing so does not yield worse parse results for the current set of supported references. But thanks to the gold set I feel like we have a fairly good setup in place to protect us from bad regressions.

Wapiti's pattern files are a little cryptic, I agree. If I remember correctly, I decided to use only the plain extraction markers. @cboulanger I suggest looking at some simpler pattern files to get a better understanding of them. You can find some examples here. At a very high level, what you need to understand is only that wapiti takes a kind of tabular input. Each line is a token (the first word), followed by a fixed number of 'feature words', and a final label (this label is used for training; later on it is the thing that will be predicted by the model). The pattern file is a way to give wapiti instructions on how to interpret this input (the feature words). You can use the pattern file to extract a lot of information even from very simple inputs -- e.g. the most simple input would be just the token word itself. AnyStyle's approach is to analyze the tokens in Ruby; it's basically a pre-processor that compiles the tabular input for wapiti. While each token is a line, successive lines form a sequence.
I basically wrote
Are you saying that the
No, the pattern file isn't generated, but you need to write it by hand only if you want to add a new feature to the model.

That said, if you don't have a dictionary of known court abbreviations, you do know that it is extremely common for them to use abbreviations with mixed capitalisation. Using our caps feature, they would all get classified as 'other' -- maybe that feature could be extended by additional patterns that would help distinguish these court abbreviations. Then again, given some training material that links these abbreviated court names to the court label, I think that should be enough.

The way I would try to think about it is this: when you look at the reference, what information tells you how to classify a given word? Then print out the set of feature information that AnyStyle currently creates for that word. If the salient information can be inferred from those features, then the model should easily learn it if you feed it some consistently labelled data. I don't mean to discourage adding new features; I just think, in general, that the model is easier to understand and reason about if there are fewer features (also labels) -- in fact, I suspect we already have more features than necessary, though I have no hard evidence to back up this suspicion.
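One way to extend a caps-style check along those lines, sketched standalone (the patterns below are illustrative and not taken from AnyStyle):

```ruby
# Heuristic for mixed-capitalisation abbreviations such as "BVerfG" or
# "BGBl", which plain all-caps/initial-caps checks would lump into :other.
def mixed_caps?(token)
  token.match?(/\A[[:upper:]].*[[:lower:]].*[[:upper:]]/) ||
    token.match?(/\A[[:upper:]]{2,}[[:lower:]]/)
end
```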
Thanks for that. Actually, the simpler the solution is, the better, and if I don't need to add a feature I'd gladly omit that step! I was simply thinking that this was the thing to do. So in order to catch all courts, I can also generate a list of courts with the label and use this synthetic training material to tell AnyStyle about them. I could also generate synthetic training material for court decisions, legal codes, and law journals, and just include it in the training material. So all that would be left to do is to add a categorizer to translate the labels into CSL fields/types. Is this correct?
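AnyStyle's training data is XML whose tags correspond to the labels (see the files under res/parser in the repository); a synthetic sequence for a court decision might, assuming a new "authority" label and with the other label names only as guesses, look roughly like:

```xml
<dataset>
  <sequence>
    <authority>BVerfG,</authority>
    <genre>Urteil vom</genre>
    <date>15.01.1958</date>
    <note>- 1 BvR 400/51</note>
  </sequence>
</dataset>
```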
Well, @a-fent's assessment will be more on point here since he has already worked with similar data. I'd definitely start with the current set of features; it might be necessary to add one or two new labels (like authority -- I think we don't use that one yet), but you can do this simply by supplying training data. Normalizers should be easy to modify or add so that the end result includes the necessary CSL fields -- if they are general purpose we can add them here, but they're easy to add to your own setup as well. Similarly, the type classifier can be amended quite easily. I'd explore adding new features only if the results from the labeling phase are inferior even when supplying sufficient training data.
Yes, definitely look to use training before messing around with features. Train with real, or at least realistic, data - i.e. full citations, not lists of words. Make a test set of marked-up citations that the Parser isn't trained on so you can track regressions. If the set of entities you're interested in (e.g. courts, laws, cases) is fairly small and they have distinctive identities ("BVerfG") they will be picked up quickly by the word-literal feature @inukshuk mentions. Adding a feature was worth it for me b/c I had (1) a large set of relevant entities with (2) names that were prone to confusion with other labels and (3) messy data with mixed data types and inconsistent citation formats. A dictionary-type feature (like journal, place) is probably only worth it if you have hundreds or more entities.
Since the technical questions have been discussed in this issue, I am continuing on from the issue on signal words here - this is not so much about legal citations and authorities as such (because I haven't gotten to that part yet), but about training for recognizing the introductory signal words and phrases mentioned in the other issue. To recap, what I want to achieve is that AnyStyle recognizes these phrases (see examples) and labels them so that they won't be labelled as part of the reference and can also serve as an indicator of where two references in the same line can be separated. I am inclined to think that a custom dictionary feature performs better here, because they are almost always a very strong indication of the label, and training (with synthetic data) hasn't been successful so far. Of course, it is not about single words only, but about whole phrases.

If I want to test whether training using the existing features or a new custom dictionary feature performs better, there are two things left that I haven't fully understood yet. The first one is just a clarification: I assume it is not possible to add a generic feature that would allow associating a list of words with a particular label, since each feature (=>label) requires its own column for Wapiti. If this is so, I don't understand yet how the column number in a new pattern that I add to
@a-fent I don't know if your code is open source and published, but if so, it would probably be easiest if you could just point me to it.
I think you're putting too much hope on a dictionary feature for this. It's helpful to look at the data that AnyStyle prepares for wapiti -- this is also what the columns in the pattern refer to. For example:

```ruby
require 'anystyle'
AnyStyle.parser.prepare('Vgl. John Doe, 2022')
```

returns the dataset including all the feature observations. You can inspect it, e.g., to look at the observations for each token:

```ruby
puts AnyStyle.parser.prepare('Vgl. John Doe, 2022.').to_s
```
Adding a dictionary for some of the words you're concerned with here would add one more column (which you can then reference in the pattern file). One benefit of the pattern file is that you can relate multiple observations of the same token and, importantly, also of neighbouring tokens. I don't really see how adding a dictionary for these signal words would be that helpful.
Hi, thanks. Of course I trust your judgement on that. Maybe I just need more manually annotated material. The synthetic one simply has a random sample of the signal words at the beginning and between references, so that might be the problem. |
Ok, I put some more love into the parser annotations. You can process some particularly nasty footnotes here. If you select "Model" -> "footnotes", then "Parse/Segment", then "Parse/Segment" -> "Auto-tag ...", you get the auto-tagged result. Of course, this is the result of training with itself, so it is not new unseen data - on which it has performed much worse. But I hope it will get better with more annotations.
In my target literature, there are many references to court cases, regulations, and laws. AnyStyle does not support these well, as there are no features, normalizers, or formatters that cover the citation practices for such references. Is this something that should go into the AnyStyle core, or rather a case for writing a custom parser? In any case, could you give some suggestions on how to extend the parser with such functionality? If I understand the CSL specification correctly, I would need to output the "authority" (such as a court) and the "references" (for the case number etc.) CSL fields. To this end, I assume additional features should be added, such as weighing "X v. Y" highly as an indicator of a court case (in Germany, it would be a list of court abbreviations).