<citedRange unit="entry"><foreign>...</foreign></citedRange> #310

Open
arlogriffiths opened this issue May 23, 2024 · 21 comments

@arlogriffiths
Collaborator

  • code: <bibl><ptr target="bib:Goris1954_01"/><citedRange unit="volume">2</citedRange><citedRange unit="page">319</citedRange><citedRange unit="entry"><foreign>tanggung</foreign></citedRange></bibl>

  • display: [screenshot: Capture d’écran 2024-05-23 à 09 06 29]

  • issue: The use of <foreign> in the above context does not yet have the desired effect in display.

@arlogriffiths arlogriffiths added the invalid This doesn't seem right label May 23, 2024
@michaelnmmeyer
Member

I would rather not allow the use of XML elements except for <citedRange unit="mixed">, because I am doing pattern matching on the contents of citedRange (for replacing "-" with en dashes, for detecting whether several items are cited, etc.). Doing that on an XML tree is a real mess, and we would not gain much from it.

So, if @danbalogh is OK with that, I would suggest we only allow plain text in <citedRange>, except when @unit is mixed. In the latter case, the element's contents are left unchanged, so there is no issue.
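
A minimal sketch of the restriction described above, assuming the files are read with Python's standard `xml.etree.ElementTree`; the helper name `cited_range_text` is hypothetical and not part of the project's code. It shows why plain text is easier to pattern-match: once a child element is present, the text is split across several nodes.

```python
import xml.etree.ElementTree as ET


def cited_range_text(xml_fragment):
    """Return the plain text of a <citedRange>, or None for unit="mixed".

    Hypothetical helper: mixed ranges are passed through unchanged, so no
    text is extracted for them; any other range must be plain text.
    """
    elem = ET.fromstring(xml_fragment)
    if elem.get("unit") == "mixed":
        return None  # contents are left untouched, nothing to match on
    if len(elem) > 0:
        # Child elements split the contents across .text and .tail nodes,
        # which is what makes pattern matching on the tree awkward.
        raise ValueError("XML elements are only allowed when unit='mixed'")
    return elem.text or ""


print(cited_range_text('<citedRange unit="page">319</citedRange>'))
```

With this shape, hyphen replacement and plural detection only ever see a single string.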

@arlogriffiths
Collaborator Author

So it seems you are suggesting that I should encode as follows:

<bibl><ptr target="bib:Goris1954_01"/><citedRange unit="mixed">vol. 2, p. 319, s.v. <foreign>tanggung</foreign></citedRange></bibl>

That could work, though I'd prefer a solution that doesn't force me to diverge from the usual pattern only because I want to see italics.

I imagine that people will only want to see italics in cases where the @unit of <citedRange> is "entry". Does this limitation help at all to keep the trouble with pattern searching in check?

@michaelnmmeyer
Member

@arlogriffiths @danbalogh @manufrancis

This can be made to work. But we first need to decide what the criteria are for determining whether the contents of citedRange refer to a single item or to many, so that the proper form of @unit (singular or plural) is displayed. The current solution does not really work.

We should at least have a way to specify unambiguously whether there is a single item or many. To remove ambiguities, I propose we use the plural form of @unit ("pages", "entries", etc.) to indicate that there are many items, and to use the singular one ("page", "entry", etc.) only when there is a single item. This requires transformations on the existing files, which can be automated.
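
A rough sketch of what explicit plural units would mean for display, assuming a simple lookup table: the label depends on the value of @unit alone, and the contents are never inspected. The table and the function name are hypothetical; "p.", "pp.", "vol.", "s.v." and "s.vv." occur in this thread, while "vols." is an assumed label.

```python
# Hypothetical label table keyed by the value of @unit.
UNIT_LABELS = {
    "page": "p.", "pages": "pp.",
    "volume": "vol.", "volumes": "vols.",  # "vols." is an assumption
    "entry": "s.v.", "entries": "s.vv.",
}


def render_cited_range(unit, text):
    """Prefix the contents with the label selected by @unit alone."""
    return f"{UNIT_LABELS[unit]} {text}"


print(render_cited_range("page", "319"))     # p. 319
print(render_cited_range("pages", "43-45"))  # pp. 43-45
```

No guessing is involved, at the price of doubling the number of permitted unit values.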

@danbalogh
Collaborator

danbalogh commented May 27, 2024

My number one comment on this, you can probably guess, is that this is a fine detail to which we should not devote a lot of time and effort.

Number two. @michaelnmmeyer, I'm completely OK with not permitting XML elements within citedRange at all; I'm also OK with permitting them in <citedRange unit="mixed">. It has in fact never occurred to me to use any further elements within this one.

Three. Why not make the display of <citedRange unit="entry"> italic by default? That would give the display Arlo wants without the need to use an XML element for formatting, and without the need to switch to mixed unit. Are there any circumstances where the italic display of an "entry" is so undesirable as to rule this solution out?

Four. I don't know in what way the current solution for determining singular/plural unit does not work. I don't recall the details, but if it doesn't work as expected, couldn't the display transformation be tweaked further? I would very much dislike a further complication of our already hellishly complicated reference encoding with the introduction of units like "pages" etc. In addition to the increased (practically doubled) complexity, I have the following concerns about Michaël's [edited] note that conversion to this in existing files can be automated. The smaller one: OK, so conversion can be automated, and we or Michaël make the change in all existing files on date X. Can we realistically expect all encoders to switch to the new system consistently from date X onward, or would the auto-conversion have to be repeated regularly? The bigger one: if conversion can indeed be automated, then why can't the same algorithm that makes the conversion in the files be used in the display transformation to achieve the desired display without altering and complicating the code?

@michaelnmmeyer
Member

@danbalogh

For 3). We have a few cases where several entries are given (as in <foreign>word1</foreign> and <foreign>word2</foreign>); italics are not desirable in this case. There are about a dozen <citedRange unit="entry"> that contain foreign elements. Plain text is used everywhere else (except for a single instance in one of Manu's inscriptions).

For 4). The problem is that the format of references is unrestricted, but that the app is still supposed to guess whether they refer to a single item or to multiple ones, and thus often produces "wrong" results. There is no way to fix this besides encoding the reference explicitly with @unit="mixed", but people (Manu and Arlo so far) apparently do not want to do that.

@danbalogh
Collaborator

Anything you and the PIs can agree on will be acceptable for me, so there is no need to go along with my wishes here. However,
For 3) fair enough, italics are not desirable for "and" between entries. But I'm not at all sure that "and" is desirable between entries; if there are only a few cases of this, then I think the straightforward solution would be to replace the "and" in those entries with a comma. It would not be a problem if the comma were displayed in italics, and at the same time, the presence of the comma would be a flag for the algorithm that this is a plural. Although in the EGD we had written that the contents of <citedRange unit="entry"> will not be italicised by default, I now think that it would actually be better to do so for consistent display, and to explicitly forbid using anything in the contents of that element other than the actual entries and commas.

For 4) you have not answered my main concern: how can it be possible to automate replacing the value of @unit with plurals in the code, if it is not possible to automate doing so in display? Apart from that, I of course understand that the display of plurals doesn't work because the format of the references is too lax. What I do not know is precisely what laxities prevent this from happening. My feeling is that inconsistent laxities should be eliminated from our encoding, while consistent laxities should be formulated as supplementary rules for determining when a plural is needed. As best I recall, the main problem was that appendix names could include hyphens and perhaps commas as well. In my opinion, before you decide to further complicate the entire already complex system of reference encoding for the sake of meticulous display in the case of a small minority (1%? 5%?) of all our references, we should be clear on the exact cases where the PIs don't want to use @unit="mixed" and see if those cases could be catered for.

But I repeat: anything is acceptable to me.

@michaelnmmeyer
Member

For 4), manual corrections would indeed be needed.

I would rather simplify the current encoding than complicate it. My position is that we would be much happier if we just abandoned all the citedRange and @unit stuff, and used plain references like

<bibl><ptr target="bib:Goris1954_01"/>, vol. 2, p. 319</bibl>

everywhere.

@danbalogh
Collaborator

Since I don't think we ever want to make those references machine-actionable to the level of <citedRange>, I think your suggestion should be seriously considered. It is certainly acceptable to me. I think the main reason why Arlo and I introduced <citedRange> and the various units to begin with was that this would enforce some level of consistency (in reference structure and display) across the project. But given that we now have a proliferation of units and, apparently, a number of people dissatisfied with what can be done within the system as well as some "hacked" usage to achieve citations for which the system was not intended, we may indeed be better off abandoning all the complication. Or reducing it greatly, e.g. keeping two permitted values of @unit, namely "page" (to which the existing citedRanges without unit would be converted) and "free" or "mixed", to cover everything else. Anyway, that is for the PIs to decide. If we made this leap, I'm quite sure that a fair amount of manual checking and revision would be necessary to convert the existing encoded references to this system, even though much of the conversion could be handled automatically using the existing algorithm for display transformation to hard-code the citation into the "free" part. But if any other solution also needs manual revision, then this may not be too bad.
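
A minimal sketch of the partly automatic conversion mentioned above, assuming Python's standard `xml.etree.ElementTree`. The function name and the label table are hypothetical. Only the simplest cases are converted; anything that might need a plural label, italics or a rarer unit is returned unchanged for manual revision.

```python
import xml.etree.ElementTree as ET

# Assumed singular labels; "vol." and "p." are the ones used in this thread.
LABELS = {"volume": "vol.", "page": "p."}


def flatten_to_mixed(bibl_xml):
    """Merge simple <citedRange> children into one unit="mixed" element."""
    bibl = ET.fromstring(bibl_xml)
    ranges = bibl.findall("citedRange")
    parts = []
    for cr in ranges:
        unit = cr.get("unit", "page")  # no @unit means page numbers
        text = (cr.text or "").strip()
        needs_manual_check = (
            unit not in LABELS               # "mixed", "entry", rarer units
            or len(cr) > 0                   # nested XML elements
            or "-" in text or "," in text    # possibly a plural
        )
        if needs_manual_check:
            return bibl  # left untouched for a human to revise
        parts.append(f"{LABELS[unit]} {text}")
    if not parts:
        return bibl
    for cr in ranges:
        bibl.remove(cr)
    mixed = ET.SubElement(bibl, "citedRange", {"unit": "mixed"})
    mixed.text = ", ".join(parts)
    return bibl


converted = flatten_to_mixed(
    '<bibl><ptr target="bib:Goris1954_01"/>'
    '<citedRange unit="volume">2</citedRange>'
    '<citedRange unit="page">319</citedRange></bibl>'
)
print(converted.find("citedRange").text)  # vol. 2, p. 319
```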

@arlogriffiths
Collaborator Author

Indeed, the main reason why we introduced <citedRange> and the various units was that this would enforce some level of consistency (in reference structure and display) across the project. I still believe this is important, given the very broad range of bibliographic cultures active in our project and the equally broad range of diligence in matters bibliographic on the part of our team members. I don't think we should let the (in my impression rather minor) rough edges of the system that we have in place lead us to any radical revision.

I don't understand what people could be dissatisfied about now that we have the option @unit="mixed", which gives complete freedom, doesn't it? Any "hacked" usage is probably due to people being unaware of the option @unit="mixed".

I am flexible about any mix of the variables presented so far, as long as we leave the basics of the present system intact.

Notably, I am willing to play along with the proposal to introduce explicit encoding of plural in the values of @unit and the partially automated, partially manual path to implementing the change that Michaël proposes. But I am also able to accept sticking to singular values only in exchange for some loss of flexibility elsewhere in order for the machine to be able to tell whether sg. or pl. is intended.

@michaelnmmeyer
Member

Note that Dan's proposal is close to LaTeX's behaviour: you have a special case for citing pages (e.g. \cite[43-45]{MyBook}, which produces "MyBook, pp. 43–45"), but everything else (volumes, etc.) has to be encoded manually. About one half of our citedRanges are basic page numbers (ranges or sequences of digits).

@arlogriffiths
Collaborator Author

arlogriffiths commented May 27, 2024

Even if it's only 50% of our references that we're talking about, I insist that we need a structuring mechanism such as the one we have in place.

@danbalogh
Collaborator

Fair enough, let's forget about discarding the existing units. This takes us back to the point where we need solutions for the following details:

  1. Correct display, wherever feasible, of plural units; and
  2. the original issue: italic display for headwords when the unit is "entry", preferably without using XML elements within <citedRange>.

Anything I missed?

For 1, my preferred solution would be to stick to the present units, and let the display transformation algorithm take care of plural display. Since this does not work perfectly in all circumstances, we need information about the cases where it does not (or is not expected to) work correctly, and assess whether any of those cases are systematic. For the systematic cases, it may be possible to add sub-rules for the transformation algorithm. For the non-systematic cases, we would then have to change the problematic citations to @unit="mixed", or live with the inaccurate display.

For 2, I think the best solution is to prescribe that <citedRange> must never contain any further XML elements (contrary to the earlier permission to use <foreign> in entries), except when @unit="mixed", where certain elements (only <foreign>? or also something else?) would be permitted. Next, always display the contents of <citedRange unit="entry"> in italics. And finally, instruct encoders not to put anything within <citedRange unit="entry"> other than headwords and, where applicable, commas; or, where a more complex citation is needed, to use @unit="mixed" instead.

In addition, regarding what I anticipate to be systematic cases in 1, I think it would make sense to prescribe in the EGD and EGC that the contents of <citedRange> with a @unit other than "mixed" must never include a comma or a hyphen unless a plural display is desired. With this rule, any reference where the thing itself contains a hyphen or comma would thus have to be encoded as @unit="mixed", with the singular or plural form written by hand as applicable. One (slightly more complex) alternative to this would be to stipulate the above rule (no hyphens or commas unless plural is intended) only for @unit="page", and for any other unit (i.e. other than "mixed" or "page"), stipulate that hyphens will not result in plural display, while commas will. My impression is that there exist a number of appendices, figures, plates, etc. with hyphens in their numbers, but very few or practically none with commas in their numbers; and conversely, that we may sometimes want to refer to several appendices, figures, plates, etc., but only very rarely want to refer to ranges of these units. With this setup, @unit="mixed" would have to be used for page numbers containing a hyphen, page numbers containing a comma, non-page numbers containing a comma, and ranges of non-page numbers. Since this is already getting a bit complex, the introduction of special plural units can also remain on the table.
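
The alternative rule in the paragraph above can be sketched as follows. This is an illustration under stated assumptions, not the app's actual algorithm; the function name `is_plural` is hypothetical.

```python
def is_plural(unit, text):
    """Decide whether a plural label is wanted under the alternative rule.

    For unit="page", a hyphen or a comma signals a plural. For every other
    unit except "mixed", only a comma does, so that hyphens inside
    appendix, figure or plate numbers stay harmless.
    """
    if unit == "mixed":
        raise ValueError("mixed ranges carry their own hand-written label")
    if unit == "page":
        return "-" in text or "," in text
    return "," in text


print(is_plural("page", "43-45"))          # True
print(is_plural("appendix", "A/1962-63"))  # False
print(is_plural("figure", "3, 5"))         # True
```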

@michaelnmmeyer
Member

I think it is important to keep the transformation rules simple, so that people can remember them and predict what the output will look like. Perhaps more importantly, they should not change (even for "improvements"), because this would inevitably introduce mistakes in existing entries.

So, I propose to stick to the core of Dan's comment. We would have:

  1. A citedRange that contains a comma or a hyphen (or some other dash) is considered to refer to several items, otherwise to a single one.
  2. <citedRange unit="entry"> is rendered in italics, as if it were wrapped in <foreign>.
  3. Hyphens are replaced with en dashes in <citedRange unit="page">. (I do not like this special case, but doing that everywhere might create problems.)
  4. If the displayed result is incorrect, or if some special format is needed, <citedRange unit="mixed"> should be used. XML elements are allowed only for @unit="mixed".
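
The four rules above could be sketched along these lines. This is a hedged illustration, not the project's display code: the function name, the label table and the use of HTML `<i>` for italics are all assumptions, and "vols." is an assumed label.

```python
# Hyphen-minus plus the Unicode hyphens and dashes U+2010 to U+2014.
DASHES = "-\u2010\u2011\u2012\u2013\u2014"

# (singular, plural) labels; hypothetical table.
LABELS = {
    "page": ("p.", "pp."),
    "volume": ("vol.", "vols."),
    "entry": ("s.v.", "s.vv."),
}


def render(unit, text):
    if unit == "mixed":
        return text  # rule 4: contents are displayed as written
    # Rule 1: a comma or any dash means several items.
    plural = "," in text or any(d in text for d in DASHES)
    label = LABELS[unit][1 if plural else 0]
    if unit == "page":
        text = text.replace("-", "\u2013")  # rule 3: en dash in page ranges
    if unit == "entry":
        text = f"<i>{text}</i>"  # rule 2: italics, as with <foreign>
    return f"{label} {text}"


print(render("page", "43-45"))
print(render("entry", "tanggung"))
```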

@danbalogh
Collaborator

All of this is acceptable to me, provided that the PIs are happy with it. The one thing that worries me is that, at least for the Indian subcontinent, there is a huge number of references to ARIE appendices for which we had specifically required the format <bibl><ptr target="bib:ARIE1962-1963"/><citedRange unit="page">49</citedRange><citedRange unit="appendix">A/1962-63</citedRange><citedRange unit="item">19</citedRange></bibl> (EGD Example 10.4.5.F). There may be similar cases (i.e. a citation type that is both numerous and includes a hyphen) in other corpora as well.
If we stick to the above, then we'll need a solution for these. Ideally, I would prefer if they did not have to be changed to @unit="mixed", because consistency is very difficult to maintain that way.
@michaelnmmeyer, would it be possible to A) auto-replace with an en dash every hyphen contained in a <citedRange unit="appendix"> that is a sibling of a <ptr> whose @target starts with bib:ARIE, and B) make sure that the algorithm for identifying plurals is sensitive only to hyphens, and not to en dashes?
If this solution is feasible, then we could also instruct encoders that in any context, if in the future they want a hyphen in <citedRange> other than mixed, but they don't want it displayed with a plural unit, then they can use an en-dash in place of the hyphen.
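
Steps A and B might look roughly like this, assuming Python's standard `xml.etree.ElementTree`; both function names are hypothetical. The sample element is the ARIE reference quoted earlier in this comment.

```python
import xml.etree.ElementTree as ET

EN_DASH = "\u2013"


def protect_arie_appendix(bibl):
    """Step A: turn hyphens into en dashes in ARIE appendix ranges."""
    ptr = bibl.find("ptr")
    if ptr is None or not ptr.get("target", "").startswith("bib:ARIE"):
        return
    for cr in bibl.findall("citedRange"):
        if cr.get("unit") == "appendix" and cr.text:
            cr.text = cr.text.replace("-", EN_DASH)


def looks_plural(text):
    """Step B: only a hyphen-minus or a comma counts, never an en dash."""
    return "-" in text or "," in text


bibl = ET.fromstring(
    '<bibl><ptr target="bib:ARIE1962-1963"/>'
    '<citedRange unit="page">49</citedRange>'
    '<citedRange unit="appendix">A/1962-63</citedRange>'
    '<citedRange unit="item">19</citedRange></bibl>'
)
protect_arie_appendix(bibl)
appendix = bibl.find("citedRange[@unit='appendix']")
print(appendix.text, looks_plural(appendix.text))
```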

@danbalogh
Collaborator

danbalogh commented Oct 29, 2024

My current proposal is as follows. I'm numbering each item so that the rest of you can refer to them easily.

  1. keep our present practice of allowing <citedRange> without @unit, always meaning page numbers and only permitted when a <bibl> element has only this one <citedRange> child. I think this has been tacitly accepted all along, and I just wanted to make the constraints very explicit.
    1.A. thus, @unit must be present whenever there is more than one <citedRange> and may be present when there is only one <citedRange> in a <bibl>.
  2. in accordance with Arlo's wish, keep most of the values of @unit currently listed in EGD §10.4.5, but introduce the following changes
    2.A. preferably, get rid of the value "book", which seems to have been introduced specifically to cater to Sircar's Select Inscriptions (wrongly called Indian Inscriptions in the present EGD). I think that a slight change to the definition of "part" could allow us to use "part" for this purpose. We should not need an idiosyncratic label for the sake of one publication, no matter how fundamental.
    2.A.1. If this is accepted, we need a check through the corpus to see if "book" occurs at all, and if it does, whether it happens in any citations other than Sircar's SI.
    2.B. for some of the permitted values, introduce explicit plural labels. I'm thinking especially of part, volume, section, entry, figure, plate, table and appendix, since the numbers of such units are liable to contain hyphens (and possibly commas), which would interfere with machine plural recognition. See 4.A below.
    2.C. [edited] Add "line" as a permitted value? This is not in the EGD, but it was listed in Use of <citedRange> in <bibl> #253.
  3. As regards the use of XML elements within <citedRange>, we should explicitly forbid the use of all elements in all cases (definitely including @unit="entry"), with the possible exception of @unit="mixed" where we might choose to allow <foreign> for italicisation.
  4. Display would remain as (I think) it is now, with the following changes:
    4.A. Automated plural detection would only be applied when there is no unit, or when the unit is page, note or item [edit: or "line" if allowing that unit]. For these units, the presence of a hyphen or a comma in the contents of <citedRange> would trigger identification as a plural and the corresponding display of a plural label (pages, etc.).
    4.A.1. Hyphens in <citedRange> with these units can, if desired, be changed to en dashes on display, but this change should not be hard-coded in the files (I think Michaël misunderstood me above on this; what I was suggesting is that where a hyphen is desired but would interfere with plural detection, an en dash could be hard-coded instead of the hyphen. But I think this would not be necessary if we restrict plural detection to these cases.)
    4.B. For all other units, a plural label would only be displayed if the value of @unit is explicitly plural.
    4.C. The contents of <citedRange unit="entry"> would be displayed in italics (with the label "s.v.")
    4.D.1. The contents of <citedRange unit="entries"> (with the label "s.vv.") would likewise be italicised, but for neatness' sake commas, semicolons and spaces following these could be reverted to non-italic in display, if feasible.
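
Items 4.A to 4.D.1 can be sketched together as below. This is only an illustration under several assumptions: the function names are hypothetical, italics are shown as HTML `<i>`, and of the labels only "p.", "pp.", "vol.", "s.v." and "s.vv." occur in this thread; the rest are placeholders. The optional en dash replacement of 4.A.1 is left out.

```python
import re

# 4.A: units with automated plural detection; None stands for no @unit.
AUTO_LABELS = {
    None: ("p.", "pp."),
    "page": ("p.", "pp."),
    "note": ("n.", "nn."),       # assumed labels
    "item": ("item", "items"),   # assumed labels
}
# 4.B: for all other units the number is stated by the value of @unit.
EXPLICIT_LABELS = {
    "volume": "vol.", "volumes": "vols.",
    "appendix": "appendix", "appendices": "appendices",
    "entry": "s.v.", "entries": "s.vv.",
}
SEPARATOR = re.compile(r"([,;]\s*)")


def italicise_headwords(text):
    """4.C and 4.D.1: italic headwords, roman commas and semicolons."""
    parts = SEPARATOR.split(text)
    return "".join(
        p if SEPARATOR.fullmatch(p) else f"<i>{p}</i>"
        for p in parts if p
    )


def render(unit, text):
    if unit == "mixed":
        return text
    if unit in AUTO_LABELS:
        singular, plural = AUTO_LABELS[unit]
        is_plural = "-" in text or "," in text
        return f"{plural if is_plural else singular} {text}"
    label = EXPLICIT_LABELS[unit]
    if unit in ("entry", "entries"):
        text = italicise_headwords(text)
    return f"{label} {text}"


print(render("entries", "alpha, beta"))
print(render("appendix", "A/1962-63"))
```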

Opinions, please.

@manufrancis
Collaborator

manufrancis commented Oct 31, 2024

Thanks Daniel!

My general stand is:

I am in favour of straightforward rules with as few exceptions as possible.
I would in general be as explicit as possible so as to avoid machine plural recognition or machine en dash recognition. There will be, I guess, many unrecognised or unforeseeable cases.
I think Michaël should not devote time to "trivial" things such as en dash, italicisation of entries, etc. TIME FLIES.

  1. I would rather avoid such an exception, which would result in people not using @unit when they should.

2.A. OK. I fully agree with "We should not need an idiosyncratic label for the sake of one publication, no matter how fundamental."

2.B. very OK for explicit plural labels.

  3. OK with no XML elements at all within <citedRange>. If someone feels the need to use <foreign> in <bibl>, let him italicise words in free text in the epigraphical lemma.

4.A. I would dispense with automated plural detection (since with 2.B we will have values for explicit plural labels).

4.A.1 I can live without en dashes, that is, without machine en dash recognition (those who like them would just need to type them in their XML files).

4.B. very OK.

4.C. OK for italicisation of contents of <citedRange unit="entry"> (the more so as it does not require using <foreign>).

4.D.1 Seems to me a case of idiosyncratic practice for very rare cases. Those who need to refer to several "entries" could just enter as many <bibl> as they have entries (or, if permitted, but I guess it is not, have several <citedRange unit="entry"> in a <bibl>) (or be explicit in free text in the epigraphical lemma).

4.E. [edit: newly added] Instead of explicit plurals, for some or all of the rarer units (to be discussed which), we could instead suggest using <citedRange unit="mixed"> and writing up a free-text citation when a plural is desired.

@danbalogh
Collaborator

Thank you, Manu. Some comments on your comments.

  1. NB that this (i.e. no @unit when there is only a page number reference) is what we've been doing so far, and as best I know, it has not caused any problems. I'm OK with forbidding this in the future if that's your preference, but I don't see a need to do so.
    4.A. Are you suggesting getting rid of all automated plural detection? It seems to work pretty well when it can be reliably based on hyphens and commas, and if we want to forbid it for pages (i.e. require @unit="pages" for page ranges and page lists), then there will be a lot of retroactive changes in the existing files.
    4.A.1. I don't think auto-replacing hyphens to en dashes in those specific circumstances needs a lot of effort from Michaël; actually, since afaik it's already in place, disabling it may take more effort than keeping it.
    4.D.1. Let's see what Arlo thinks about that. On my side, I'm OK with not being very pedantic about citation display. Note that the solution of using multiple <citedRange unit="entry"> elements, though technically not forbidden, would not result in the desired display (instead of "s.vv. alpha, beta" it would give us something like "s.v. alpha, s.v. beta"). There is, however, the option of using <citedRange unit="mixed">. Apropos of this, I'm adding a 4.E to the above list.
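
The difference described in 4.D.1 can be reproduced with a small sketch, assuming each <citedRange> is rendered on its own and the results are joined with commas. The function name is hypothetical, italics are omitted for brevity, and unit="entries" is the proposed plural value, not an existing one.

```python
import xml.etree.ElementTree as ET

ENTRY_LABELS = {"entry": "s.v.", "entries": "s.vv."}


def render_entries(bibl_xml):
    """Render every entry-type <citedRange> separately, comma-joined."""
    bibl = ET.fromstring(bibl_xml)
    return ", ".join(
        f"{ENTRY_LABELS[cr.get('unit')]} {cr.text}"
        for cr in bibl.findall("citedRange")
        if cr.get("unit") in ENTRY_LABELS
    )


several_elements = (
    '<bibl><ptr target="bib:Goris1954_01"/>'
    '<citedRange unit="entry">alpha</citedRange>'
    '<citedRange unit="entry">beta</citedRange></bibl>'
)
one_plural_element = (
    '<bibl><ptr target="bib:Goris1954_01"/>'
    '<citedRange unit="entries">alpha, beta</citedRange></bibl>'
)
print(render_entries(several_elements))    # s.v. alpha, s.v. beta
print(render_entries(one_plural_element))  # s.vv. alpha, beta
```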

@arlogriffiths
Collaborator Author

arlogriffiths commented Oct 31, 2024 via email

@danbalogh
Collaborator

Keeping things as they are is also acceptable to me. My impression is that the main motive for revising the citation system has always been your (@arlogriffiths ) desire to get meticulously styled displays, with plural units when applicable and italics where desired but nowhere else.
As for cost and benefit, as far as I can see, the above recommendation would only require retroactively changing citations involving one of the following:

  • @unit="book" if these exist at all and if we decide to get rid of this unit
  • units other than page, note or item (i.e. part, volume, section, entry, figure, plate, table and appendix) only when these contain a range or series

That's it. I don't think there could be thousands of these in the corpus.
NB, since we have long ago agreed that <citedRange unit="mixed"> is necessary for freetext references (instead of using no @unit as in the old EGD), existing citations without @unit will need some manual rechecking and revision no matter how we choose in the present matter.
So that's it for costs. As for benefits, the above system would allow us to get the nicely formatted "s.vv. alpha, beta" kind of display, and to display plural units correctly, whereas with the present system we'd either have no plural display for most kinds of unit (see here) or we'd need time-consuming and unreliable plural detection.

I'm OK to live with the costs and also OK to live without the benefits, so whatever you PIs can agree on.

@manufrancis
Collaborator

I concur with Dan:

I'm OK to live with the costs and also OK to live without the benefits,
so whatever the meticulous PI can agree on.

And I am ready to update my XML files accordingly (and have my team update theirs).
Maybe we can wait for Michaël's opinion on cost/benefit.

@manufrancis
Collaborator

@arlogriffiths : the ball seems to be in your court.
@michaelnmmeyer : any thoughts on this issue?
