Scripts #320

Closed
michaelnmmeyer opened this issue Jun 19, 2024 · 5 comments

@michaelnmmeyer
Member

We have a hierarchical classification of scripts on opentheso, as follows:

Kharoṣṭhī
Brāhmī
...Southern Brāhmī
......Vaṭṭeḻuttu
......Telugu
......Tamil
...Southeast Asian Brāhmī
......Sundanese
......Pyu
......Old West Javanese
...Northern Brāhmī
......Siddhamātr̥kā
......Nāgarī
......Mon-Burmese
......Khmer
......Kawi
......Kannada
......Grantha
......Gauḍī
......Chinese
......Cam
......Bhaikṣukī
......Batak
......Balinese
Arabic
...Jawi

The schema allows @rendition to reference scripts that have subcategories (Brāhmī, Southern Brāhmī, Southeast Asian Brāhmī, Northern Brāhmī and Arabic), and some of these are in use. The frequency distribution is as follows:

    864 Tamil
    767 Khmer
    635 Grantha
    544 Southern Brāhmī
     99 Cam
     58 Vaṭṭeḻuttu
     52 Kawi
     13 Kannada
      5 Brāhmī
      3 Undetermined
      1 Telugu
      1 Southeast Asian Brāhmī

This is inconvenient for machine processing; I am thinking about faceted search in particular. If, for instance, we want to figure out the number of inscriptions in Southern Brāhmī, it is necessary to count recursively: the answer is count(Southern Brāhmī) + count(Vaṭṭeḻuttu) + count(Telugu) + count(Tamil). The hierarchy is also not encoded in the schema.
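To make the recursive count concrete, here is a minimal sketch. It assumes the hierarchy is available as a plain mapping from a script to its sub-scripts (the names `CHILDREN`, `COUNTS` and `total_count` are made up for illustration; nothing here reflects how OpenTheso or the schema actually expose the data). The figures are the ones from the frequency list above.

```python
# Sub-scripts of each category, as listed in this issue
# (only the Southern Brāhmī branch is shown).
CHILDREN = {
    "Southern Brāhmī": ["Vaṭṭeḻuttu", "Telugu", "Tamil"],
}

# Number of inscriptions carrying each label, from the frequency list above.
COUNTS = {
    "Southern Brāhmī": 544,
    "Vaṭṭeḻuttu": 58,
    "Telugu": 1,
    "Tamil": 864,
}


def total_count(script: str) -> int:
    """Count inscriptions labelled with `script` or with any of its descendants."""
    own = COUNTS.get(script, 0)
    return own + sum(total_count(child) for child in CHILDREN.get(script, []))


print(total_count("Southern Brāhmī"))  # 544 + 58 + 1 + 864 = 1467
```

The category's own count has to be included, since the generic label is itself assigned to inscriptions.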

I would be much happier if we used a flat list of scripts (as for languages). For instance, the following:

Southern Brāhmī
...Vaṭṭeḻuttu
...Telugu
...Tamil

... could be transformed to:

Vaṭṭeḻuttu
Telugu
Tamil
Southern Brāhmī

... where "Southern Brāhmī" means "any Southern Brāhmī script that is not Vaṭṭeḻuttu, Telugu or Tamil". Can we agree on this interpretation?

@danbalogh
Collaborator

In my opinion, the list should absolutely not be flattened. You say that with the present list, if "we want to figure out the number of inscriptions in Southern Brāhmī, it is necessary to count recursively: the answer is count(Southern Brāhmī) + count(Vaṭṭeḻuttu) + count(Telugu) + count(Tamil)". In other words, counting the inscriptions in Southern Brāhmī is complicated but possible. With a flat list, counting the inscriptions in Southern Brāhmī would be simply impossible, since there would be no indication that Vaṭṭeḻuttu etc. are also kinds of Southern Brāhmī. Am I missing something here?

I do not understand what you mean by "The hierarchy is also not encoded in the schema." Why should it be and how could it be? It's an OpenTheso vocabulary, and the hierarchy is encoded there. If there is a problem in getting faceted search to "talk to" OpenTheso, then this should be worked out between you and Adeline.

One way I can think of to keep the cake and eat it would be to permit more than one "class" in @rendition, and to batch replace existing classes referring to a subcategory with references to both the higher category and the subcategory. But I find it hard to believe that there does not exist a simpler way.
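As a rough illustration of the batch replacement suggested above, the sketch below expands a single script label into the label plus its higher categories. The `PARENT` mapping and the function name `with_ancestors` are hypothetical, and listing every ancestor (rather than only the immediate parent) is an assumption; how the resulting labels would be serialised in @rendition is deliberately left open.

```python
# Child -> parent links for one branch of the hierarchy in this issue.
PARENT = {
    "Tamil": "Southern Brāhmī",
    "Telugu": "Southern Brāhmī",
    "Vaṭṭeḻuttu": "Southern Brāhmī",
    "Southern Brāhmī": "Brāhmī",
}


def with_ancestors(script: str) -> list[str]:
    """Return the script's ancestors (outermost first) followed by the script itself."""
    chain = [script]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return list(reversed(chain))


print(with_ancestors("Tamil"))  # ['Brāhmī', 'Southern Brāhmī', 'Tamil']
```

With labels expanded this way, a search for a higher category becomes a plain lookup, at the cost of storing redundant information in every file.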

@michaelnmmeyer
Member Author

What I am suggesting is a slight change in semantics. Currently, you can, for
instance, assign the script Southern Brāhmī to an inscription in Tamil. This is
valid, since Tamil is a descendant of Southern Brāhmī. But there is an overlap
between these two categories, so you cannot really know how many inscriptions
are in Tamil. If 5 inscriptions are assigned the Tamil script and 3 inscriptions
are assigned Southern Brāhmī, you can only tell that the actual number of
inscriptions in Tamil is between 5 and 8.
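The bound in this example can be written out as a small sketch. The function name `count_bounds` is made up for illustration, and it assumes the only source of uncertainty is inscriptions carrying the generic parent label.

```python
def count_bounds(specific: int, generic: int) -> tuple[int, int]:
    """Bounds on the true number of inscriptions in a script when the
    generic parent label may also have been used for that script.

    specific: inscriptions labelled with the script itself (e.g. Tamil)
    generic:  inscriptions labelled with the parent (e.g. Southern Brāhmī)
    """
    lower = specific            # none of the generic ones are in this script
    upper = specific + generic  # all of the generic ones are in this script
    return lower, upper


print(count_bounds(5, 3))  # (5, 8)
```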

Instead of this, I propose to remove Southern Brāhmī from the list of categories
you can choose when encoding inscriptions, and to replace it with another
category named, for instance, "Other Southern Brāhmī". This category would be
used for inscriptions that are in some Southern Brāhmī script except the ones
that are explicitly enumerated, viz. Vaṭṭeḻuttu, Telugu and Tamil. What matters to
me is that there is no overlap between the categories you can choose when encoding
an inscription.

Take the following hierarchy, for instance:

Brāhmī
...Southern Brāhmī
......Vaṭṭeḻuttu
......Telugu
......Tamil
...Southeast Asian Brāhmī
......Sundanese
......Pyu
......Old West Javanese

This would be replaced with:

Brāhmī
...Southern Brāhmī
......Vaṭṭeḻuttu
......Telugu
......Tamil
......Other Southern Brāhmī
...Southeast Asian Brāhmī
......Sundanese
......Pyu
......Old West Javanese
......Other Southeast Asian Brāhmī
...Other Brāhmī

And the schema would only allow you to choose between these categories:

Vaṭṭeḻuttu
Telugu
Tamil
Other Southern Brāhmī
Sundanese
Pyu
Old West Javanese
Other Southeast Asian Brāhmī
Other Brāhmī
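The list of selectable categories could be derived mechanically from the hierarchy rather than maintained by hand. The following is only a sketch: the nested-dictionary representation `TREE` and the function `selectable` are assumptions made for illustration, not part of the schema or of OpenTheso.

```python
# The example hierarchy from this comment, as nested dictionaries.
# A script with an empty dictionary has no sub-scripts.
TREE = {
    "Brāhmī": {
        "Southern Brāhmī": {"Vaṭṭeḻuttu": {}, "Telugu": {}, "Tamil": {}},
        "Southeast Asian Brāhmī": {
            "Sundanese": {},
            "Pyu": {},
            "Old West Javanese": {},
        },
    },
}


def selectable(tree: dict) -> list[str]:
    """List the non-overlapping categories an encoder may choose from:
    every leaf script, plus one "Other X" entry per category X that has children."""
    result = []
    for name, children in tree.items():
        if children:
            result.extend(selectable(children))
            result.append(f"Other {name}")
        else:
            result.append(name)
    return result


for category in selectable(TREE):
    print(category)
```

For the hierarchy above this prints the nine categories in the same order as the list just given.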

@danbalogh
Collaborator

I understand perfectly what you are proposing and I repeat: I do not consider this acceptable, for the reasons I explained above.

Tamil script is a kind (and not a descendant) of Southern Brāhmī and the same applies (mutatis mutandis) to all of the lower-level categories. A search to retrieve inscriptions in any kind of southern Brāhmī (including Tamil and all other subclasses) is a meaningful search that users might want to do, but it will become impossible in the scheme you propose, unless the search employs tick boxes for the script classes and the user must tick each kind of southern Brāhmī to accomplish that search.
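A search of this kind does not need tick boxes as long as the hierarchy is available to the search code. The sketch below is hypothetical throughout: the `PARENT` mapping, the inscription identifiers and the helper names are invented for illustration, and it assumes each inscription carries exactly one script label.

```python
from typing import Optional

# Child -> parent links for a few scripts from the hierarchy in this issue.
PARENT = {
    "Tamil": "Southern Brāhmī",
    "Southern Brāhmī": "Brāhmī",
    "Pyu": "Southeast Asian Brāhmī",
    "Southeast Asian Brāhmī": "Brāhmī",
    "Jawi": "Arabic",
}

# Hypothetical inscriptions, each with a single script label.
INSCRIPTIONS = [
    ("insc-1", "Tamil"),
    ("insc-2", "Southern Brāhmī"),
    ("insc-3", "Pyu"),
    ("insc-4", "Jawi"),
]


def is_kind_of(script: str, category: str) -> bool:
    """True if `script` is `category` itself or one of its subclasses."""
    current: Optional[str] = script
    while current is not None:
        if current == category:
            return True
        current = PARENT.get(current)
    return False


def search(category: str) -> list[str]:
    """Identifiers of inscriptions whose script falls under `category`."""
    return [ident for ident, script in INSCRIPTIONS if is_kind_of(script, category)]


print(search("Southern Brāhmī"))  # ['insc-1', 'insc-2']
print(search("Brāhmī"))           # ['insc-1', 'insc-2', 'insc-3']
```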

There is another reason, namely that script classes are pretty fuzzy. Tamil may be a special case, always recognisable to Tamilists as clearly and unequivocally different from non-Tamil (though I doubt that), but most of the other subclasses don't have a clear boundary where they begin. For instance, the script of some of my Eastern Cālukya inscriptions could arguably be labelled as Telugu. I don't think I've ever used that script class, but if I ever do (for one of my late inscriptions), that will not mean that there is an essential difference between the script of that inscription and the script of another one, say 50 years earlier, which has been labelled with the generic label. The lowest-level script classes exist because some texts use specifically nameable scripts. This is not so with the majority of inscriptions. A Tamilist working on an inscription would not classify its script as "southern Brāhmī" if they identified the script as Tamil. The higher-level labels are to be used when a lower level label does not apply unequivocally (see the Memo on Controlled Vocabularies), and not optionally or randomly. The situation where "you cannot really know how many inscriptions are in Tamil" does not arise, because an inscription written in (unequivocal) Tamil will be labelled as such by its encoder and because when you zoom in on the fuzzy boundary between Tamil and non-Tamil, the question ceases to be meaningful.

Finally, I should note that the hierarchy is important not only between the middle and lowest levels, but also between the highest and middle levels. Restricting a search to Brāhmī inscriptions (i.e. including all kinds of Brāhmī while excluding e.g. Arabic, Chinese or, in the future, if our system remains in use, Kharoṣṭhī) is a meaningful thing and a valuable research tool which would likewise be impossible (or very tedious) in a flat hierarchy.

@michaelnmmeyer
Member Author

"The higher-level labels are to be used when a lower level label does not apply unequivocally (see the Memo on Controlled Vocabularies), and not optionally or randomly."

So in fact we are agreeing. I can work with that.

@danbalogh
Collaborator

For the record, I see no agreement here as regards the topic of this thread, only in the functional detail that the token for a higher hierarchical class will only be used in encoding practice when none of the lower classes is applicable. If that satisfies you, then well and good. But this absolutely does not mean that the conceptual hierarchy is or can be flat. The higher categories incorporate their descendants.
