system: data types and URLs #1312
Comments
Note also the past collections:
|
As mentioned IRL, I suggest carefully distinguishing between the technical envelope (e.g. dataset) and its semantic meaning (e.g. ATLAS Derived Dataset). The differences in the former are going to be expressed via different JSON Schemas and URLs; the differences in the latter are going to be expressed as additional "labels" narrowing down the scope of the former. From this point of view, we could have something like:
I think the above five data types may perhaps be enough to start with. They should encode all the various record types we have on COD2. Just some raw brain dump to illustrate the thinking... Using a perspective that it is better to introduce fewer categories rather than more, and that it is OK to group some categories together on COD even though they would probably be split on CDS or INSPIRE. Any thoughts on the above proposal? |
The proposal goes in a direction similar to what we just discussed on our side. Based on the current collections, we thought about focusing on the following record types:
@tiborsimko this solution allows for some flexibility and extends the mockup by the miscellaneous type. Of course terms is another record type, as you indicated, but I did not list it before, because it does not appear on the page of the collections. Concerning the URL: We should make them as machine-readable as possible and use the title of the individual records, so e.g. /guides/how-to-analyze-muon-... |
I wouldn't like such a "miscellaneous" category, which would look just like a one-size-fits-all bag for everything not fitting elsewhere. It wouldn't be advantageous to introduce such a generic JSON Schema in my eyes, since we wouldn't be able to profit much from required/optional fields there, leading to record validation troubles for various sub-types hidden in this category. Moreover, a URL like If the objects under "miscellaneous" have something in common, then let's rather name this "commonness" fully; and if the objects are of too different nature, then let's rather not associate them in the first place? Looking at the full triad may be helpful:
(Leading to musings such as: OPERA events may be of the "dataset" type, but in this case we'd better mint DOIs for them... and if we want to use EventID as the persistent ID for them, then we'd better make a new dedicated event data type perhaps... Those kinds of considerations.) |
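To illustrate the point above about required/optional fields, here is a minimal sketch in plain Python. It models only the `required` keyword of JSON Schema; the field names (`title`, `doi`, `files`) and both schemas are assumptions made for illustration, not the actual COD3 schemas.

```python
# Hypothetical schemas: only the "required" keyword of JSON Schema is modelled.
DATASET_SCHEMA = {"required": ["title", "doi", "files"]}
# A catch-all schema can only require what every hidden sub-type has in common.
MISC_SCHEMA = {"required": ["title"]}


def missing_fields(record, schema):
    """Return the required field names that are absent from the record."""
    return [name for name in schema["required"] if name not in record]


# A made-up, incomplete record: it has a title but no DOI and no files.
incomplete = {"title": "Some dataset without files"}

print(missing_fields(incomplete, DATASET_SCHEMA))  # the dedicated schema flags the gaps
print(missing_fields(incomplete, MISC_SCHEMA))     # the catch-all schema flags nothing
```

Under these assumptions, the dedicated schema catches the incomplete record while the catch-all schema lets it through, which is the validation trouble described above.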
We should rethink using DOI as an id, since one reason for changing all data-types/urls is to make the URL more user-friendly. I think using DOIs (for example is even harder to remember than the current From my side I think slug-ifying the title is going towards a better solution or we can muse on something different.. |
@tiborsimko I agree that miscellaneous might sound devaluing. It could possibly have a more neutral description From what I see, besides the choice of name, our proposals are not very different. You considered what you called article to contain e.g. guides (which is part of our list). Then the only other collections that differ in our proposals are publications and miscellaneous. In your previous example, you assigned something that I considered to go into miscellaneous to go to publications. So, the only difference seems to be the naming of one group. If we found a more neutral name for this group, maybe this could present a solution? @pamfilos true, since we are spending a lot of effort on making things more user-friendly, displaying human-readable URLs should be a goal |
Using title-derived slugs for blog posts is a good strategy, e.g. myself I used Using title-derived slugs for datasets, software and other such primary content of the site wouldn't be a good strategy though in my eyes. This is because these titles are not really permanent. We have seen titles change over time as we have been adding content, we had to invent ad-hoc titles for say HLT configuration files that did not have any such "natural" title in people's minds, etc. This will be even more true in the future in the "live platform" stage, where people would have a deposit UI to add (and correct) information such as titles even without our intervention. Hence the use of DOIs (or record IDs) rather than title-derived IDs in the URLs, which has the advantage of better persistence. While we could probably use e.g. CMS dataset names in some cases (like It is a good URN design principle not to use semantically-charged persistent identifiers, but rather use opaque ones, since semantic information may change (such as titles in our case, or departments/groups in case of CDS), while opaque identifiers never do (such as DOI, ARK). Hence I would carefully differentiate when we can use something like a user-contributed title to generate a persistent URL and when we can not. Because cool URIs don't change 😄 P.S. I guess that our targeted end users don't really remember DOIs or record IDs and type them like that from the head anyway... or at least they shouldn't have to! We have a good and performant search for that. (And other semantically charged URLs such as |
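The persistence argument above can be sketched in a few lines of Python. This is only an illustration: the `slugify` helper and both titles are hypothetical, and the DOI is the example one from the issue description.

```python
import re


def slugify(title):
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


# Hypothetical record; the title is made up, the DOI is the issue's example.
record = {
    "doi": "10.7483/OPENDATA.CMS.QKAX.PSW6",
    "title": "HLT configuration file",
}

slug_url_before = "/dataset/" + slugify(record["title"])
doi_url_before = "/dataset/" + record["doi"]

# Someone later corrects the title, e.g. through a deposit UI.
record["title"] = "High-Level Trigger configuration file"

slug_url_after = "/dataset/" + slugify(record["title"])
doi_url_after = "/dataset/" + record["doi"]

print(slug_url_before, "->", slug_url_after)  # the title-derived URL has moved
print(doi_url_before, "->", doi_url_after)    # the opaque-identifier URL has not
```

The title-derived URL changes as soon as the title is edited, so every earlier link would need a redirect, whereas the DOI-based URL is unaffected.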
OK - to summarize: Then we have the challenge of the URL convention. Machine and human readable it should/could be, as persistent as possible naturally. Currently we only have DOIs for materials that could be cited. Not all content has a DOI, which is important to consider if we were to use DOI in the URL. I agree with Tibor's comment, of course: "Hence I would carefully differentiate when we can use something like a user-contributed title to generate a persistent URL and when we can not." |
To recap the discussion of Friday afternoon (@tiborsimko, please add anything that seems important to you):
|
@sefeg thanks. Who is responsible for getting the aforementioned input on "which record type we should assign current collections like 'CMS-Configuration-Files', 'CMS-Trigger-Information'"? We/you can (re)assign the issue. |
@suenjedt We (i.e. probably @tiborsimko and / or @daslerr and / or me) will need to talk to @katilp about this. We should make a more precise plan tomorrow afternoon |
Please include @ArtemisLav |
This is a part of the "COD3-Data-Model" package, so @ArtemisLav is naturally included 😄 (BTW they will end up being new record types, because configuration information is not really a software, and trigger information with run numbers etc is very specific that we'd like to take advantage of for searching, which needs a particular JSON Schema... I simply used these collections as an example that the proposal above is not exhaustive.) |
@sefeg and I have been revisiting this discussion and reviewing some of the things we have already done. We've determined that the main distinction we want to make for the "text stuff" content is news vs. everything else, and currently in the record types list we're already using I believe this exploratory part of the discussion can now be closed. |
@daslerr @sefeg The "docs" seems nicer than "articles" or "posts", but there is one thing to consider: if in the future we shall get to splitting up records into datasets, software, configuration etc independent schemas, as discussed in this issue, then please recall that we have been talking about "data-policies" becoming "documentation". So, if we do this, then we would have both "docs" and "documentation", being very close in name, yet describing two different things... We would have to either use a different name for "documentation", or simply merge the two concepts together (but that would require having a feature to "attach files to non-records", as we discussed live). Have you been thinking about the closeness of "/docs/foo" and "/documentation/foo" and how to resolve it, if we split up the "/record/NNN" in the future? |
Thanks for reminding me about that discussion. I'd forgotten. I agree that's confusing. Since data policies are therefore something of a special case, could we instead call the data policies "policies" when we get around to splitting records into types? And then we could still use the "documentation" or "docs" term for what is now known as "articles." Then it's matching a more general term ("documentation") to a more general category (the current "article" bucket). @tiborsimko what do you think? Between "documentation" and "docs" I have no preference other than that "docs" is shorter. |
In my eyes the data policy records don't "deserve" to be a separate entity... similarly to say "author lists", which is another example of the same documentation-like kind of material that we are hosting. Basically, it is a bunch of metadata fields, plus DOI, plus attached PDF, and that's all. One can look at them as "documents". So I'd rather not introduce "/policies/...", "/author-lists/...", "/open-data-instructions/..." etc -- but rather have one schema for them all, and expose them that way. This is why we thought that "/documentation/..." might be a good name here (rather than "/publication/..." that was mentioned in the table above for illustration purposes), as we discussed in #1323. |
This was clarified in an offline conversation. We can change the name of Therefore, |
In COD2, all records were stored in the MARC data model and exposed via `/record/123` kind of URLs using internal integer record IDs.

In COD3, we shall have several different record data types that can be exposed over different URLs. For example, (1) the glossary terms are now records using their own schema `term-v1.0.0.json` and are exposed via `/term/muon` kind of URLs; (2) news articles will use their own schema `article-v1.0.0.json` and are exposed via `/article/how-to-use-cms-vm` kind of URLs.

We have decided to split `record` into several types, for example like this:

They could be exposed using DOI persistent identifier as `/dataset/10.7483/OPENDATA.CMS.QKAX.PSW6`.

In this RFC we should muse about how many subsets of the `record` type we would like to introduce, which URLs we would like to use to expose them, and what kind of visible unique identifier we would like to use in place of record IDs.

Note also the design of the search facets in COD3, where facets could match the JSON Schema data model, which could match the URL.
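As a rough sketch of the idea that each record type carries its own schema, URL prefix and visible identifier, the mapping could look like the following. The `RECORD_TYPES` table, the `id_field` names and the `dataset-v1.0.0.json` schema name are assumptions for illustration; only the term and article schema names and the example URLs come from the text above.

```python
# Hypothetical registry: one entry per record type.
RECORD_TYPES = {
    "term": {"schema": "term-v1.0.0.json", "id_field": "slug"},
    "article": {"schema": "article-v1.0.0.json", "id_field": "slug"},
    # The dataset schema file name is assumed, not taken from the issue.
    "dataset": {"schema": "dataset-v1.0.0.json", "id_field": "doi"},
}


def record_url(record):
    """Build the public URL of a record from its type and visible identifier."""
    kind = record["type"]
    id_field = RECORD_TYPES[kind]["id_field"]
    return "/{}/{}".format(kind, record[id_field])


print(record_url({"type": "term", "slug": "muon"}))
print(record_url({"type": "article", "slug": "how-to-use-cms-vm"}))
print(record_url({"type": "dataset", "doi": "10.7483/OPENDATA.CMS.QKAX.PSW6"}))
```

In this sketch the type name doubles as the URL prefix, so the facet value, the schema and the URL stay aligned; whether that alignment is desirable for every type is exactly what this RFC should settle.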

P.S. Possible redirections from old URLs `/record/123` to new URLs `/dataset/10.7483xxxxx` will be tackled afterwards.
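One possible shape for such redirections, purely as a sketch: a lookup table from legacy integer record IDs to new URLs. The `LEGACY_TO_NEW` table and the pairing of record 123 with the example DOI are made up for illustration; nothing here is decided.

```python
import re

# Hypothetical mapping from COD2 record IDs to COD3 URLs (made-up pairing).
LEGACY_TO_NEW = {
    123: "/dataset/10.7483/OPENDATA.CMS.QKAX.PSW6",
}


def redirect_target(path):
    """Return the new URL for a legacy /record/<id> path, or None if unknown."""
    match = re.fullmatch(r"/record/(\d+)", path)
    if match is None:
        return None  # not a legacy record URL
    return LEGACY_TO_NEW.get(int(match.group(1)))


print(redirect_target("/record/123"))
print(redirect_target("/record/999"))  # no entry in the table
```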