system: data types and URLs #1312

Closed
tiborsimko opened this issue Aug 22, 2017 · 19 comments

Comments

@tiborsimko
Member

In COD2, all records were stored in MARC data model and exposed via /record/123 kind of URLs using internal integer record IDs.

In COD3, we shall have several different record data types that can be exposed over different URLs.
For example, (1) the glossary terms are now records using their own schema term-v1.0.0.json and are exposed via /term/muon kind of URLs; (2) news articles will use their own schema article-v1.0.0.json and are exposed via /article/how-to-use-cms-vm kind of URLs.

We have decided to split record into several types, for example like this:

  • dataset
  • tool
  • software
  • analysis
  • event
  • guide
  • document
  • ...

They could be exposed using DOI persistent identifier as /dataset/10.7483/OPENDATA.CMS.QKAX.PSW6.

In this RFC we should muse about how many subsets of the record type we would like to introduce, which URLs we would like to use to expose them, and what kind of visible unique identifier we would like to use in place of record IDs.

Note also the design of the search facets in COD3, where the facets could match the JSON Schema data model, which in turn could match the URL.

P.S. Possible redirections from old URLs /record/123 to new URLs dataset/10.7483xxxxx will be tackled afterwards.
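A rough sketch of how such a redirection could look, assuming a lookup table from old integer record IDs to the new record type and persistent identifier. The table contents, the placeholder DOI and the name `redirect_target` are invented for illustration; nothing here is decided.

```python
import re
from typing import Optional

# Hypothetical lookup: COD2 integer record ID -> (new record type, persistent ID).
# The DOI below is a placeholder, not a real record.
LEGACY_TO_NEW = {
    123: ("dataset", "10.7483/OPENDATA.EXAMPLE.AAAA.BBBB"),
}

_OLD_RECORD_URL = re.compile(r"/record/([0-9]+)")


def redirect_target(old_path: str) -> Optional[str]:
    """Return the new URL for an old /record/<id> path, or None if unknown."""
    match = _OLD_RECORD_URL.fullmatch(old_path)
    if match is None:
        return None  # not an old-style record URL
    entry = LEGACY_TO_NEW.get(int(match.group(1)))
    if entry is None:
        return None  # record ID has no known new location
    record_type, persistent_id = entry
    return f"/{record_type}/{persistent_id}"
```

Note that a DOI contains a slash, so whatever route serves the new URLs would have to accept an identifier spanning more than one path segment.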

@suenjedt
Member

@sefeg

@tiborsimko
Member Author

Note also the past collections:

        name = 'CMS'
        name = 'CMS-Primary-Datasets'
        name = 'CMS-Derived-Datasets'
        name = 'ALICE'
        name = 'ALICE-Derived-Datasets'
        name = 'ALICE-Tools'
        name = 'CMS-Tools'
        name = 'CMS-Validated-Runs'
        name = 'CMS-Learning-Resources'
        name = 'ALICE-Reconstructed-Data'
        name = 'ATLAS'
        name = 'ATLAS-Derived-Datasets'
        name = 'ATLAS-Learning-Resources'
        name = 'ATLAS-Tools'
        name = 'LHCb'
        name = 'LHCb-Derived-Datasets'
        name = 'LHCb-Tools'
        name = 'LHCb-Learning-Resources'
        name = 'ALICE-Learning-Resources'
        name = 'CMS-Open-Data-Instructions'
        name = 'Author-Lists'
        name = 'Data-Policies'
        name = 'ATLAS-Higgs-Challenge-2014'
        name = 'CMS-Simulated-Datasets'
        name = 'CMS-Validation-Utilities'
        name = 'CMS-Trigger-Information'
        name = 'CMS-Condition-Data'
        name = 'CMS-Configuration-Files'
        name = 'ATLAS-Simulated-Datasets'
        name = 'OPERA'
        name = 'OPERA-Electronic-Detector-Datasets'
        name = 'OPERA-Emulsion-Detector-Datasets'
        name = 'OPERA-Detector-Events'
        name = 'CMS-Luminosity-Information'

@tiborsimko
Member Author

As mentioned IRL, I suggest carefully distinguishing between the technical envelope (e.g. dataset) and its semantic meaning (e.g. ATLAS Derived Dataset). The differences in the former are going to be expressed via different JSON Schemas and URLs; the differences in the latter are going to be expressed as additional "labels" narrowing down the scope of the former. From this point of view, we could have something like:

  • dataset representing a set of data files

    • can be used for collision data, simulated data, derived data, event data... etc
  • software representing a computer code

    • for source code of example analyses and of software tools we use
    • possibly also for VM images, unless we want to split them into "environments" or something
    • possibly also for configuration files such as HLT unless we split these apart as well
  • publication representing traditional publications

    • arXiv articles, published papers, physics analysis summaries, data policies of experiments
    • possibly also for conference presentations and videos, unless we want to single these apart
    • possibly also for author lists and other miscellanea unless we want to split these apart
  • term representing ontology of glossary terms (DONE)

  • article representing blog stories and dynamic guides (DONE)

    • the word "article" may be perhaps confusing with "publication" above, so perhaps invent something like "post" or "page" or "help" or "guide" or something here instead

I think the above five data types may be perhaps enough to start with. They should encode all the various record types we have on COD2.

Just some raw brain dump to illustrate the thinking... taking the perspective that it is better to introduce fewer categories rather than more, and that it is OK to group some categories together on COD even though they would probably be split on CDS or INSPIRE.

Any thoughts on the above proposal?

@sefeg
Member

sefeg commented Aug 23, 2017

The proposal goes in a direction similar to what we just discussed on our side. Based on the current collections, we thought about focusing on the following record types:

  • Datasets

    • Primary, Derived, Simulated, Event data, Electron-Detector, Emulsion-Detector
  • Software / Tools

  • Guides / Tutorials

    • all the learning resources and even content like challenges, that can be presented in one article (the mockup is already designed to link relevant datasets for example)
  • Miscellaneous

    • contains everything else: including news, announcements, blog type posts, data policies, author lists, etc. For those contents that already exist (i.e. Data-Policies), we can just use those collection names as filter elements in the faceting of the miscellaneous type

@tiborsimko this solution allows for some flexibility and extends the mockup by the miscellaneous type. Of course terms is another record type, as you indicated, but I did not list it before, because it does not appear on the page of the collections.

Concerning the URLs: we should make them as machine-readable as possible and use the title of the individual records, so e.g. /guides/how-to-analyze-muon-...
Here, we have to define at a later stage the maximum number of characters used.

@tiborsimko
Member Author

Miscellaneous [...] contains everything else [...]

I wouldn't like such a "miscellaneous" category, which would look just like a one-size-fits-all bag for everything not fitting elsewhere. It wouldn't be advantageous to introduce such a generic JSON Schema in my eyes, since we wouldn't be able to profit much from required/optional fields there, leading to record validation troubles for various sub-types hidden in this category. Moreover, a URL like /miscellaneous/<doi> would not look that great either 😄 for say data policy records.

If the objects under "miscellaneous" have something in common, then let's rather name this "commonness" fully; and if the objects are of too different nature, then let's rather not associate them in the first place?
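To illustrate the validation concern with a toy example: the field names and per-type requirements below are invented, not taken from any actual schema. A catch-all type can require almost nothing, so incomplete records pass that would fail under a more specific type.

```python
# Hypothetical required fields per data type (illustration only).
REQUIRED_FIELDS = {
    "dataset": {"title", "doi", "files"},
    "publication": {"title", "doi"},
    # A one-size-fits-all bag can only require what every sub-type shares.
    "miscellaneous": {"title"},
}


def missing_fields(record_type: str, record: dict) -> list:
    """Return the sorted names of required fields absent from the record."""
    return sorted(REQUIRED_FIELDS[record_type] - set(record))


# A data policy record that lacks its DOI:
policy = {"title": "ATLAS data policy"}
print(missing_fields("miscellaneous", policy))  # nothing reported as missing
print(missing_fields("publication", policy))    # the missing DOI is caught
```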

Looking at the full triad may be helpful:

| object | data type | URL |
|---|---|---|
| LHCb derived datasets | dataset-v1.0.0 | /dataset/<doi> |
| CMS 2010 validation code | software-v1.0.0 | /software/<doi> |
| muon glossary term | term-v1.0.0 | /term/muon |
| ATLAS data policy | publication-v1.0.0 | /publication/<doi> |
| CMS VM 2011 how-to guide | article-v1.0.0 | /article/how-to-use-cms-2011-vm |
| ... | ... | ... |

(Leading to musings such as: OPERA events may be of the "dataset" type, but in this case we'd better mint DOIs for them... and if we want to use EventID as the persistent ID for them, then we'd better make a new dedicated event data type perhaps... Those kinds of considerations.)
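The triad could be captured in a small registry, sketched here under the assumption that each data type carries its schema name and the kind of identifier used in its URL. The class `RecordType`, the `id_kind` field and the registry layout are hypothetical; only the type names, schema names and URL shapes come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordType:
    name: str      # also used as the URL prefix
    schema: str    # JSON Schema name
    id_kind: str   # "doi" (opaque) or "slug" (name-like), an assumed label

    def url(self, identifier: str) -> str:
        """Build the public URL for a record of this type."""
        return f"/{self.name}/{identifier}"


RECORD_TYPES = {
    t.name: t
    for t in (
        RecordType("dataset", "dataset-v1.0.0", "doi"),
        RecordType("software", "software-v1.0.0", "doi"),
        RecordType("term", "term-v1.0.0", "slug"),
        RecordType("publication", "publication-v1.0.0", "doi"),
        RecordType("article", "article-v1.0.0", "slug"),
    )
}

print(RECORD_TYPES["term"].url("muon"))
print(RECORD_TYPES["article"].url("how-to-use-cms-2011-vm"))
```

Adding a dedicated event type, as mused above, would then amount to one more registry entry with its own schema and identifier kind.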

@pamfilos
Member

We should rethink using DOI as an id, since one reason for changing all data-types/urls is to make the URL more user-friendly.

I think using DOIs (for example
/dataset/10.7483/OPENDATA.CMS.CB8H.MFFA or
/dataset/CMS.CB8H.MFFA)
is even harder to remember than the current /record/700.

From my side, I think slug-ifying the title is a step towards a better solution
(e.g. /dataset/dimuon-event-information-derived-from-the-run2010b-public-mu-dataset)

or we can muse on something different..
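For reference, a minimal sketch of what slug-ifying a title could mean, using only the standard library. The function name, the `max_length` parameter (the character limit is still to be defined) and the example title are assumptions for illustration.

```python
import re
import unicodedata
from typing import Optional


def slugify(title: str, max_length: Optional[int] = None) -> str:
    """Turn a free-text title into a lowercase, hyphen-separated URL slug."""
    # Decompose accented characters and drop anything that is not ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    if max_length is not None:
        # Plain truncation; avoid leaving a trailing hyphen.
        slug = slug[:max_length].rstrip("-")
    return slug


print(slugify("How to use CMS VM"))
print(slugify("How to use CMS VM", max_length=10))
```

Any later edit to the title yields a different slug, so the slug would have to be stored once rather than recomputed if the URL is meant to stay stable.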

@sefeg
Member

sefeg commented Aug 23, 2017

@tiborsimko I agree that miscellaneous might sound devaluating. It could possibly have a more neutral description

From what I see, besides the choice of name, our proposals are not very different. You considered what you called article to contain e.g. guides (which is part of our list)

Then the only other collection that differs in our proposals is publications and miscellaneous. In your previous example, you assigned something that I considered to go into miscellaneous to go to publications. So, the only difference seems to be the naming of one group. If we found a more neutral name for this group, maybe this could present a solution?

@pamfilos true, since we are spending a lot of effort on making things more user-friendly, displaying human-readable URLs should be a goal.

@tiborsimko
Member Author

Using title-derived slugs for blog posts is a good strategy, e.g. myself I used /article/how-to-use-cms-vm in the original description. This is because we have quite a control over them.

Using title-derived slugs for datasets, software and other such primary content of the site wouldn't be a good strategy though in my eyes. This is because these titles are not really permanent. We have seen titles change over time as we have been adding content, we had to invent ad-hoc titles for say HLT configuration files that did not have any such "natural" title in people's minds, etc. This will be even more true in the future in the "live platform" stage, where people would have a deposit UI to add (and correct) information such as titles even without our intervention.

Hence the use of DOIs (or record IDs) rather than title-derived IDs in the URLs, which has an advantage of better persistency. While we could probably use e.g. CMS dataset names in some cases (like /VBF_ToHToZZTo4L_M-115_7TeV-powheg-pythia6/Summer11LegDR-PU_S13_START53_LV6-v1/AODSIM) as nice and meaningful persistent identifiers, this would not work so well for other cases such as CMS event files that don't have any such agreed standard names (e.g. one record's title currently says Z to ee candidate events for public use which is clearly not a good base for creating a cross-experiment slug of sorts). Hence the usage of DOI, or record ID, as opaque identifiers.

It is a good URN design principle not to use semantically-charged persistent identifiers, but rather use opaque ones, since semantic information may change (such as titles in our case, or departments/groups in case of CDS), while opaque identifiers never do (such as DOI, ARK).

Hence I would carefully differentiate when we can use something like a user-contributed title to generate a persistent URL and when we cannot. Because cool URIs don't change 😄

P.S. I guess that our targeted end users don't really remember DOIs or record IDs and type them like that from the head anyway... or at least they shouldn't have to! We have a good and performant search for that. (And other semantically charged URLs such as /collection/cms-derived-datasets.)

@suenjedt
Member

OK - to summarize:
We agree on the data types, i.e. the aforementioned categories:
dataset, software/tools, guide/tutorials, terms, plus a publications/miscellaneous/general group.

Then we have the challenge of the URL convention. It should/could be machine- and human-readable, and naturally as persistent as possible. Currently we only have DOIs for materials that could be cited. Not all content has a DOI, which is important to consider if we were to use the DOI in the URL. I agree with Tibor's comment, of course: "Hence I would carefully differentiate when we can use something like a user-contributed title to generate a persistent URL and when we can not."

@sefeg
Member

sefeg commented Aug 28, 2017

To recap the discussion of Friday afternoon (@tiborsimko, please add anything that seems important to you):

  • We proceed with those five categories: Datasets, Software (/ Tools), Terms, Guides (/ Articles / Tutorials), Publications (/ Documents / ...)

  • Regarding those categories, we have to:

    • Find / Discuss / Decide on final names: No priority for the moment, as this can always be changed easily
    • Ask for input to know, to which record type we should assign current collections like 'CMS-Configuration-Files', 'CMS-Trigger-Information': We should do this soon. This is however not a blocker at the moment, as we will start to focus on Datasets, Software and Guides
  • Where we manually select a meaningful title of a record ourselves (which applies especially for articles), we show a human-readable URL.

@suenjedt
Member

@sefeg thanks. Who is responsible for getting the aforementioned input on "which record type we should assign current collections like 'CMS-Configuration-Files', 'CMS-Trigger-Information'"? We/you can (re)assign the issue.

@sefeg
Member

sefeg commented Aug 28, 2017

@suenjedt We (i.e. probably @tiborsimko and / or @daslerr and / or me) will need to talk to @katilp about this. We should make a more precise plan tomorrow afternoon.

@suenjedt
Member

Please include @ArtemisLav

@tiborsimko
Member Author

This is a part of the "COD3-Data-Model" package, so @ArtemisLav is naturally included 😄

(BTW they will end up being new record types, because configuration information is not really a software, and trigger information with run numbers etc is very specific that we'd like to take advantage of for searching, which needs a particular JSON Schema... I simply used these collections as an example that the proposal above is not exhaustive.)

@daslerr

daslerr commented Nov 9, 2017

@sefeg and I have been revisiting this discussion and reviewing some of the things we have already done. We've determined that the main distinction we want to make for the "text stuff" content is news vs. everything else, and currently in the record types list we're already using Documentation for everything that isn't news, which makes sense. So, with that in mind, for the URLs, the news items that are currently under /articles should instead be /news and every other text record should be under /docs. Things classed as Documentation will still be able to have secondary types, like Guides and so on, but this won't affect the URLs.

I believe this exploratory part of the discussion can now be closed.

@daslerr daslerr closed this as completed Nov 9, 2017
@tiborsimko
Member Author

@daslerr @sefeg The "docs" seems nicer than "articles" or "posts", but there is one thing to consider: if in the future we shall get to splitting up records into datasets, software, configuration etc independent schemas, as discussed in this issue, then please recall that we have been talking about "data-policies" becoming "documentation". So, if we do this, then we would have both "docs" and "documentation", being very close in name, yet describing two different things... We would have to either use a different name for "documentation", or simply merge the two concepts together (but that would require having a feature to "attach files to non-records", as we discussed live). Have you been thinking about the closeness of "/docs/foo" and "/documentation/foo" and how to resolve it, if we split up the "/record/NNN" in the future?

@daslerr

daslerr commented Nov 13, 2017

Thanks for reminding me about that discussion. I'd forgotten. I agree that's confusing.

Since data policies are therefore something of a special case, could we instead call the data policies "policies" when we get around to splitting records into types? And then we could still use the "documentation" or "docs" term for what is now known as "articles." Then it's matching a more general term ("documentation") to a more general category (the current "article" bucket). @tiborsimko what do you think?

Between "documentation" and "docs" I have no preference other than that "docs" is shorter.

@tiborsimko
Member Author

In my eyes the data policy records don't "deserve" to be a separate entity... similarly to say "author lists", which is another example of the same documentation-like kind of material that we are hosting. Basically, it is a bunch of metadata fields, plus DOI, plus attached PDF, and that's all. One can look at them as "documents". So I'd rather not introduce "/policies/...", "/author-lists/...", "/open-data-instructions/..." etc -- but rather have one schema for them all, and expose them that way. This is why we thought that "/documentation/..." might be a good name here (rather than "/publication/..." that was mentioned in the table above for illustration purposes), as we discussed in #1323.

@daslerr

daslerr commented Nov 13, 2017

This was clarified in an offline conversation. We can change the name of /articles/... to /documentation/... now. When the time comes to split out the /record/... namespace, we can merge policies and author lists and other documentation-like records into /documentation/... and update or homogenize the data model as necessary at that time.

Therefore, /articles will now be /documentation and can be renamed everywhere accordingly.
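If old /articles links should keep working after the rename, one possible approach is a prefix rewrite. This is only a sketch: the mapping name `RENAMED_PREFIXES` and the function `rewrite_path` are hypothetical, and whether redirects are wanted at all was not discussed here.

```python
# Hypothetical table of renamed URL prefixes: old -> new.
RENAMED_PREFIXES = {
    "/articles": "/documentation",
}


def rewrite_path(path: str) -> str:
    """Map a path under a renamed prefix to its new location.

    Paths that do not sit under a renamed prefix are returned unchanged.
    """
    for old, new in RENAMED_PREFIXES.items():
        # Match the prefix only on a path-segment boundary, so that
        # e.g. "/articles-archive" is left alone.
        if path == old or path.startswith(old + "/"):
            return new + path[len(old):]
    return path


print(rewrite_path("/articles/how-to-use-cms-vm"))
```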
