Feature Request/Idea: Standardize standard license configuration #8512

Closed
philippconzett opened this issue Mar 19, 2022 · 33 comments · May be fixed by #9262
Labels
Feature: Admin Guide Feature: Harvesting Feature: Terms & Licensing GREI 6 Connect Digital Objects HERMES related to @hermes-hmc work on Dataverse code Type: Suggestion an idea User Role: Depositor Creates datasets, uploads data, etc. User Role: Sysadmin Installs, upgrades, and configures the system, connects via ssh

Comments

@philippconzett
Contributor

philippconzett commented Mar 19, 2022

Overview of the Feature Request
With version 5.10, the long-awaited multiple-license support was released (see release notes). Thanks to all contributors!
To better support interoperability between Dataverse installations and beyond Dataverse installations, I'd like to suggest standardizing the way standard license configuration is managed using multiple-license support as follows:

  1. Standardized licenses are provided as authoritative JSON files stored in the IQSS/dataverse GitHub repository or the GDCC GitHub repository.
  2. The section Adding Licenses in the Dataverse Installation Guide links to the GitHub folder containing the JSON files.
  3. We agree on the source and content of the elements in the JSON files. Here are some suggestions and possible issues to discuss:
  • Source: Whenever possible, we use the information provided in the SPDX License List and the license webpage provided by the license issuer.
    JSON elements:
  • name: Field Identifier in the SPDX License List, but without hyphens, e.g., Artistic-2.0 > Artistic 2.0
  • uri: 1) License URI provided by the license issuer; 2) if (1) is not available, SPDX URI for the license
  • shortDescription: Field Full name in the SPDX License List. Question 1: Do we need to add "License" or "Dedication", as is currently done in the JSON files provided in the Dataverse Installation Guide? Question 2: Do we need to add a full stop at the end of the shortDescription element, as is currently done in the JSON files provided in the Dataverse Installation Guide?
  • iconUrl: As provided by license issuer
  4. We agree on which standard licenses to provide as JSON files in the GitHub repository. To start with, I suggest we concentrate on the following ones:
  • Creative Commons Zero 1.0
  • All Creative Commons Attribution (BY) licenses 4.0 and later
  • Open Data Commons Open Database License v1.0
  • Open Data Commons Attribution License v1.0
  • All licenses in the SPDX License List that are FSF Free/Libre and OSI Approved, starting with licenses included in Open Science Framework (OSF):

Content:

  • CC0 1.0 Universal
  • CC-BY Attribution 4.0 International

Code - Permissive:

  • MIT License
  • Apache License 2.0
  • BSD 2-Clause "Simplified" License
  • BSD 3-Clause "New"/"Revised" License

Code - Copyleft:

  • GNU General Public License (GPL) 3.0
  • GNU General Public License (GPL) 2.0

Code - Other:

  • Artistic License 2.0
  • Eclipse Public License 1.0
  • GNU Lesser General Public License (LGPL) 3.0
  • GNU Lesser General Public License (LGPL) 2.1
  • Mozilla Public License 2.0

Following the suggested guidelines above, I have created a Google spreadsheet containing the necessary information to create JSON files, and I created those files by running a bash file. All these documents are available in this Google folder (you might need to log in to access it).
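
The spreadsheet-to-JSON step described above could look roughly like the following Python sketch. It is not the bash script from the Google folder. The column names (`spdx_id`, `full_name`, `issuer_uri`, `icon_url`), the output file naming, and the exact form of the SPDX fallback URI are assumptions made for illustration; the JSON keys follow the elements proposed in point 3.

```python
import csv
import io
import json
from pathlib import Path

# Assumed form of the SPDX fallback URI (not confirmed in this thread).
SPDX_BASE = "https://spdx.org/licenses/"


def row_to_license(row: dict) -> dict:
    """Build one license record from a spreadsheet row (hypothetical column names)."""
    spdx_id = row["spdx_id"].strip()
    issuer_uri = (row.get("issuer_uri") or "").strip()
    record = {
        # name: SPDX identifier with hyphens replaced by spaces
        "name": spdx_id.replace("-", " "),
        # uri: issuer URI if available, otherwise the SPDX URI
        "uri": issuer_uri or SPDX_BASE + spdx_id,
        # shortDescription: SPDX full name, taken over unchanged
        "shortDescription": row["full_name"].strip(),
    }
    icon_url = (row.get("icon_url") or "").strip()
    if icon_url:
        record["iconUrl"] = icon_url
    return record


def write_license_files(csv_text: str, out_dir: Path) -> list[Path]:
    """Write one <SPDX id>.json file per CSV row and return the written paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = row_to_license(row)
        path = out_dir / f"{row['spdx_id'].strip()}.json"
        path.write_text(json.dumps(record, indent=2) + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Keeping the row-to-record mapping in its own function means the naming rules can be reviewed and changed in one place, independent of where the rows come from.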

At a later stage, this could of course be automated by retrieving information directly from SPDX and license issuers, possibly via a controlled vocabulary hosted on SKOSMOS.

What kind of user is the feature intended for?
The suggested feature is primarily intended for Sysadmins who need to install licenses on their Dataverse installation.

What inspired the request?
The implementation of multiple license support released in v5.10.

What existing behavior do you want changed?
The different Dataverse installations adding the same standard license with (slightly) different license information.

Any brand new behavior do you want to add to Dataverse?
No, thanks.

Any related open or closed issues to this feature request?
Multiple licences feature proposal #7440

@qqmyers
Member

qqmyers commented Mar 19, 2022

FWIW:

  • Files for the CC licenses are already included in v5.10. They do not follow your SPDX advice above as SPDX uses URLs that redirect to their site rather than the CC URLs as defined by CC. My guess is that this is also the case for other licenses and we should probably figure out a best practice for that.
  • Another thing to pay attention to is that SPDX doesn't mark things as obsolete, e.g. https://spdx.org/licenses/CC-PDDC.html doesn't give any indication that the license is deprecated as https://creativecommons.org/licenses/publicdomain/ does (unless you manually click through from SPDX).
  • We currently use http:// URLs for the CC licenses. At one time I think there was a best practice to use http for identifiers regardless of whether access would be https or not. On the CC site, they seem to have a mix, and they do redirect http -> https so both work for access. However, since our database currently has the full URL string, http and https prefixes currently mean the URLs don't match. So - another place where we could/should have a best practice and/or we should change the code to accept either.
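
To make the http/https point concrete, here is a minimal Python sketch of a comparison that treats both prefixes as equivalent. It is not Dataverse code; the function names are hypothetical, and ignoring a trailing slash is an extra assumption that goes beyond what is described above.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_license_uri(uri: str) -> str:
    """Return a comparison key that ignores http vs. https and a trailing slash."""
    parts = urlsplit(uri.strip())
    # Treat http and https as the same scheme for matching purposes.
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    host = parts.netloc.lower()
    # Assumption: a trailing slash does not distinguish two license URIs.
    path = parts.path.rstrip("/")
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))


def same_license_uri(a: str, b: str) -> bool:
    """True if two URI strings point at the same license under the rules above."""
    return normalize_license_uri(a) == normalize_license_uri(b)
```

The stored identifier stays untouched in this approach; only the lookup key is normalized, so either spelling can be accepted on input.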

@philippconzett
Contributor Author

Thanks for your comments, @qqmyers. Just a short reply to your first bullet point. I think the way you have done this in v5.10 is already in line with my suggestion; cf.

uri: 1) License URI provided by the license issuer; 2) if (1) is not available, SPDX URI for the license

I think we only should use the SPDX URI when there is no (authoritative) URI provided by the license issuer.

@poikilotherm poikilotherm added Feature: Terms & Licensing Feature: Admin Guide User Role: Sysadmin Installs, upgrades, and configures the system, connects via ssh User Role: Depositor Creates datasets, uploads data, etc. Working Group: SWC HERMES related to @hermes-hmc work on Dataverse code labels Mar 21, 2022
@poikilotherm
Contributor

poikilotherm commented Mar 21, 2022

Thank you, @philippconzett, for starting this discussion. This is related to proper future software support (important for our project HERMES), so I'm taking the liberty to join it.

As 5.10 included the first iteration of multi license support, I think we should be very careful when taking the next steps.

Some context about interoperability:
RO-Crate 1.1 uses this JSON-LD schema.org based representation of a license:

{
  "@id": "https://creativecommons.org/licenses/by/4.0/",
  "@type": "CreativeWork",
  "name": "CC BY 4.0",
  "description": "Creative Commons Attribution 4.0 International License"
}

@qqmyers removed the generation of this JSON-LD part from the code and replaced it with the URL only (which is perfectly valid schema.org syntax). The RO-Crate description field is our shortDescription, RO-Crate name stays name, etc. (We might even use our iconUrl for schema.org/RO-Crate thumbnailUrl)

@qqmyers: looking at https://github.com/spdx/license-list-data/blob/master/json/licenses.json, there are licenses marked as deprecated - maybe we need to open an issue at https://github.com/spdx/license-list-XML and talk to them about PDDC being deprecated by upstream (there isn't an issue for this yet).

  • Using the SPDX licenseId field, removing the hyphens to create our/schema.org name field looks fine.
  • Using the SPDX name field to create our shortDescription / schema.org description sounds fine, too. I would not customize this within the data model to be more consistent in UI and exporting places.
  • Looks like the SPDX seeAlso field for a license may provide a good candidate for our uri. Using the SPDX reference without the .html as fallback sounds good. I see this SPDX ID URL is used in RO-Crate, too. (See this example dataset)
  • WRT http vs https: maybe we should change the API code to avoid doubled entries and make the service bean accept both for searches etc. IMHO we should be ready to serve mixed styles, too.
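
The field mapping in the bullets above can be sketched in Python as follows. This is an illustration only, not existing Dataverse code: the function name is hypothetical, the input is assumed to be one entry shaped like those in the SPDX licenses.json file, and picking the first seeAlso URL is a simplification.

```python
from typing import Optional


def spdx_entry_to_license(entry: dict) -> Optional[dict]:
    """Map one SPDX license entry to name/shortDescription/uri, skipping deprecated ones."""
    # Entries flagged as deprecated by SPDX are left out.
    if entry.get("isDeprecatedLicenseId", False):
        return None

    license_id = entry["licenseId"]
    see_also = entry.get("seeAlso") or []
    # Fallback: the SPDX reference URL without its ".html" suffix.
    fallback = entry.get("reference", "").removesuffix(".html")

    return {
        # name: licenseId with hyphens replaced by spaces
        "name": license_id.replace("-", " "),
        # shortDescription: SPDX name, not customized
        "shortDescription": entry["name"],
        # uri: first seeAlso URL if present (simplification), else the fallback
        "uri": see_also[0] if see_also else fallback,
    }
```

Whether the first seeAlso entry is always the issuer's preferred URI would need to be checked per license; a curated override table may still be needed.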

(Future: IMHO it would be great to have a summary in our UI, so people don't need to look at license texts. Maybe grabbing the quick summaries from https://tldrlegal.com helps?)

@philippconzett
Contributor Author

Thanks for your feedback, @poikilotherm! I wasn't aware that RO-Crate had already addressed this issue. My main concern was just to make sure that standard licenses are described in the same way across Dataverse installations.

@philippconzett
Contributor Author

philippconzett commented Mar 22, 2022

When I mentioned that my suggestion was meant to improve interoperability between Dataverse installations and beyond Dataverse installations, I first of all had in mind that license information from Dataverse installations should be made harvestable in a way that complies with recommendations. I'm not sure about the status of RO-Crate, but a standard that is already implemented and widely used is the DataCite Metadata Schema. The current version of this schema, v.4.4 (cf. https://schema.datacite.org/meta/kernel-4.4/), says the following about license information:

| ID | DataCite-Property | Occ | Definition | Allowed values, examples, other constraints |
|---|---|---|---|---|
| 16 | Rights | 0-n | Any rights information for this resource. The property may be repeated to record complex rights characteristics. | Free text. Provide a rights management statement for the resource or reference a service providing such information. Include embargo information if applicable. Use the complete title of a license and include version information if applicable. May be used for software licenses. Examples: Creative Commons Attribution 3.0 Germany License; Apache License, Version 2.0 |
| 16.a | rightsURI | 0-1 | The URI of the license. | Example: https://creativecommons.org/licenses/by/3.0/de/ |
| 16.b | rightsIdentifier | 0-1 | A short, standardized version of the license name. | Example: CC-BY-3.0. A list of identifiers for commonly-used licenses may be found here: (https://spdx.org/licenses/). |
| 16.c | rightsIdentifierScheme | 0-1 | The name of the scheme. | Example: SPDX |
| 16.d | schemeURI | 0-1 | The URI of the rightsIdentifierScheme. | Example: https://spdx.org/licenses/ |

As the license identifier, DataCite requires "a short, standardized version of the license name", and they suggest using the SPDX identifier.
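
As a rough illustration of how these DataCite properties fit together, the following Python sketch serializes license records into a `rightsList` element using the standard library. The function names and the input dictionary layout are assumptions, and the DataCite XML namespace is omitted for brevity, so this is a sketch rather than a schema-valid export.

```python
import xml.etree.ElementTree as ET


def rights_element(rights: dict) -> ET.Element:
    """Build a <rights> element from a dict keyed by DataCite property names."""
    element = ET.Element(
        "rights",
        {
            "rightsURI": rights["rightsURI"],
            "rightsIdentifier": rights["rightsIdentifier"],
            "rightsIdentifierScheme": rights["rightsIdentifierScheme"],
            "schemeURI": rights["schemeURI"],
        },
    )
    # The element text carries the complete license title (property 16).
    element.text = rights["rights"]
    return element


def rights_list_xml(licenses: list) -> str:
    """Serialize a list of license dicts as a <rightsList> fragment (no namespace)."""
    root = ET.Element("rightsList")
    for rights in licenses:
        root.append(rights_element(rights))
    return ET.tostring(root, encoding="unicode")


cc0 = {
    "rights": "Creative Commons Zero v1.0 Universal",
    "rightsURI": "https://creativecommons.org/publicdomain/zero/1.0/",
    "rightsIdentifier": "CC0-1.0",
    "rightsIdentifierScheme": "SPDX",
    "schemeURI": "https://spdx.org/licenses/",
}
print(rights_list_xml([cc0]))
```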

Based on the DataCite recommendations, I've updated the Google spreadsheet (see tab "English v.0.2") and the JSON files for the standard licenses I suggest we should provide on GitHub; see this Google folder.

As far as I can see, none of the standard licenses I suggest we should provide on GitHub are obsolete, so this shouldn't be a show stopper. Also pinging @janvanmansum for feedback.

@philippconzett
Contributor Author

Here are two JSON examples created following the suggested workflow above:

{
  "rightsName": "CC0 1.0",
  "rightsURI": "https://creativecommons.org/publicdomain/zero/1.0/",
  "rightsIdentifier": "CC0-1.0",
  "rightsIdentifierScheme": "SPDX",
  "schemeURI": "https://spdx.org/licenses/",
  "rightsShortDescription": "Creative Commons Zero v1.0 Universal.",
  "rightsIconUrl": "https://licensebuttons.net/p/zero/1.0/88x31.png",
  "rightsActive": true
}

{
  "rightsName": "CC BY 4.0",
  "rightsURI": "https://creativecommons.org/licenses/by/4.0/",
  "rightsIdentifier": "CC-BY-4.0",
  "rightsIdentifierScheme": "SPDX",
  "schemeURI": "https://spdx.org/licenses/",
  "rightsShortDescription": "Creative Commons Attribution 4.0 International.",
  "rightsIconUrl": "https://licensebuttons.net/l/by/4.0/88x31.png",
  "rightsActive": true
}

@qqmyers @pdurbin I guess we might have to change back some of the field names so that this doesn't mess up your current setup, e.g., rightsName >> name?

I don't know what needs to be done to discuss this further, but I'd be happy to contribute as suggested above. For example, if you create a suitable place on GitHub, I could create and upload the JSON files once we've agreed on what they should look like. Thanks!

@pdurbin
Member

pdurbin commented Mar 28, 2022

I don't know what needs to be done to discuss this further

@philippconzett I'm not sure either. Perhaps we can try to make the problem more concrete with a scenario and a screenshot.

Imagine a future where you're harvesting datasets from another Dataverse installation with slightly different names. Also imagine that there's a search facet called "License" that makes these differences obvious at a glance:

[Screenshot, 2022-03-28: mock-up of a "License" search facet]

Once the data is in a facet like this, it's obvious that there's a problem, that counts of the same license should be combined.
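
A small Python sketch of what combining such counts could involve. The alias table, the function names, and the sample labels are all hypothetical; in particular, mapping an unversioned "CC0" to "CC0 1.0" is an assumption, and nothing like this exists in Dataverse as far as this thread shows.

```python
import re
from collections import Counter

# Hypothetical alias table: normalized key -> label to display in the facet.
CANONICAL = {
    "cc0": "CC0 1.0",      # assumption: unversioned CC0 means version 1.0
    "cc01.0": "CC0 1.0",
    "ccby4.0": "CC BY 4.0",
}


def _key(label: str) -> str:
    """Lowercase the label and drop everything except letters, digits and dots."""
    return re.sub(r"[^a-z0-9.]", "", label.lower())


def canonical_label(label: str) -> str:
    """Return the canonical label if the variant is known, else the label unchanged."""
    return CANONICAL.get(_key(label), label)


def merge_facet_counts(counts: dict) -> Counter:
    """Add up facet counts whose labels refer to the same license."""
    merged = Counter()
    for label, n in counts.items():
        merged[canonical_label(label)] += n
    return merged
```

String matching like this only papers over the problem. If every installation stored the same identifier for the same license, as proposed in this issue, the facet could group on that identifier directly.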

@philippconzett
Contributor Author

Thanks, @pdurbin, and sorry for my late reply.

The scenario you described above is definitely an example of what might be an undesired result of the current way of configuring standard licenses. A similar situation could arise in search engines supporting search/filtering based on license information, e.g., in the advanced search of BASE (https://www.base-search.net/Search/Advanced); cf. this mock-up screenshot:

[Image: mock-up screenshot of the BASE advanced search]

In general, I think we should aim at providing license information in line with the recommendations of DataCite.

I'd be happy to create a pull request, but I need some help:

  • Maybe @jggautier could review the proposal above to make sure it's in line with best metadata practices?
  • Where should the JSON files be uploaded? I guess the GDCC GitHub org would be a good place, but which one to choose, https://github.com/gdcc or https://github.com/GlobalDataverseCommunityConsortium? Should we create a repository similar to the https://github.com/GlobalDataverseCommunityConsortium/dataverse-language-packs? @qqmyers?
  • Apart from changing the Configuring Licenses section of the Installation Guide to point users to the new GitHub repository (see previous point), I guess the database will also need to be changed since the license information is stored there? I don't know what these changes might be and how to add them to a pull request, but I guess some new table fields/columns would need to be added, e.g., rightsIdentifier, rightsIdentifierScheme, schemeURI, and probably the name of some existing fields need to be changed, e.g., uri >> rightsURI.

I suggest we make this a prioritized PR because the longer we wait, the more likely it becomes that installations configure multi-license support with the current set-up, which means that they would have to do some clean up to change the license information to be aligned with the standardized way suggested in this issue.

@pdurbin
Member

pdurbin commented Apr 26, 2022

@philippconzett thanks. If the goal is to keep the Dataverse community together perhaps the best place for the JSON files is where they already are, in the main repo. That way, they seem more official, they can be part of the guides, and if the JSON structure needs to evolve (new fields/columns like you say), it can happen in the same pull request as the code and database changes.

I guess what I'm saying is, what if we consider the licenses in the main repo official already? And if we don't like something about them (they need more or different fields), what if we let them evolve in the main repo, at least for a while?

There are currently 453 licenses in your spreadsheet. If we were to start adding more licenses to the main repo, would you want all of them at once? (Do you plan to present all 453 to your users?) A subset? How many? Thanks. For others, here's a link to your spreadsheet: https://docs.google.com/spreadsheets/d/1f_-z6vWijOvIc0tI1ezWeDEgM3U9w5qynllfyNqWYU8/edit?usp=sharing

@philippconzett
Contributor Author

Thanks, Phil!

Keeping the JSON files in the main repo sounds reasonable.

As for the number of licenses/JSON files, I only suggest starting with a small selection, as described above; see point 4 in the first posting. These 28 licenses are all marked with "true" in column M (=active) in the spreadsheet. I have now sorted the spreadsheet to make them appear on top. The JSON files of these licenses are in the folder "JSON files v.0.2" in the shared Google folder: https://drive.google.com/drive/folders/11BF5tZ9K_S0rxrWErFQYgSCX_geQtHtq?usp=sharing.

@jggautier
Contributor

Thanks for pinging me @philippconzett. This issue reminds me of that "things, not strings" saying, which I think is usually used when talking about knowledge graphs, but it makes sense here. I think your idea in this issue will improve the chances that most Dataverse installations will use the same strings to describe the same things.

I'm less sure it would improve interoperability "beyond Dataverse installations". What if, when a Dataverse repository that prefers displaying a "CC-0" license as "CC 0" harvests metadata from a source that uses "CC0", the Dataverse software could figure out that "CC0" is the same thing as "CC-0" and use that when displaying search results (for example, as facets)? Since the Dataverse software doesn't have facets for the Terms metadata, this problem isn't as noticeable now, so maybe we can cross that bridge when we get to it.

@djbrooke
Contributor

djbrooke commented May 9, 2022

Hi all! I hope everyone is doing well.

I noted a similar problem in a different community, and just as a point of information it may be interesting to follow how they solve it: huggingface/datasets#4298

@philippconzett
Contributor Author

Thanks, @jggautier + @djbrooke!

@jggautier I'm not sure I agree with you on interoperability beyond Dataverse installations. In my understanding, the main point with the DataCite Metadata Schema recommendations is to make harvested metadata interoperable. Of course, Dataverse, Dataverse installations or DataCite could create crosswalks/scripts to transform the exposed metadata into the desired DataCite format, but why not make the metadata available in a DataCite-aligned way to start with?

I now realize that starting a discussion like this on GitHub is not a good idea, as only a few people in the community systematically review GitHub issues. I'll raise the issue in the Dataverse Google group, because I think DataCite-aligned metadata is important for many Dataverse installations. Thanks!

@poikilotherm
Contributor

poikilotherm commented May 14, 2022

Please note, as I recently learned, that the DataCite metadata export exposed via OAI-PMH is not valid XML. The export also uses an outdated schema and only a subset of the schema's possibilities (an example is #7077).

I agree with you we should discuss this somewhere else to include more people's views.

@philippconzett
Contributor Author

I've raised the issue in the Dataverse Google group: https://groups.google.com/u/1/g/dataverse-community/c/4qSr0mkcyOw.

@philippconzett
Contributor Author

philippconzett commented May 29, 2022

I'm adding another illustration of why this feature request should be prioritized: Metadata from Dataverse-based repositories are currently not correctly harvested by DataCite. This includes the license information. For comparison, take a DataCite metadata record from, say, Pangaea, e.g., https://search.datacite.org/works/10.1594/pangaea.940188: you can download the metadata in different formats, and you'll find correct license information:

"rightsList": [
  {
    "rights": "Creative Commons Attribution 4.0 International",
    "rightsUri": "https://creativecommons.org/licenses/by/4.0/legalcode",
    "schemeUri": "https://spdx.org/licenses/",
    "rightsIdentifier": "cc-by-4.0",
    "rightsIdentifierScheme": "SPDX"
  }
]

Based on this license information, the metadata are then harvested and indexed in other discovery services, e.g., Primo (see this discussion thread in the Dataverse Google group).

On the other hand, Dataverse-based repositories do not expose license information in the way DataCite expects, and thus the DataCite metadata records from Dataverse-based repositories are lacking license information. Here's an example from DataverseNO, and here's one from DataverseNL (@janvanmansum @4tikhonov), here one from the Australian Data Archive (@stevenmce), here one from Harvard Dataverse (@pdurbin @jggautier), here one from Jülich DATA (@poikilotherm), here one from Odum (@donsizemore), and here one from Scholars Portal (@amberleahey @kaitlinnewson @meghangoodchild). As you see (cf. the DataCite JSON file), the rightsList is empty:

"rightsList": [],

As a result, if you search for data in Dataverse-based repositories in discovery services like Primo, you'll be told that you cannot access these datasets. The reason is that these services don't have access to the license information of these datasets and assume they are not Open Access.
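
A check like the one described here could be sketched in Python as follows. The function name is hypothetical, and the sketch assumes the record is a JSON object with `rightsList` at the top level, as in the two excerpts above; it does not query DataCite.

```python
import json


def has_license_info(datacite_json: str) -> bool:
    """True if the record's rightsList has at least one entry with usable rights data."""
    record = json.loads(datacite_json)
    rights_list = record.get("rightsList") or []
    return any(
        entry.get("rightsUri") or entry.get("rightsIdentifier") or entry.get("rights")
        for entry in rights_list
    )


# Records shaped like the two excerpts above.
with_rights = '{"rightsList": [{"rights": "Creative Commons Attribution 4.0 International", "rightsIdentifier": "cc-by-4.0"}]}'
without_rights = '{"rightsList": []}'

print(has_license_info(with_rights))
print(has_license_info(without_rights))
```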

@qqmyers
Member

qqmyers commented May 29, 2022

Dataverse does not send any rights information to DataCite - I believe it is the same as the datacite.xml metadata export. If we sent what we have now, it would be an improvement.

@pdurbin
Member

pdurbin commented Jan 23, 2023

@philippconzett don't worry, your PR is still on the global backlog board:

@pdurbin
Member

pdurbin commented Apr 11, 2024

JP and I just wrote some guidance on adding licenses in the future: #10426 (comment)

Please take a look and let us know what you think!

@philippconzett
Contributor Author

Closing this issue in favor of #10883.

@github-project-automation github-project-automation bot moved this from 🔍 Interest to Done in Recherche Data Gouv Sep 26, 2024
@github-project-automation github-project-automation bot moved this from High priority to Closed in DataverseNO Sep 26, 2024