Data(base) licenses #102
Comments
If we are not sure, maybe we should postpone this after 1.0 |
@merkys, @giovannipizzi This is marked milestone v1.0. Are we ok to push this to v1.1? I say 'yes'. Right now, if you want to redistribute data you've obtained via OPTIMADE you will have to check with the originator database for license information (their website, email, ...). It would be great to be more helpful than this, but IMO not crucial for 1.0. |
Yes, I agree with @giovannipizzi and @rartino that we can solve this after v1.0. |
I am revisiting this issue after looking at #364, where licensing of archival data is discussed. I would like to build on top of my initial proposal, accommodating @rartino's subsequent comment. Thus:
As for how the license identifier is to be given, one solution would be to make |
@merkys You may also want to include a link to your earlier PR on this (that you eventually closed): #107 To echo my comment there, I'm skeptical of "spreading out" license information - in particular licenses of individual entries, as this creates a way to trick users by, e.g., hiding a single more strictly licensed entry among billions of unrestricted ones. I strongly prefer a model where there is only one single place per database to communicate everything license-related. The licensing communicated there could still be complicated and difficult to deal with, e.g., "all bcc structures are under CC-BY, and fcc under GPL3", but as long as that is communicated in a single place, no surprises are hiding in the data. In case someone doubts that anyone would try to use a feature like this maliciously, I'll link this interesting entry by Cory Doctorow on a similar issue that has appeared due to the stipulations in the earlier CC licenses. |
IMO, having clear machine-readable means to identify licenses at the finest possible grain is meant to solve exactly the problem you are describing. Surely a top-level licensing file is great to have. However, for aggregate databases this might be difficult to achieve (take Wikimedia Commons, for example, which has per-file licenses). Thus my proposal has provisions for both the top-level license and per-record licenses. A license which says "all bcc structures are under CC-BY, and fcc under GPL3" hides surprises: suppose the software or a human misidentifies some corner cases, providing food for copyright/copyleft trolls.
Thanks for the link, really interesting. However, my take-away from this story is that we as a community need better-worded licenses. |
But, how should end-users handle the per-record licenses when fetching big data sets? Doesn't that mean that we have to spend CPU cycles and bandwidth to verify for every individual entry that the license is as expected? I think there is no way to stop people from using dodgy licenses when publishing data (with OPTIMADE or otherwise) like "all bcc structures are under CC-BY, and fcc under GPL3". But, if there is just one place for such dodginess, I can manually check that place, accept or reject it, and act accordingly. In my opinion, aggregate databases should export data under the strictest subset license and reference the sources for more permissive use. |
I do not think I have a strong opinion one way or the other, but if we do work with licences we should also think about how to handle attribution for each Optimade entry. |
@JPBergsma Individual attribution is indeed very important and - as far as I can see:
However, the |
I guess we could use the |
Sure, but I do not think checking a couple of million strings is much nowadays.
Agreed, but I would like to assume good faith here. Surely someone may have a database where "all structures with prime UUIDs are under CC-BY, and proprietary otherwise", but if they provide per-entry licenses, the user will not have to rely on a prime sieve to see what they can use.
This is surely a safe option, but in my opinion this may drive away users from otherwise permissive data. Moreover, the wording has to be really clear to convey the relation between this encompassing strict and overriding permissive license. |
You don't see a problem with saying that the recommended practice for perfectly normal OPTIMADE use like fetching 1M structures to use in an ML project is to retrieve the structures with the individual Do you think any users of OPTIMADE will actually do this in practice?
This is what the Cory Doctorow link was meant to show: this is the one place where we cannot assume everyone is acting in good faith. The original formulations of the CC licenses assumed copyright holders would deal with misattributed copies in good faith - but in response, a whole business popped up trying to get people to misattribute CC-licensed works so they can be extorted/sued. My argument is that it is equally believable that we one day see a business pop up for extorting OPTIMADE users who have accidentally broken a single odd per-entry license. |
No, I do not. Checking 1M strings is still cheaper for a computer than the person-time spent reading and sorting out complicated license texts (I am not advocating for software lawyers, but the most popular licenses should be easy to cite/check). As for the size, we may define
In practice, anyone is free to ignore any license. But I would not recommend to do that.
To me, Cory Doctorow's story says that that particular CC license was a buggy one. Extortion businesses piggybacking on OPTIMADE may arise regardless of whether we add licenses to OPTIMADE responses or not. I believe some (large?) part of OPTIMADE users do not know the licenses of individual databases. "Open" does not imply "free", and this is a great opportunity for the cited extortion businesses. A lack of license must not be understood as equivalent to public domain/CC0 either. By having standardized means to display licenses along with OPTIMADE data we would raise awareness among both users and providers. |
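To make the "checking 1M strings" argument concrete, here is a minimal sketch of what a client-side per-entry check could look like. The field name `_exmpl_license` and the function `split_by_license` are hypothetical (the thread does not settle on a property name), and entries are assumed to be plain dicts with an `attributes` mapping. An entry without a license is treated as not accepted, in line with the point above that a missing license is not the same as public domain.

```python
from collections import Counter

# Hypothetical property name; the discussion does not fix one.
LICENSE_FIELD = "_exmpl_license"


def split_by_license(entries, accepted, field=LICENSE_FIELD):
    """Split entries into (usable, rejected) and count the licenses seen.

    `entries` is an iterable of dicts with an "attributes" mapping;
    `accepted` is a set of license identifiers the user is willing to use.
    """
    usable, rejected = [], []
    counts = Counter()
    for entry in entries:
        license_id = entry.get("attributes", {}).get(field)
        counts[license_id] += 1
        # A missing license (None) is never treated as permissive.
        if license_id in accepted:
            usable.append(entry)
        else:
            rejected.append(entry)
    return usable, rejected, counts


entries = [
    {"id": "1", "attributes": {LICENSE_FIELD: "CC-BY-4.0"}},
    {"id": "2", "attributes": {LICENSE_FIELD: "GPL-3.0-only"}},
    {"id": "3", "attributes": {}},
]
usable, rejected, counts = split_by_license(
    entries, accepted={"CC-BY-4.0", "CC0-1.0"}
)
```

The check is a single pass with one dictionary lookup per entry, so its cost is small next to fetching the data. It does not settle the bandwidth concern or the question of whether users would actually run it.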
During an in-person discussion with @rartino and @ml-evs I became convinced that entries of more restrictive licenses than the main body of data of an implementation belong to a different "sibling" OPTIMADE implementation. There was also a suggestion to add a binary property |
@ml-evs has pointed out that there might be a need to indicate file licenses in |
@blokhin posted this relevant addition to this discussion in #414
and @merkys responded with:
|
Separating the database into several sections according to a license is not really the best option for the MPDS (losing the holistic view). I’d rather support (2.) Specify all licenses and their governed domains in the top-level license file, but this still unfortunately remains ambiguous and not really useful for the consumer. I can create an additional PR for per-entry licensing as an extension of this thread as well as #414. |
I agree with Evgeny here. For databases that obtain their data from multiple sources, it should be possible to set a per entry licence field. It could be just a key that refers to a licence defined at a higher level. |
I think we somewhat touched on this with our solution @merkys /@rartino , that the overall database license can be complicated (i.e. describing subsets under different licences with a full-text description), which cannot be excluded from any OPTIMADE meta response. I would not be against also having per entry licenses (to cover the use case of @blokhin) provided this overarching license already describes the caveats (and has a field for cc-by compatibility as discussed above). |
@blokhin @JPBergsma Are not the relevant use cases covered by implementing PR #414 with a database-wide license specification, and have databases with per-entry split-license describe this license setup in that link? That also means they get a clear place to explain the terms for a database-specific licensing field such as |
Let's put that the data provider MAY use a per-entry |
We might even add an additional validation procedure taking 10 random entries from a provider and checking if their license is the same as declared in the top-level introspection. |
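A rough sketch of the sampling check suggested here, assuming entries are available as dicts and that a per-entry license sits in a hypothetical `_exmpl_license` attribute. The function name `sample_license_mismatches` and the `seed` parameter are illustrative only and not part of any existing validator.

```python
import random

# Hypothetical property name; not defined by the discussion.
LICENSE_FIELD = "_exmpl_license"


def sample_license_mismatches(entries, declared_license, sample_size=10,
                              seed=None, field=LICENSE_FIELD):
    """Return (id, license) pairs from a random sample whose per-entry
    license differs from the license declared at the top level."""
    rng = random.Random(seed)
    entries = list(entries)
    sample = rng.sample(entries, min(sample_size, len(entries)))
    mismatches = []
    for entry in sample:
        license_id = entry.get("attributes", {}).get(field)
        # Entries without a per-entry license are assumed to fall under
        # the declared top-level license.
        if license_id is not None and license_id != declared_license:
            mismatches.append((entry["id"], license_id))
    return mismatches


entries = [
    {"id": str(i), "attributes": {LICENSE_FIELD: "CC-BY-4.0"}}
    for i in range(50)
]
mismatches = sample_license_mismatches(entries, "CC-BY-4.0", seed=0)
```

A sample of 10 can only catch databases where differing licenses are common. It would almost never find the single stricter entry hidden among billions that was raised earlier in the thread.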
I'm re-opening this because it was closed automatically with #414, but I think there are aspects remaining that were not completely settled in the discussions here and there. And, to add to the discussion here - having drilled down into the question of what we are going to put in the fields added with #414 for some of our own datasets, we are going to have some datasets (so, OPTIMADE databases) where:
If nothing changes (i.e. #414 remains in place as it is now) I guess we'll just put the above info as our database-wide license. I note however that there will be no way for aggregators to know that (reasonable) retransmissions of our results are fine. |
During the 2023 workshop a question surfaced about how CC-BY 4.0 requirements are supposed to be met by OPTIMADE aggregators. In particular, there is a need to formally attribute the database from which individual entries are "re-translated". A possible solution is to say that retaining original self-links suffices (is this OK with CC-BY 4.0 terms?), but then self-links have to be either REQUIRED, or added by aggregators. |
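For the "added by aggregators" variant, a minimal sketch could look as follows. It assumes entries are JSON:API-style dicts with `type`, `id` and an optional `links` mapping, and that the source entry URL can be built as `<base URL>/<type>/<id>`. The helper name `with_source_link` and the example base URL are hypothetical. Whether a retained self-link satisfies CC-BY 4.0 attribution is exactly the open question above and is not answered by the code.

```python
def with_source_link(entry, source_base_url):
    """Return a copy of `entry` that keeps its original self-link, or adds
    one pointing back to the source database if none is present."""
    links = dict(entry.get("links") or {})
    if "self" not in links:
        # Assumed URL layout: <base URL>/<entry type>/<entry id>.
        base = source_base_url.rstrip("/")
        links["self"] = f"{base}/{entry['type']}/{entry['id']}"
    # Build a new dict so the original entry is left unmodified.
    return {**entry, "links": links}


entry = {"type": "structures", "id": "42", "attributes": {}}
attributed = with_source_link(entry, "https://example.org/optimade/v1/")
```

An entry that already carries a self-link is passed through unchanged, so the link keeps pointing at the originating database rather than at the aggregator.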
OPTiMaDe responses should contain the data license indicators. The implementation details depend on the scope of data licensing we want to use:
Which option we choose depends on the nature of databases that will use OPTiMaDe. For instance, COD and TCOD contain only public-domain data, so option 1) would be sufficient. What do others think?
To specify licenses in a standard way I suggest using license names (abbreviations) from the SPDX list of commonly used licenses, if we deem it exhaustive enough.
Edit: the SPDX list has only free licenses, and misses public-domain, so it is not exhaustive enough.