Improve/update Schema.org JSON-LD export #7349
For adding conditionsOfAccess to the file metadata, should the values be binary, e.g. open and closed, like the following?:
|
@jggautier if you plan to use binary, maybe this property is more appropriate? https://schema.org/isAccessibleForFree |
That makes sense to me! I think that if we use that property this way, since you've been following this issue closely, you (and the tools you're helping develop) will know what it means for a file to be "isAccessibleForFree". Hopefully others who need to use this metadata will also be able to figure out how it's being used. The Google Research group writes on page 3 of their "Google Dataset Search by the Numbers" article that the property "is a boolean value that indicates whether or not the dataset requires a payment", but then they describe how Google Dataset Search interprets a True value to mean "open" and similar to any of the "Creative Commons and open government licenses". So I think it's fair to expect that their interpretation, applied at the dataset level, should be applied at the file level, too, right? So it shouldn't be hard for others who need to use this metadata to figure out that a file flagged as "isAccessibleForFree" is open to some degree, although the exact degree (programmatic access to the file) might not be apparent by just looking at the metadata. |
what is the @type? btw, the tool accepts both schema.org properties (isAccessibleForFree, conditionsOfAccess) that may be used to indicate the access level of a dataset. |
Isn't "DataDownload" the @type? I meant more that if I was looking to use the metadata to build a tool or query the repository and saw isAccessibleForFree: True (or False) in the datasets' Schema.org metadata, I wouldn't know what that means exactly. For example, you mentioned earlier that PANGAEA uses isAccessibleForFree and I can see it in the schema.org metadata for this dataset, but to figure out what that means, I'd have to find information that's not present in the metadata itself. The page for that PANGAEA dataset says I need to be logged in to download the data, but PANGAEA says elsewhere that downloading most of their datasets' files doesn't require login, like the dataset at https://doi.pangaea.de/10.1594/PANGAEA.921541, whose Schema.org metadata has isAccessibleForFree: True. So now I'm thinking that isAccessibleForFree is True for PANGAEA datasets if I don't have to log in to download the data. But I can't determine this by just looking at the Schema.org metadata. Does this make the metadata less FAIR? The definition of the isAccessibleForFree property doesn't define what free means. But maybe it's okay to expect people who need to programmatically determine a file's access level to do a little investigation into what free means in this context, or, if it's already common practice to use isAccessibleForFree the way we've proposed (PANGAEA, and maybe other repositories, seem to be using it this way already) it's okay to expect that people should assume that when data repositories use isAccessibleForFree for data files, that means either there are one or more barriers to accessing the file (isAccessibleForFree:False) or there are no barriers (isAccessibleForFree:True). |
yup, Thing > CreativeWork > MediaObject > DataDownload, so the property can be used with DataDownload. |
For PANGAEA, all public datasets are set with isAccessibleForFree = True; the remaining, restricted datasets (embargoed, requires login) are set to False. In addition to this property, we also use the 'conditionsOfAccess' property to communicate the data access level. @ashepherd, can you please let us know how you specify the data access level at science-on-schema.org? |
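To make the convention discussed above concrete, here is a minimal sketch of how a metadata consumer might read isAccessibleForFree on a DataDownload node. It assumes the reading described in this thread (True means no barriers, False means one or more barriers such as login or embargo). The function names, the sample file name, and the sample conditionsOfAccess value are hypothetical and are not taken from PANGAEA, Dataverse, or F-UJI.

```python
def is_free(value):
    """Normalize an isAccessibleForFree value to True, False, or None."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ("true", "false"):
            return lowered == "true"
    return None  # missing or unrecognized value


def access_level(download):
    """Classify a DataDownload node as 'open', 'restricted', or 'unknown'.

    Assumption: True means no barriers and False means at least one
    barrier, as proposed in this thread. Schema.org itself does not
    define what "free" means.
    """
    free = is_free(download.get("isAccessibleForFree"))
    if free is None:
        return "unknown"
    return "open" if free else "restricted"


example = {
    "@type": "DataDownload",
    "name": "data.tab",  # hypothetical file name
    "isAccessibleForFree": True,
    "conditionsOfAccess": "unrestricted",  # hypothetical free-text value
}
print(access_level(example))
```

The sketch also shows the limitation raised above: the result says only that some barrier exists, not which one, so a consumer would still need conditionsOfAccess or repository documentation to learn more.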
Speaking of science-on-schema.org, RDA's Research Metadata Schemas WG announced updated guidelines from the ESIP Schema.org cluster for using Schema.org to describe data. It's at https://github.com/ESIPFed/science-on-schema.org (and is summarized in the RDA WG's own report). Guidelines for describing datasets specifically are at https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md. From a quick look, it seems like the guide includes ways to add metadata that Dataverse isn't mapping to its Schema.org export, and it uses different elements and structures to include more metadata. When we tackle this issue (updating Dataverse's Schema.org export), I think we should learn how well these guidelines line up with FAIRsFAIR's testing tools. |
@jggautier I skimmed through the guidelines. The recommended fields are already considered by F-UJI when evaluating a dataset, except (1) catalog and (2) linking physical samples to a dataset. In any case, I will cross-check the schema.org mappings captured as part of the tool against the recommendations from ESIP. @huberrob |
DataONE hosted a community call on "Science on Schema.org Guidelines and Experiences" (https://www.dataone.org/community-calls/soso/). Collaborative notes from the meeting are posted at https://github.com/DataONEorg/community-calls/blob/master/notes/20210401_call_notes.md. |
Just adding that the license's "@type" should be "CreativeWork", not "Dataset", based on our Rich Results Test. https://support.google.com/webmasters/thread/146534613?hl=en&msgid=146553381#action=helpful |
Some additional things that we're finding based on Google's validation:
We'd be interested in working on all of these. I think the only contentious one is #5029, so if we could come to a decision on what to do there we could wrap this all in one PR |
@jggautier and I have what we think is a good way forward on #5029, so I think this is pretty doable and we'll try to put it onto our roadmap at QDR. |
Related (possibly a duplicate or sub-issue): |
In a meeting with folks from the FAIRsFAIR group (namely @kitchenprinzessin3880) who are building and testing tools to assess the "FAIRness" of datasets in Dataverse repositories (https://www.fairsfair.eu/fairsfair-data-object-assessment-metrics-request-comments), some changes were recommended for the metadata that Dataverse includes in the Schema.org JSON-LD metadata it exports for datasets. I said I'd open a GitHub issue so we could record and explain these changes.
For the license property, use the @type "CreativeWork" and use "name" instead of "text":
As of Dataverse 5.1.1, the @type for the Schema.org property "license" is "Dataset". Here's an example of what that looks like:
or if CC0 is waived:
Google's guide for describing datasets with Schema.org says to use the "CreativeWork" @type for license and use "name".
Here's an example of what the license metadata in the Schema.org export might look like when a pull request for this issue is merged (after the "multiple license" work described at #7440 and #7742 is also merged):
If the dataset depositor chooses a license from the list of licenses:
Or if no license is chosen and a custom license is entered:
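As a rough illustration of the two license cases above, here is a minimal sketch that builds the "license" value with the CreativeWork type and "name" rather than "text". It is not Dataverse's exporter code. The helper name, the use of a "url" key for listed licenses, and the custom license wording are assumptions made for the example.

```python
import json


def license_node(name, url=None):
    """Build a Schema.org license value typed as CreativeWork.

    Uses "name" (not "text"), as recommended in this issue. The optional
    "url" key is an assumption about how a listed license is identified.
    """
    node = {"@type": "CreativeWork", "name": name}
    if url:
        node["url"] = url
    return node


# A license chosen from the list of licenses.
listed = license_node(
    "CC0 1.0",
    "https://creativecommons.org/publicdomain/zero/1.0/",
)

# A custom license entered by the depositor (hypothetical wording).
custom = license_node("Custom terms entered by the depositor")

print(json.dumps({"license": listed}, indent=2))
print(json.dumps({"license": custom}, indent=2))
```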
For files (in the "distribution" property):
As of Dataverse 5.1.1, here's an example of what the file metadata in the Schema.org export looks like:
Here are the changes related to file metadata being proposed in this GitHub issue:
Use "encodingFormat" instead of "fileFormat":
Google's guide for describing datasets with Schema.org says to use the property "encodingFormat" (doesn't mention using the "fileFormat" property)
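A minimal sketch of the rename, assuming each file entry in "distribution" is a plain dictionary. The function name and the sample file entry are hypothetical.

```python
def use_encoding_format(file_node):
    """Return a copy of a file entry with fileFormat renamed to encodingFormat."""
    updated = dict(file_node)  # leave the caller's dictionary untouched
    if "fileFormat" in updated:
        value = updated.pop("fileFormat")
        # Keep an existing encodingFormat if one is already present.
        updated.setdefault("encodingFormat", value)
    return updated


old_entry = {
    "@type": "DataDownload",
    "name": "data.csv",  # hypothetical file name
    "fileFormat": "text/csv",
}
print(use_encoding_format(old_entry))
```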
contentUrl should always be added:
As of Dataverse 5.1.1, Dataverse puts each file's "download URL" in Schema.org's contentUrl property only if the file isn't restricted and its dataset has no guestbook or Terms of Use metadata. (See details about the current logic at #4371 (comment), "As a researcher, I want more dataset metadata in schema.org exports so that my data is more discoverable".)
Instead, Dataverse should always include every file's "download URL" in Schema.org's contentUrl property. Then if the file is restricted or its dataset has a guestbook or Terms of Access metadata, the download URL will return the access restricted error that it returns now.
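The difference between the current and proposed behavior can be sketched as two small functions. This is an illustration of the logic as described in this issue, not Dataverse's actual code. The function names, parameter names, and the example download URL are hypothetical.

```python
def content_url_current(download_url, restricted, has_guestbook, has_terms):
    """Current behavior as described: omit the URL if any barrier applies."""
    if restricted or has_guestbook or has_terms:
        return None
    return download_url


def content_url_proposed(download_url, restricted, has_guestbook, has_terms):
    """Proposed behavior: always include the URL.

    The barrier flags are accepted but ignored on purpose. Access control
    is left to the download endpoint, which returns an error when needed.
    """
    return download_url


url = "https://dataverse.example.edu/api/access/datafile/42"  # hypothetical
print(content_url_current(url, restricted=True, has_guestbook=False, has_terms=False))
print(content_url_proposed(url, restricted=True, has_guestbook=False, has_terms=False))
```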
Add conditionsOfAccess to declare that a file is open or restricted:
@kitchenprinzessin3880 pointed to two vocabularies whose terms we might consider using as values for conditionsOfAccess, to indicate how accessible the file is: https://guidelines.openaire.eu/en/latest/literature/field_accesslevel.html and http://vocabularies.coar-repositories.org/documentation/access_rights.
Each vocab defines four terms. I've written in #5920 ("Access Rights metadata in OpenAIRE metadata export is being misapplied") about current problems Dataverse has with using the Access Rights terms from the info:eu-repo namespace, so I'm hesitant to use those terms. To put it briefly, Dataverse has files that are restricted using Dataverse's file restriction feature and for which the "File Request" feature is disabled, but the depositor uses a process outside of Dataverse to manage access to the file. So the file is restricted, not "closedAccess," even though people aren't able to request access to the file through Dataverse's "File Request" feature. Most of the datasets in Harvard Dataverse's Murray collections are like this (e.g. there's a process outside of the Dataverse software for requesting access to restricted files in https://doi.org/10.7910/DVN/0PMZC6). Maybe we can discuss that in this issue.
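One possible mapping that avoids the "closedAccess" problem described above is sketched here. It uses the info:eu-repo access-level terms from the OpenAIRE guidelines. The function name and parameters are hypothetical, and the choice never to emit closedAccess is an assumption drawn from the reasoning in this issue, not a decision that has been made.

```python
EU_REPO = "info:eu-repo/semantics/"


def conditions_of_access(restricted, file_request_enabled=False):
    """Pick a conditionsOfAccess value for a file.

    file_request_enabled is deliberately not used to downgrade a file to
    closedAccess: a depositor may manage access requests outside of
    Dataverse, so a restricted file is reported as restrictedAccess
    whether or not the "File Request" feature is on.
    """
    if restricted:
        return EU_REPO + "restrictedAccess"
    return EU_REPO + "openAccess"


print(conditions_of_access(restricted=True, file_request_enabled=False))
print(conditions_of_access(restricted=False))
```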
Here's an example of what the file metadata in the Schema.org export might look like when a pull request for this issue is merged: