Add Schema.org or Croissant metadata to header of Dataset view page #350

Open
ekraffmiller opened this issue Mar 19, 2024 · 5 comments · May be fixed by #412
Labels
FY24 Sprint 26; GREI Re-arch (GREI re-architecture-related); pm.GREI-d-2.7.1 (NIH, yr2, aim7, task1: R&D UI modules for creating datasets and supporting publishing workflows); pm.GREI-d-2.7.2 (NIH, yr2, aim7, task2: Implement UI modules for creating datasets and publishing workflows); Size: 10 (A percentage of a sprint. 7 hours.); SPA: Dataset page (View); Waiting

Comments

@ekraffmiller
Contributor

ekraffmiller commented Mar 19, 2024

Currently the JSF Dataset page has schema.org info embedded in the header, which in the future may be replaced with Croissant. The SPA version of the page has to replicate this.
Here is what it looks like in the JSF Header:

<script type="application/ld+json">{"@context":"http://schema.org","@type":"Dataset","@id":"https://doi.org/10.5072/FK2/SCYB0O","identifier":"https://doi.org/10.5072/FK2/SCYB0O","name":"Testing embargo","creator":[{"@type":"Person","givenName":"Guillermo","familyName":"Portas","name":"Portas, Guillermo"}],"author":[{"@type":"Person","givenName":"Guillermo","familyName":"Portas","name":"Portas, Guillermo"}],"datePublished":"2024-03-14","dateModified":"2024-03-14","version":"1","description":"test","keywords":["Business and Management"],"license":"http://creativecommons.org/publicdomain/zero/1.0","includedInDataCatalog":{"@type":"DataCatalog","name":"Root","url":"https://beta.dataverse.org"},"publisher":{"@type":"Organization","name":"Root"},"provider":{"@type":"Organization","name":"Root"},"distribution":[{"@type":"DataDownload","name":"dataverse_files (2).zip","encodingFormat":"application/zip","contentSize":4540,"contentUrl":"https://beta.dataverse.org/api/access/datafile/26133"},{"@type":"DataDownload","name":"FilesIT.java","encodingFormat":"text/x-java-source","contentSize":154657,"contentUrl":"https://beta.dataverse.org/api/access/datafile/26132"}]}
</script>

The Dataverse API for getting this uses the exporter, for Schema.org:
https://beta.dataverse.org/api/datasets/export?exporter=schema.org&persistentId=doi:10.5072/FK2/SCYB0O
And for Croissant format:
https://beta.dataverse.org/api/datasets/export?exporter=croissant&persistentId=doi:10.5072/FK2/SCYB0O
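
A minimal sketch of how the SPA might build these export URLs, assuming the same `/api/datasets/export` path and query parameters shown above. The names `Exporter` and `buildExportUrl` are hypothetical and not part of any Dataverse client library.

```typescript
// Exporter names as they appear in the two URLs above.
type Exporter = "schema.org" | "croissant";

// Hypothetical helper: builds the export URL for one dataset.
// `baseUrl` is the installation root, e.g. "https://beta.dataverse.org".
function buildExportUrl(baseUrl: string, exporter: Exporter, persistentId: string): string {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  // Percent-encode the id, but keep ":" and "/" readable as in the URLs above
  // (both characters are allowed unescaped in a query string).
  const id = encodeURIComponent(persistentId).replace(/%3A/g, ":").replace(/%2F/g, "/");
  return `${root}/api/datasets/export?exporter=${encodeURIComponent(exporter)}&persistentId=${id}`;
}
```

For example, `buildExportUrl("https://beta.dataverse.org", "croissant", "doi:10.5072/FK2/SCYB0O")` yields the Croissant URL above.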

To test rich results, use the Google Rich Results Test: https://search.google.com/test/rich-results

@ekraffmiller ekraffmiller added pm.GREI-d-2.7.1 NIH, yr2, aim7, task1: R&D UI modules for creating datasets and supporting publishing workflows pm.GREI-d-2.7.2 NIH, yr2, aim7, task2: Implement UI modules for creating datasets and publishing workflows SPA: Dataset page (View) labels Mar 19, 2024
@pdurbin
Member

pdurbin commented May 7, 2024

One concern we have is what to do when the schema.org or croissant files get large, such as 7 MB for a dataset with 25k files. These issues are related:

Also, in JSF we show the schema.org version unless the croissant jar file is present:

I wrote some docs about this in an (open) pull request:
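
On the size concern: one possible guard, which this thread does not settle on, is to measure the exported JSON-LD before embedding it and skip the embed when it is too large. The helper names and the 1 MB budget in the sketch below are assumptions chosen for illustration, not values from Dataverse.

```typescript
// Hypothetical budget; the thread only mentions 7 MB as a problematic size.
const MAX_EMBED_BYTES = 1000000;

// Counts the bytes a string occupies when encoded as UTF-8.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const cp = text.codePointAt(i) as number;
    if (cp > 0xffff) i++; // astral code point: skip its second UTF-16 unit
    bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
  }
  return bytes;
}

// Returns true when the JSON-LD is small enough to place in the page head.
function shouldEmbedJsonLd(jsonLd: string, maxBytes: number = MAX_EMBED_BYTES): boolean {
  return utf8ByteLength(jsonLd) <= maxBytes;
}
```

In a browser, `new TextEncoder().encode(text).length` gives the same count; the manual loop only keeps the sketch free of environment-specific globals.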

@g-saracca g-saracca added the Size: 10 A percentage of a sprint. 7 hours. label May 22, 2024
@g-saracca
Contributor

g-saracca commented May 22, 2024

For a quick proof of concept, we could insert the expected script (hardcoded, with type="application/ld+json") into the head of the single index.html that serves the SPA.
Doing this from the home (Collection) page, in a useEffect that runs only once, would simulate how the script is inserted into the head after the SPA JavaScript has loaded, and would let us confirm with the Google Rich Results Test whether the script is detected.

As a second step, once we know the script is detected, we should read the persistentId from the URL of the Dataset page, fetch the export endpoint mentioned above with that persistentId, and insert the result into a script of type "application/ld+json" in the head of the HTML.
When the user navigates away from the page, the cleanup function returned by the useEffect, which runs when the component/page is unmounted, should delete the script. (Only do this when we are not on a mobile device; for now that could be detected very simply through the screen width.)

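useEffect(() => {
  let script = null
  let cancelled = false
  // Assumes fetchToLoadScript() returns a Promise of the JSON-LD string
  fetchToLoadScript().then((jsonLd) => {
    if (cancelled) return // unmounted before the fetch finished
    // Insert the script into the head of the document
    script = document.createElement('script')
    script.type = 'application/ld+json'
    script.textContent = jsonLd
    document.head.appendChild(script)
  })
  return () => {
    // Remove the script from the head of the document
    cancelled = true
    if (script) script.remove()
  };
}, []);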
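
The first step of the second approach, reading the persistentId from the page URL, might look like the sketch below. It assumes the SPA Dataset page carries the identifier in a `persistentId` query parameter, as the export endpoint does; `extractPersistentId` is a hypothetical name.

```typescript
// Hypothetical helper: returns the dataset's persistent id from a URL query
// string such as window.location.search, or null when it is absent or empty.
function extractPersistentId(search: string): string | null {
  const id = new URLSearchParams(search).get("persistentId");
  return id ? id : null;
}
```

Inside the component this would be called as `extractPersistentId(window.location.search)`; when the result is null, no script needs to be inserted.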

@ekraffmiller
Contributor Author

beta.dataverse.org has been updated with a robots.txt that allows all crawlers, so https://beta.dataverse.org is now being crawled successfully, but individual dataset pages are not being indexed by Google. See this page for the Rich Results test: https://search.google.com/test/rich-results/result?id=XS1bhHFD7CEtXP5vHMIxog. Putting it back in This Sprint for further investigation, since it's a lower priority for Q2.

@ekraffmiller ekraffmiller removed their assignment Jun 11, 2024
@cmbz cmbz added FY24 Sprint 26 FY24 Sprint 26 GREI Re-arch GREI re-architecture-related labels Jun 20, 2024
@g-saracca g-saracca self-assigned this Jul 2, 2024
@g-saracca
Contributor

Moving it to the backlog due to a problem with the server configuration for SPA redirection.
Currently, navigating directly to any SPA URL other than the main /spa/ returns the index.html document, but with a 404 status.
This happens because the web.xml in the frontend repo under deployments/payara/ treats URLs that don't correspond to an actual file or folder as an error page and returns index.html with a 404 Not Found status, which makes those pages not crawlable.

  <error-page>
    <error-code>404</error-code>
    <location>/index.html</location>
  </error-page>

This problem must be solved in order to return to this issue.
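
The behavior a fix would need can be sketched as a routing decision: deep links into the SPA get index.html with a 200, while requests for assets that are really missing still get a 404. This is an illustration only; the function name and the file-extension heuristic are assumptions, and the actual fix would live in the Payara/web.xml configuration rather than in TypeScript.

```typescript
interface SpaResponse {
  status: 200 | 404;
  file: string | null; // file to serve, or null when nothing is served
}

// Hypothetical routing rule for paths relative to the SPA root.
function resolveSpaRequest(path: string, existingFiles: ReadonlySet<string>): SpaResponse {
  // Real files are served as they are.
  if (existingFiles.has(path)) return { status: 200, file: path };

  // Assumption: a last path segment with a file extension is an asset request.
  const lastSegment = path.split("/").pop() ?? "";
  if (/\.[a-z0-9]+$/i.test(lastSegment)) return { status: 404, file: null };

  // Anything else is treated as a client-side route: serve the SPA shell
  // with a 200 so that crawlers do not discard the page.
  return { status: 200, file: "/index.html" };
}
```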

@g-saracca g-saracca removed their assignment Jul 3, 2024
@cmbz

cmbz commented Jul 10, 2024

2024/07/10

  • Removing the On Hold status and moving back to SPA classification
