Html "\&quot;" in TOU breaks SchemaDotOrg export; and causes a 500 in dataset page. #8224

Open
landreev opened this issue Nov 8, 2021 · 4 comments

Comments

@landreev
Contributor

landreev commented Nov 8, 2021

There are 2 issues in one:

  1. (simple, but important) The method getJsonLd() in DatasetPage.java needs to catch exceptions from the export, so that a failed export, for whatever reason, does NOT kill the page with a 500.
  2. The specific issue of an HTML entity &quot; resulting in a failure to export and cache the jsonLd/SchemaDotOrg format. (Will add the details in the next comment, to keep the description compact.)
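
A minimal sketch of the guard described in item 1, under stated assumptions: the class JsonLdGuard, the method safeJsonLd() and the Supplier standing in for the export call are all hypothetical, not the actual DatasetPage.java code. It only illustrates the pattern of returning an empty string, rather than propagating the exception, when the export fails.

```java
import java.util.function.Supplier;

// Hypothetical helper; the real DatasetPage.getJsonLd() is not shown in this issue.
class JsonLdGuard {

    // Runs the export and returns its output, or "" if the export throws,
    // so the caller can still render the page without the JSON-LD header.
    static String safeJsonLd(Supplier<String> exporter) {
        try {
            String jsonLd = exporter.get();
            return jsonLd != null ? jsonLd : "";
        } catch (RuntimeException ex) {
            // Report the failure but do not let it bubble up as a 500.
            System.err.println("JSON-LD export failed: " + ex.getMessage());
            return "";
        }
    }

    public static void main(String[] args) {
        // A working export passes through unchanged.
        System.out.println(safeJsonLd(() -> "{\"@type\":\"Dataset\"}"));
        // A failing export yields an empty string instead of an exception.
        System.out.println("[" + safeJsonLd(() -> {
            throw new IllegalStateException("export failed");
        }) + "]");
    }
}
```

If the real export declares checked exceptions, the catch clause would need to cover those as well.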

To reproduce - screenshot from @pdurbin:

[Screenshot: Screen Shot 2021-11-08 at 11 55 31 AM]

@landreev
Contributor Author

landreev commented Nov 8, 2021

Where the export error happens:

SchemaDotOrgExporter.java calls version.getJsonLd(); to produce the json string. That part works. However, before caching this output in a file, exportDataset() attempts to parse the string - to validate it, presumably? - and that's where it fails.

What happens at the end of version.getJsonLd() is

jsonLd = job.build().toString();
        
//Most fields above should be stripped/sanitized but, since this is output in the dataset page as header metadata, do a final sanitize step to make sure
jsonLd = MarkupChecker.stripAllTags(jsonLd);
        
return jsonLd;

MarkupChecker.stripAllTags() uses jsoup methods to sanitize the result, and that's where all the &quot; entities are turned into unescaped double quotes, thus invalidating the JSON.

A quick fix would be to add a regex to change &quot; into escaped double quotes (\"), before calling stripAllTags().
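
A rough illustration of both the failure and the quick fix, assuming that the only relevant effect of stripAllTags() here is decoding &quot; into a literal double quote. The method decodeQuotEntity() below is a hypothetical stand-in for that behavior, not the jsoup-based implementation, and the sketch uses a literal String.replace() rather than a regex.

```java
// Hypothetical demo class; names are illustrative only.
class QuotEntityQuickFix {

    // Stand-in for the entity decoding done inside MarkupChecker.stripAllTags().
    // Assumption: &quot; becomes a bare double quote, everything else is untouched.
    static String decodeQuotEntity(String text) {
        return text.replace("&quot;", "\"");
    }

    // The quick fix: rewrite each &quot; as a JSON-escaped quote (backslash + quote)
    // before the string reaches the sanitizer.
    static String escapeQuotEntity(String jsonLd) {
        return jsonLd.replace("&quot;", "\\\"");
    }

    public static void main(String[] args) {
        String jsonLd = "{\"license\":\"Put &quot;ditto&quot; marks around it.\"}";

        // Without the fix: the decoded quotes end the JSON string early.
        System.out.println(decodeQuotEntity(jsonLd));

        // With the fix: the quotes stay escaped, so the JSON still parses.
        System.out.println(decodeQuotEntity(escapeQuotEntity(jsonLd)));
    }
}
```

The first line printed contains bare quotes inside the string value, which is what makes the later parse in exportDataset() fail; the second keeps them as \".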

Calling stripAllTags(); on the generated json string is inherently problematic though. A cleaner way would be to apply it to the individual fields used to cook the json.
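
A sketch of the per-field approach, with several assumptions: stripAllTags() below is a simplified stand-in (a regex tag stripper plus &quot; decoding), and jsonString() is a hand-rolled escaper used only to keep the example self-contained. In the real code the JSON builder behind job.build().toString() would do the escaping. The point is the order: sanitize each value first, then let the JSON layer escape it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical demo class; not the actual DatasetVersion code.
class PerFieldSanitize {

    // Simplified stand-in for MarkupChecker.stripAllTags():
    // drops anything that looks like a tag and decodes &quot;.
    static String stripAllTags(String value) {
        return value.replaceAll("<[^>]*>", "").replace("&quot;", "\"");
    }

    // Minimal JSON string escaping: quote, backslash and control characters.
    static String jsonString(String value) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : value.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Sanitizes every field value first, then serializes, so the sanitizer
    // never sees (and never damages) the JSON syntax itself.
    static String buildJsonLd(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> jsonString(e.getKey()) + ":" + jsonString(stripAllTags(e.getValue())))
                .collect(Collectors.joining(",", "{", "}"));
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("description", "<b>Put</b> &quot;ditto&quot; marks around it.");
        System.out.println(buildJsonLd(fields));
    }
}
```

Because the quotes are decoded before serialization, they are escaped like any other quote in a field value, and no final pass over the finished JSON string is needed.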

From @pdurbin:

I think what doesn’t quite sit right to me is that we have this line:
jsonLd = job.build().toString();
… which should always create properly escaped JSON. But then later we do this:
jsonLd = MarkupChecker.stripAllTags(jsonLd);
… which has the potential to munge the JSON until it’s invalid.

But, seeing how this is the first issue of this kind we've run into - is it worth it? - tbd.

@landreev landreev changed the title from Html "\&quote;" in TOU breaks SchemaDotOrg export; and causes a 500 in dataset page. to Html "\&quot;" in TOU breaks SchemaDotOrg export; and causes a 500 in dataset page. Nov 8, 2021
@pdurbin
Member

pdurbin commented Mar 16, 2022

@landreev
Contributor Author

This struck again, and it cost us some time talking about it on Slack.
I'll push to have this prioritized in the next sprint and make a quick PR fixing it.
Note that it doesn't need to be in TOU; just insert &quot; into the description, or any other field, to reproduce.

@jggautier
Contributor

In Harvard Dataverse, v5.13, I was able to publish a dataset that had the text `Put "ditto" marks around it.` in several metadata fields (title, description, notes, and terms of use). The dataset was published and I was able to view the Schema.org export through the UI.

Does that mean this issue was fixed? I'm not sure if the first of the two issues that @landreev wrote about, about the method getJsonLd() in DatasetPage.java needing to catch an exception from the export, has been resolved.
