Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.0 as a string should be a valid version for a dataset #609

Open
pdurbin opened this issue Mar 15, 2024 · 8 comments
Open

1.0 as a string should be a valid version for a dataset #609

pdurbin opened this issue Mar 15, 2024 · 8 comments
Assignees

Comments

@pdurbin
Copy link
Member

pdurbin commented Mar 15, 2024

Dataverse uses MAJOR.MINOR for dataset versions like 1.0, 1.1, 2.0. I believe this should be valid. I'm aware that the validator wants MAJOR.MINOR.PATCH (output below) and that under the Version section, the Croissant 1.0 spec says the same (Semantic Versioning 2.0.0).

I'm also aware that 1.0 as number instead of a string is considered valid (thanks @goeffthomas for pointing this out), but I don't see this in the spec and I think versions should be strings (1.0.0 isn't a number anyway). I see from #593 and #594 that just 1 as a number is also valid but this isn't a solution for Dataverse.

Below is the warning I see as of 1.0.3 from the validator when testing my work-in-progress Croissant file (also below). The relevant line in the JSON is, of course, the following:

    "version": "1.0",

So! Can we consider 1.0 valid?

Thanks for your consideration!

$ mlcroissant validate --jsonld croissant.json
W0315 16:45:48.010244 140704470599424 datasets.py:32] Found the following 1 warning(s) during the validation:
  -  [Metadata(Cars)] Version doesn't follow MAJOR.MINOR.PATCH: 1.0. For more information refer to: https://semver.org/spec/v2.0.0.html
I0315 16:45:48.010346 140704470599424 validate.py:53] Done.
$ 
$ cat croissant.json
{
    "@context": {
        "@language": "en",
        "@vocab": "https://schema.org/",
        "citeAs": "cr:citeAs",
        "column": "cr:column",
        "conformsTo": "dct:conformsTo",
        "cr": "http://mlcommons.org/croissant/",
        "data": {
            "@id": "cr:data",
            "@type": "@json"
        },
        "dataBiases": "cr:dataBiases",
        "dataCollection": "cr:dataCollection",
        "dataType": {
            "@id": "cr:dataType",
            "@type": "@vocab"
        },
        "dct": "http://purl.org/dc/terms/",
        "extract": "cr:extract",
        "field": "cr:field",
        "fileProperty": "cr:fileProperty",
        "fileObject": "cr:fileObject",
        "fileSet": "cr:fileSet",
        "format": "cr:format",
        "includes": "cr:includes",
        "isEnumeration": "cr:isEnumeration",
        "jsonPath": "cr:jsonPath",
        "key": "cr:key",
        "md5": "cr:md5",
        "parentField": "cr:parentField",
        "path": "cr:path",
        "personalSensitiveInformation": "cr:personalSensitiveInformation",
        "recordSet": "cr:recordSet",
        "references": "cr:references",
        "regex": "cr:regex",
        "repeated": "cr:repeated",
        "replace": "cr:replace",
        "sc": "https://schema.org/",
        "separator": "cr:separator",
        "source": "cr:source",
        "subField": "cr:subField",
        "transform": "cr:transform",
        "wd": "https://www.wikidata.org/wiki/"
    },
    "@type": "sc:Dataset",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    "name": "Cars",
    "version": "1.0",
    "citeAs": "https://doi.org/10.5072/FK2/EKY1NP",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "name": "stata13-auto.dta",
            "encodingFormat": "application/x-stata-13",
            "md5": "7b1201ce6b469796837a835377338c5a",
            "sha256": "see md5",
            "contentUrl": "stata13-auto.dta"
        }
    ],
    "license": "http://creativecommons.org/publicdomain/zero/1.0",
    "datePublished": "2024-03-13",
    "dateModified": "2024-03-13"
}
pdurbin added a commit to gdcc/dataverse-exporters that referenced this issue Apr 12, 2024
Note the long comment about the perils of doing this.

Let's hope to get some attention on our appeal to allow 1.0
to be a valid version for a dataset:
mlcommons/croissant#609
@pdurbin
Copy link
Member Author

pdurbin commented Apr 12, 2024

Over at gdcc/dataverse-exporters@3e95c97 I pushed a commit to simply append ".0" to the version of Dataverse datasets, but I'm pretty grumpy about it because it's really suboptimal for us. Here's the comment I left in the code:

  • We append ".0" to our version string to make the Croissant
  • validator happy and to comply with the Croissant spec but we
  • would rather not.
  • Here's what the spec says:
  • "The recommended versioning scheme to use for datasets is
  • MAJOR.MINOR.PATCH, following Semantic Versioning 2.0.0." --
  • https://mlcommons.github.io/croissant/docs/croissant-spec.html#version
  • If a Dataverse user sees 1.0.0 they will hopefully know that they
  • must remove the final ".0" to see the actual version of the
  • dataset they want. Note that as of Dataverse 6.1, if you navigate
  • to 1.0.0 you will be redirected to the latest published version,
  • which is possibly quite different than version 1.0 of the
  • dataset! For example you will be redirected to version 4.0 (as of
  • this writing) if you try to navigate to
  • https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP&version=1.0.0
  • This is suboptimal, of course. The API will simply show the error
  • "Illegal version identifier '1.0.0'" if you try to access version
  • 1.0.0 of a dataset via API. Here's an example
  • https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0.0?persistentId=doi:10.7910/DVN/TJCLKP
  • Either way, a version of 1.0.0 is not helpful to a Dataverse
  • user. The GUI may send you to an old version. The API will bark
  • at you that version is illegal.
  • We opened 1.0 as a string should be a valid version for a dataset #609 to
  • argue that 1.0 is a perfect valid version for a dataset.
  • Hopefully in the future we can stop appending ".0"!
  • By the way, it's technically possible to make the Croissant
  • validator happy by passing 1.0 as a float. However, this is not
  • compliant with the Croissant spec which (again) wants 1.0.0. We
  • are sticking with strings because they are a more sane way to
  • represent a version and they're what SemVer uses, even allowing
  • "-alpha", etc.

@pdurbin
Copy link
Member Author

pdurbin commented May 2, 2024

Today I played around with where to put the real version of a Dataverse dataset (e.g. "1.0") but I can't find a good place without triggering schema.org violations.

Again, users need to real version number if they want to use Dataverse APIs successfully, as described in my comment above.

Right now my Croissant implementation appends an extra ".0" but I'd rather just put "1.0" and be able to configure the Croissant validator to ignore the warning, so I just opened this related issue:

@benjelloun
Copy link
Contributor

@ccl-core Can you please take a look at this issue?

@ccl-core ccl-core self-assigned this May 16, 2024
@ccl-core
Copy link
Contributor

Hi @pdurbin , thank you for reporting this issue!

Just to clarify, 1.0 is a valid version: the validation script is just throwing a warning.

The mlcroissant library takes already care of the version casting, so you should be able to use your croissant file with 1.0 version.

Unless your issue is about the warning itself: do you think that the warning message is wrong or incomplete?

@pdurbin
Copy link
Member Author

pdurbin commented May 17, 2024

@ccl-core thanks for getting back to me.

First, in terms of validity, the spec implies that ANY string is valid, right? "foobar" and "supercalifragilisticexpialidocious" are both valid versions. This is because version in schema.org can be "Number" or "Text".

The table below summarizes my understanding of the how the validator and the spec treat the following values. (I put strings in quotes and left quotes off numbers.)

Current state

Value Validator Spec
"1.0.0" no warnings ✅ recommended ✅
1.0 or 1 no warnings ✅ valid 🤷‍♀️
"1.0" warning ❌ valid 🤷‍♀️
"foobar" warning ❌ valid 🤷‍♀️

This feels off to me. I think a step in the right direction would be to allow the validator to not throw a warning for "1.0" (as a string), just like how it doesn't throw a warning for 1 or 1.0 (as numbers). Further, perhaps the spec should be updated to mention that 1 and 1.0 (as numbers) get special treatment (at least by the validator). And perhaps "1.0" (as a string) would get the same special treatment in the future. Here's an updated table:

Option 1: treat "1.0" like 1.0

Value Validator Spec
"1.0.0" no warnings ✅ recommended ✅
1.0 or 1 no warnings ✅ special treatment mentioned 👀
"1.0" no warnings ✅ special treatment mentioned 👀
"foobar" warning ❌ valid 🤷‍♀️

If you're thinking, "just use 1.0 as a number instead of a string", the reason I'm reluctant to do this is that if Dataverse ever does switch from MAJOR.MINOR to MAJOR.MINOR.PATCH (as Croissant recommends) it would mean switching the type from number to string, which would be a breaking change for our users. I'd rather use strings in our Croissant export from the start so we can more easily switch from "1.0" to "1.0.0" in the future. I think we all know that in the real world, versions are usually strings, not numbers. SemVer is a good example.

Option 2: pass a flag to ignore version warnings

Another perfectly valid option could be to allow people like me who don't want to see the warning about version to pass a flag to the validator to suppress the "Version doesn't follow MAJOR.MINOR.PATCH" warning. I'm happy to see you're working on this (thanks!):

I imagine it will resolve the issue I opened:

With this option, I still think it would be good to update the spec to explain that 1 and 1.0 are given special treatment (no warnings shown):

Value Validator Spec
"1.0.0" no warnings ✅ recommended ✅
1.0 or 1 no warnings ✅ special treatment mentioned 👀
"1.0" warning ❌ valid 🤷‍♀️
"foobar" warning ❌ valid 🤷‍♀️

I hope this helps! Thanks again!

@benjelloun
Copy link
Contributor

Drive-by comment: I dislike the fact that versions can be numbers... As numbers, there is no difference between 1 and 1.0, even though we may consider them to be different values in terms of versions.

For the next version of Croissant, I wonder if we should constrain version to always be a string and recommend that it follows one of the patterns: "x", "x.y" or "x.y.z", where x y and z are integers, and maybe allow other strings but issue a (suppressible) warning if it doesn't follow one of these patterns.

What do you all think?

@pdurbin
Copy link
Member Author

pdurbin commented May 17, 2024

@benjelloun yeah, it sounds like we're in agreement that versions should be strings, not numbers.

In the schema.org world, I'm not familiar with how best to constrain a field like version that allows both Text and Number. I suppose we could have cr:version that has its own rules.

If we want to stick with SemVer, please keep in mind that it's not just integers and dots (periods). You'll see examples like "1.0.0-alpha" and "1.0.0+21AF26D3" in the spec (under "MAY" for pre-release and build metadata).

@benjelloun
Copy link
Contributor

We can constrain sc:version by changing its documentation in our spec, and validate the field for datasets that specify that they comply with Croissant 1.0.

Indeed, for SemVer we would need some regexp like https://semver.org/#is-there-a-suggested-regular-expression-regex-to-check-a-semver-string

ccl-core added a commit that referenced this issue May 22, 2024
Remove warning when version doesn't conform to MAJOR.MINOR.PATCH

Ref. ##609
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants