How much should whitespace matter? #93

Open
workingjubilee opened this issue Jan 31, 2024 · 1 comment

workingjubilee commented Jan 31, 2024

Deeply open-ended question. The following file is a direct copy of https://spdx.org/licenses/AGPL-1.0.html made "by hand" (right-click, copy, paste), yet askalono id scores it only 0.999 instead of the 1.0 you get from printing the extract from the JSON: LICENSE-RIGHTCLICK.txt

It's not clear to me which is the canonical version and thus which is (arguably) a license violation. It's also not clear to me that askalono should fudge the line breaks here. It's also not clear to me that askalono should NOT fudge the line breaks here.

workingjubilee commented Jan 31, 2024

I can't find an option in the library API to massage newline differences like this one. I think one might be worth having, as an option on top of the existing behavior where the return value is a ratio reflecting how well the text scores as a match.
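To make the idea concrete, here is a minimal sketch of what an opt-in "massage the newlines" step could look like. It is not askalono's code or API: `collapse_whitespace`, `similarity`, and the `normalize` flag are hypothetical names, and Python's `difflib.SequenceMatcher` stands in for askalono's real scoring, which works differently. The sample strings are made up for illustration.

```python
import difflib
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()


def similarity(a: str, b: str, *, normalize: bool = False) -> float:
    """Return a ratio in [0, 1]; hypothetical stand-in for a license match score.

    With normalize=True, whitespace differences are ignored before scoring.
    """
    if normalize:
        a, b = collapse_whitespace(a), collapse_whitespace(b)
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


# Made-up excerpts that differ only in how many newlines separate paragraphs.
json_style = "This License applies to any program.\n\nYou may copy it."
pasted = "This License applies to any program.\nYou may copy it."

print(similarity(json_style, pasted))                  # slightly below 1.0
print(similarity(json_style, pasted, normalize=True))  # 1.0
```

Keeping the normalization behind a flag leaves the default score strict, so callers who care about exact line breaks would see no change in behavior.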

That said, the original issue seems to be a problem in the underlying data: SPDX's HTML and JSON renderings differ subtly in how they emit spaces.
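One way to check whether two renderings differ only in whitespace, and where, is sketched below. The helper names `split_keeping_gaps` and `whitespace_mismatches` are hypothetical, and the sample strings are invented; they are not taken from the actual SPDX HTML or JSON output.

```python
import re


def split_keeping_gaps(text: str):
    """Return (tokens, gaps), where gaps[i] is the whitespace after tokens[i]."""
    parts = re.split(r"(\s+)", text.strip())
    return parts[0::2], parts[1::2]


def whitespace_mismatches(a: str, b: str):
    """List (token, gap_in_a, gap_in_b) wherever the whitespace after a token differs.

    Returns None when the two texts differ in more than whitespace.
    """
    tokens_a, gaps_a = split_keeping_gaps(a)
    tokens_b, gaps_b = split_keeping_gaps(b)
    if tokens_a != tokens_b:
        return None
    return [
        (tokens_a[i], gap_a, gap_b)
        for i, (gap_a, gap_b) in enumerate(zip(gaps_a, gaps_b))
        if gap_a != gap_b
    ]


# Invented stand-ins for the two renderings of the same license text.
html_like = "alpha beta\ngamma"
json_like = "alpha beta\n\ngamma"

for token, gap_html, gap_json in whitespace_mismatches(html_like, json_like):
    print(f"after {token!r}: {gap_html!r} vs {gap_json!r}")
```

Running something like this over the pasted file and the JSON extract would show whether the 0.999 comes purely from line breaks or from some other difference.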
