Skip to content

Fix NumberFormatException in NPM Package Json Parse #1435

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

zahidblackduck
Copy link
Collaborator

@zahidblackduck zahidblackduck commented May 22, 2025

JIRA Ticket

IDETECT-4721

Description

Issue

The NPM Package Json Parse failed with a NumberFormatException when parsing version strings like ^17.0.0-1551262265873. The numeric suffix was mistakenly treated as a valid version.

Our version parsing logic splits the input string using spaces, ||, or -. This mistakenly accepted long numeric segments (e.g. 1551262265873), which are likely timestamps or hashes, as valid versions due to the regex match. And thus causing a NumberFormatException in the SemVerComparator#compare(..) method.

Fix

Added a helper method isProbableVersion to validate version tokens:

  • Accepts standard semver formats: X, X.Y, X.Y.Z.
  • Rejects purely numeric strings longer than 5 digits based on the intuition that valid version components rarely exceed 5 digits, whereas timestamps or hashes typically do.

@zahidblackduck zahidblackduck changed the title Fix NumberFormatException in NPM Package Json Parse for Long Numeric Pre-release or Timestamp Tags Fix NumberFormatException in NPM Package Json Parse May 22, 2025
@zahidblackduck zahidblackduck self-assigned this May 22, 2025
@@ -96,12 +96,20 @@ private String extractLowestVersion(String value) {
// Remove npm version selection characters that the KB won't match on
.map(part -> part.replaceAll("[>=<~^]", ""))
// Filter out parts that don't match the version pattern
.filter(part -> part.matches("\\d+\\.\\d+\\.\\d+|\\d+\\.\\d+|\\d+"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly trying to understand what is going on before reviewing the changes. My understanding was that if ^17.0.0-1551262265873 came into this lambda that the replaceAll call would remove the ^ and then 17.0.0-1551262265873 would fail the part.matches filter because of the - and never get to the min call.

I must be missing something because I believe multiple devs have confirmed this passes the filter and blows up in the semVerComparator check but I guess I'm wondering a) how is it getting by the filter and b) can we perhaps fix the filter's regex instead of imposing the version character limit?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thoughtful query, David. Here’s what’s actually happening:

  1. Input: ^17.0.0-1551262265873

  2. The string is first split using value.split("\\s+|\\|\\||-"), so it's broken into:

    • "^17.0.0"
    • "1551262265873"
  3. Then .replaceAll("[>=<~^]", "") is applied to each part:

    • "^17.0.0" becomes "17.0.0"
    • "1551262265873" stays the same
  4. Both parts are passed to the version filter. The earlier regex (\\d+\\.\\d+\\.\\d+|\\d+\\.\\d+|\\d+) accepts "1551262265873" because it matches \\d+.

  5. That long numeric value is passed to SemVerComparator, which attempts to parse it using Integer.parseInt — resulting in a NumberFormatException.

To avoid that, I added the isProbableVersion() helper to guard against purely numeric values that are too long (likely timestamps or hashes rather than real versions).

private boolean isProbableVersion(String part) {
// If purely numeric and very long, it's likely a timestamp/hash
if (part.matches("\\d{6,}")) return false;
// Accept X, X.Y, X.Y.Z format
return part.matches("\\d+(\\.\\d+){0,2}");
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good discussion.

I wonder if we should try to differentiate between situations when - is used as a range operator vs when it's part of the version. It seems (but worth double-checking) like if it's a range operator then there must be spaces around it.

Copy link
Contributor

@dterrybd dterrybd May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the very helpful explanation, I had ignored the split call.

I looked at the semver rules and a version must be a non-negative integer. That's not super helpful as an integer is kind of defined by the system and language.

I wonder if instead of limiting to 5 or 6 characters we could try to assign the variable part to an integer and then catch the NumberFormatException and return false if things don't go well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting idea. I'll try this out.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good discussion.

I wonder if we should try to differentiate between situations when - is used as a range operator vs when it's part of the version. It seems (but worth double-checking) like if it's a range operator then there must be spaces around it.

The current implementation doesn't consider hyphen (-) as range operator. Rather it just splits the input version based on space or hyphen. And, if the parts are numeric, then it accepts the minimum value among all, using semantic version comparator.

Copy link
Contributor

@andrian-sevastyanov andrian-sevastyanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved but left another comment upon second thought.

assertEquals("1.2.0", extractor.extractLowestVersion("~1.2.x"));
assertEquals("1.2.3", extractor.extractLowestVersion("1.2.3-beta.2"));
assertEquals("5.0.0", extractor.extractLowestVersion( "5.0.0-alpha+build.1"));
assertEquals("3.0.0", extractor.extractLowestVersion("3.0.0-abc123"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A test like assertEquals("3.0.0", extractor.extractLowestVersion("3.0.0-1")); will fail here because we're treating the - as an operator instead of as part of the version.
This is why I was suggesting treating - as an operator only when it's surrounded by spaces.

Copy link
Collaborator Author

@zahidblackduck zahidblackduck Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the extractLowestVersion method to treat range separators (" - ") and pre-release hyphens ("-") differently by using this split pattern:

String[] parts = value.split("\\s+|\\|\\||\\s-\\s");

This ensures that we only split on version ranges like "1.0.0 - 2.0.0" (as used in npm), space, and logical ORs (||), but preserve valid pre-release versions like "3.0.0-1".

Additionally, the line:

.map(part -> part.replaceAll("[-+].*", ""))

is used to strip off pre-release and build metadata.

So for the test:

assertEquals("3.0.0", extractor.extractLowestVersion("3.0.0-1"));

this still passes because:

  • The input is a single version (not a range), so it’s not split.
  • The pre-release -1 is stripped, resulting in 3.0.0.

References:

Few additional tests were added to validate the changes.

String[] parts = value.split("\\s+|\\|\\||-");
String lowestVersion = Arrays.stream(parts)
// Split the value into parts by spaces, "||", or " - ".
String[] parts = value.split("\\s+|\\|\\||\\s-\\s");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the \\s-\\s part never actually gets matched now because \\s+ will get matched first.
This however works out in the end because on line 100 standalone - characters get transformed to empty string and at line 102 the empty string gets filtered out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants