
Conversation

@Kcruz28 (Contributor) commented Dec 18, 2025

This PR adds a check for duplicate keys in forked_projects.json. I added a new function, find_duplicate_json_keys, which scans the raw JSON file to detect duplicate keys before it is parsed. I then updated run_checks_sort_fp to report an error if any duplicates are found. This helps catch cases where duplicate keys would otherwise be silently overwritten, making the checks more reliable.

Example error:
ERROR: On file format_checker/forked-projects.json: Duplicate in forked-projects.json keys detected: ['https://github.com/apache/incubator-shardingsphere']
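For context, a minimal sketch of what such a raw-text scan could look like (hypothetical; the PR's actual find_duplicate_json_keys may differ, and the regex assumes every key is a quoted string immediately followed by a colon):

import re
from collections import Counter

def find_duplicate_json_keys(file_path):
    # Read the raw text so duplicate keys are still visible (json.load would
    # collapse them), then count every quoted string followed by a colon.
    with open(file_path, "r") as f:
        raw = f.read()
    keys = re.findall(r'"([^"]+)"\s*:', raw)
    return [key for key, count in Counter(keys).items() if count > 1]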

@darko-marinov (Contributor) left a comment

Good starting point, but some changes are needed before this can be accepted into the checker.

from collections import Counter
import re

with open(file_path, "r") as f:
@darko-marinov (Contributor) commented:

Use the json library to read the file; it's presumably already used somewhere in this script.

@Kcruz28 (Contributor Author) replied:

json.load() is the standard way to parse JSON, but it automatically collapses duplicate keys by keeping only the last occurrence. Because of this behavior, any duplicate keys in the original file are lost during parsing. For that reason I did not use json.load().
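For illustration, the collapsing behavior can be seen directly (plain Python, not code from this PR):

import json

# The second "a" silently wins; the first occurrence is gone after parsing.
print(json.loads('{"a": 1, "a": 2}'))  # {'a': 2}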

@darko-marinov (Contributor) replied:

Thanks for clarifying about json.load(); I wasn't aware of the default behavior. It seems that json.loads(..., object_pairs_hook=...) allows checking for duplicates (and doesn't require you to parse keys manually, which assumes each key-value pair is on a single line).
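Something along those lines might look like the sketch below (an assumption-laden sketch, not the checker's code; reject_duplicate_keys is a hypothetical name):

import json
from collections import Counter

def reject_duplicate_keys(pairs):
    # The hook receives the raw (key, value) pairs of each JSON object before
    # it becomes a dict, so duplicate keys are still visible at this point.
    counts = Counter(key for key, _ in pairs)
    dups = [key for key, count in counts.items() if count > 1]
    if dups:
        raise ValueError(f"Duplicate keys detected: {dups}")
    return dict(pairs)

# Raises ValueError: Duplicate keys detected: ['a']
json.loads('{"a": 1, "a": 2}', object_pairs_hook=reject_duplicate_keys)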

dups = find_duplicate_json_keys(file_path)
if dups:
    log_esp_error(file_path, log, f"Duplicate in forked-projects.json keys detected: {dups}")
elif not is_fp_sorted(data):
@darko-marinov (Contributor) commented:

Does the sorting check require strictly increasing order or just non-decreasing? Can we make it strictly increasing so that it subsumes the check for duplicates?

@Kcruz28 (Contributor Author) replied:

Making the sorting check strictly increasing would catch duplicates in the loaded data. But json.load() collapses duplicates when parsing, so any duplicates in the original file wouldn’t be detected.
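To illustrate the point (hypothetical sketch; is_fp_sorted's real implementation isn't shown in this diff), a strict check only ever sees the keys that survived parsing:

import json

def is_strictly_increasing(keys):
    # Strict inequality also rejects adjacent repeats, but only among the keys
    # that json.load/json.loads kept after collapsing duplicates.
    return all(a < b for a, b in zip(keys, keys[1:]))

data = json.loads('{"a": 1, "a": 2, "b": 3}')  # duplicate "a" already collapsed
print(is_strictly_increasing(list(data)))      # True -- the file-level duplicate goes unnoticed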
