Include Lexeme support in Wikidata stats#36
Conversation
|
@ragesoss Could you review this when you are chanced. |
ragesoss
left a comment
There was a problem hiding this comment.
A note on how to read this: the review below was drafted with help from Claude Code. AI-assisted review feedback is offered as observations to consider, not authoritative conclusions — read it, take what's useful, ignore what isn't. The items flagged should be independently verified before acting on them.
Blocking
PR-introduced spec regressions in property_spec.rb and analyzer_spec.rb. Both files hardcode the full expected diff hash per revision, and the 16 new keys this PR adds (added/changed/removed_form_references, _form_qualifiers, _sense_references, _sense_qualifiers, plus added/changed_language, added/changed_lexical_category) aren't reflected in those hashes. lexeme_spec.rb was updated; these two weren't.
Reproduces locally on the PR branch:
$ bundle exec rspec
46 examples, 5 failures
rspec ./spec/wikidata/diff/property_spec.rb[1:1]
rspec ./spec/wikidata/diff/property_spec.rb[2:1]
rspec ./spec/wikidata/diff/analyzer_spec.rb:61
rspec ./spec/wikidata/diff/analyzer_spec.rb:71
rspec ./spec/wikidata/diff/analyzer_spec.rb:86
Same specs are 0/15 failures on main. Fix is mechanical — add the new keys with => 0 in each per-revision hash.
Lint
Six new RuboCop offenses on changed files; main is clean for the same set.
lib/wikidata/diff/inside_claim_analyzer.rb:70 Style/Next (autocorrectable)
lib/wikidata/diff/inside_claim_analyzer.rb:87 Style/Next (autocorrectable)
lib/wikidata/diff/revision_analyzer.rb Metrics/ClassLength [148/124]
lib/wikidata/diff/sense_analyzer.rb Metrics/ClassLength [132/124]
lib/wikidata/diff/sense_analyzer.rb Metrics/AbcSize [152.2/95]
lib/wikidata/diff/sense_analyzer.rb Metrics/MethodLength [130/98]
Either bump the limits in .rubocop_todo.yml to track the new ceiling, or extract a helper from SenseAnalyzer#isolate_senses_differences (the three nearly-identical "copy senseclaims fields" stanzas are a natural extraction point).
Coverage gap on the new form/sense reference & qualifier tracking
Across all 9 revisions in the updated lexeme_spec, every one of added/changed/removed_form_references, _form_qualifiers, _sense_references, _sense_qualifiers is 0. The new delegation paths through InsideClaimAnalyzer from FormAnalyzer and SenseAnalyzer would remain green even if entirely broken — they're verified by shape (the keys exist and sum to zero), not by behavior (any non-zero case).
Adding at least one revision that exercises a non-zero form-reference or sense-qualifier delta would pin the wiring. A direct unit test on InsideClaimAnalyzer with constructed fixtures would also work and avoids needing a real Wikidata revision.
Worth surfacing in the PR description
Beyond the FormAnalyzer#changed_forms overcount the description does call out, the PR also fixes two adjacent bugs in SenseAnalyzer that aren't mentioned:
- Changed senses had their inside-claim deltas miscomputed. On
main, theelsif current_claim != parent_claimbranch calledInsideClaimAnalyzer.isolate_inside_claim_differences(current_claim, nil)— passingnilfor the parent meant every claim under a changed sense was treated as added. The PR switches this to(current_sense, parent_sense), which correctly computes the delta. - Removed senses skipped their inside-claim removals entirely. On
main, the removed-sense block calledisolate_inside_claim_differences(current_claim, nil)wherecurrent_claimwas alreadynil(guarded bynext unless current_claim.nil?), so both args were nil and the call short-circuited. The PR switches this to(nil, parent_sense), which correctly tallies removals.
These look intentional and correct, but they're not in the description and aren't isolated by their own test. Worth either noting them in the PR description, or adding a focused unit test pinning the new behavior so it can't silently regress again.
Minor
LanguageAnalyzer doesn't handle the case where language or lexicalCategory is cleared (parent set, current nil). Probably fine given the lexeme data model requires both, but a one-line comment documenting the assumption would save the next reader the same investigation.
Drafted in a Claude Code session (~25 min, ~12 interactions). Sage directed the review phases and reviewed this comment before posting. Verified locally: the 5 spec failures are PR-introduced (clean on origin/main, fail on PR branch) and the 6 RuboCop offenses are PR-introduced (clean on origin/main for the same files). The changed_forms 7→2 arithmetic was checked against the spec revisions. The two SenseAnalyzer silent fixes were identified by diffing against origin/main:lib/wikidata/diff/sense_analyzer.rb but were not independently exercised by a constructed test. Other findings were spot-checked.
(Comment written by Claude Code.)
…me stat tracking - Added 16 new keys to property_spec.rb and analyzer_spec.rb to match the new diff shape - Extracted accumulate_senseclaims private helper in SenseAnalyzer to eliminate four repeated 9-line blocks resolves ClassLength, AbcSize, and MethodLength offenses without bumping limits - Autocorrected two Style/Next offenses in InsideClaimAnalyzer - Added inside_claim_spec.rb with fixture-based unit tests covering non-zero reference and qualifier paths - Added sense_analyzer_spec.rb pinning two SenseAnalyzer bug fixes: changed senses were passing nil as parent (overcounting claims as added), and removed senses were passing (nil, nil) (silently skipping all removals) - Bump rubocop Metrics/ClassLength to 148, this is genuinely needed since this PR adds new constants to the FORM_TYPES, SENSE_TYPES constants.
ragesoss
left a comment
There was a problem hiding this comment.
Re-review note (Claude Code–assisted; same caveats as before — treat as observations, not authoritative).
All five items from the prior review look addressed:
| Item | Resolution |
|---|---|
Spec regressions in property_spec / analyzer_spec |
New keys added to every per-revision hash |
| RuboCop offenses on changed files | Style/Next fixed in inside_claim_analyzer.rb; SenseAnalyzer refactored with extracted empty_senseclaim_totals / accumulate_senseclaims helpers, which drops AbcSize and MethodLength below the originally configured thresholds; Metrics/ClassLength max bumped 124→148 in .rubocop_todo.yml to track RevisionAnalyzer's new ceiling |
| Coverage gap on form/sense references & qualifiers | New spec/wikidata/diff/inside_claim_spec.rb adds 7 unit tests against constructed fixtures, covering added/removed/changed for both |
Undocumented SenseAnalyzer silent fixes |
New spec/wikidata/diff/sense_analyzer_spec.rb adds two specs that explicitly pin both behaviors, with comments referencing the prior behavior |
LanguageAnalyzer cleared-language assumption |
One-line comment documenting the assumption added |
Verified on the updated branch:
$ bundle exec rspec
55 examples, 0 failures (was 46 / 5)
$ bundle exec rubocop <7 touched lib files>
7 files inspected, no offenses detected
The SenseAnalyzer refactor preserves the exact (current, parent) arguments passed to InsideClaimAnalyzer at each call site, so the silent fixes from the prior commit remain in effect — checked against the previous commit's behavior.
No new issues found in the delta. One unrelated nit, non-blocking: FormAnalyzer still has the same shape of duplicated added_/removed_/changed_ stanzas that SenseAnalyzer just factored out. Not introduced by this PR and not flagged by RuboCop, so it's a consistency observation rather than a request.
Drafted in a Claude Code session (~10 min, ~6 interactions for the re-review). Sage directed the workflow and reviewed this comment before posting. Verified locally on the updated PR branch: full spec suite passes (55/0), RuboCop is clean on the seven touched lib files, and the previously-failing property_spec / analyzer_spec examples now pass. The SenseAnalyzer refactor was diffed against the prior PR commit to confirm the (current, parent) arguments to InsideClaimAnalyzer are preserved at every call site. The new spec files were read but not exhaustively exercised beyond running them.
(Comment written by Claude Code.)
|
Thanks @Formasitchijoh ! |
This PR is
PART Ito resolving #6878The analyzer's Lexeme support was minimal. While it tracked top-level lemmas, forms, senses, and glosses, it was missing several important categories of Lexeme-specific data that the Wiki Education Dashboard needs to provide meaningful stats on Lexeme contributions.
Changes
New: Language and lexical category tracking (LanguageAnalyzer)
Added a new
LanguageAnalyzerclass that tracks changes to the two Lexeme-only top-level fields:added_language / changed_languagewhen the lexeme's language (e.g. Q1860 = English) is set or changedadded_lexical_category / changed_lexical_categorywhen the part-of-speech (e.g. Q1084 = noun) is set or changedThese are only present on Lexeme entities and are zeroed out for items and properties.
Extended: References and qualifiers inside form and sense claims
InsideClaimAnalyzer(shared by both FormAnalyzer and SenseAnalyzer) now delegates to ReferenceAnalyzer and QualifierAnalyzer to count reference and qualifier changes within form-level and sense-level claims:added/removed/changed_form_references
added/removed/changed_form_qualifiers
added/removed/changed_sense_references
added/removed/changed_sense_qualifiers
Bug fixes: FormAnalyzer overcounting changed_forms
Wired into totals
All 16 new diff keys are added to
TOTAL_KEYSinwikidata-diff-analyzer.rband mapped in Total::KEY_MAPPING.Tests
spec/wikidata/diff/language_spec.rbwith unit tests covering all LanguageAnalyzer branches (new lexeme, language changed, category changed, nothing changed, both changed)AI Usage
Used AI to validate the important stats the Dashboard should track for Lexeme contributions and to understand the structure of Wikidata Lexeme API responses. Also used to reason through how InsideClaimAnalyzer could be extended to cover references and qualifiers for both forms and senses, and to identify the changed_forms overcount bug in FormAnalyzer.
Sample Lexeme
https://www.wikidata.org/wiki/Special:EntityData/L3796.json