Skip to content

Include Lexeme support in Wikidata stats#36

Merged
ragesoss merged 2 commits into
WikiEducationFoundation:mainfrom
Formasitchijoh:lexeme-stats
May 22, 2026
Merged

Include Lexeme support in Wikidata stats#36
ragesoss merged 2 commits into
WikiEducationFoundation:mainfrom
Formasitchijoh:lexeme-stats

Conversation

@Formasitchijoh
Copy link
Copy Markdown
Contributor

@Formasitchijoh Formasitchijoh commented May 20, 2026

This PR is PART I to resolving #6878

The analyzer's Lexeme support was minimal. While it tracked top-level lemmas, forms, senses, and glosses, it was missing several important categories of Lexeme-specific data that the Wiki Education Dashboard needs to provide meaningful stats on Lexeme contributions.

Changes

New: Language and lexical category tracking (LanguageAnalyzer)

Added a new LanguageAnalyzer class that tracks changes to the two Lexeme-only top-level fields:

added_language / changed_language when the lexeme's language (e.g. Q1860 = English) is set or changed
added_lexical_category / changed_lexical_category when the part-of-speech (e.g. Q1084 = noun) is set or changed
These are only present on Lexeme entities and are zeroed out for items and properties.

Extended: References and qualifiers inside form and sense claims

InsideClaimAnalyzer (shared by both FormAnalyzer and SenseAnalyzer) now delegates to ReferenceAnalyzer and QualifierAnalyzer to count reference and qualifier changes within form-level and sense-level claims:

added/removed/changed_form_references
added/removed/changed_form_qualifiers
added/removed/changed_sense_references
added/removed/changed_sense_qualifiers

Bug fixes: FormAnalyzer overcounting changed_forms

  • elsif current_form was always truthy for any non-nil form object, causing every revision that had a form to be counted as a form change even when nothing changed. Fixed to elsif current_form != parent_form.
  • Changed senses previously passed nil as the parent to InsideClaimAnalyzer, treating every claim on a changed sense as newly added. Fixed to pass parent_sense, so only actual deltas are counted.
  • Removed senses previously passed (nil, nil) to InsideClaimAnalyzer because current_sense was already nil, silently skipping all reference and qualifier removals. Fixed to pass (nil, parent_sense). Both fixes are covered by spec/wikidata/diff/sense_analyzer_spec.rb

Wired into totals

All 16 new diff keys are added to TOTAL_KEYS in wikidata-diff-analyzer.rb and mapped in Total::KEY_MAPPING.

Tests

  • New spec/wikidata/diff/language_spec.rb with unit tests covering all LanguageAnalyzer branches (new lexeme, language changed, category changed, nothing changed, both changed)
  • spec/wikidata/diff/lexeme_spec.rb updated with all 16 new keys across all 9 test revisions and the aggregated total; forms_changed totals corrected from 7 → 2 following the bug fix

AI Usage

Used AI to validate the important stats the Dashboard should track for Lexeme contributions and to understand the structure of Wikidata Lexeme API responses. Also used to reason through how InsideClaimAnalyzer could be extended to cover references and qualifiers for both forms and senses, and to identify the changed_forms overcount bug in FormAnalyzer.

Sample Lexeme

https://www.wikidata.org/wiki/Special:EntityData/L3796.json

@Formasitchijoh
Copy link
Copy Markdown
Contributor Author

@ragesoss Could you review this when you are chanced.

Copy link
Copy Markdown
Member

@ragesoss ragesoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note on how to read this: the review below was drafted with help from Claude Code. AI-assisted review feedback is offered as observations to consider, not authoritative conclusions — read it, take what's useful, ignore what isn't. The items flagged should be independently verified before acting on them.

Blocking

PR-introduced spec regressions in property_spec.rb and analyzer_spec.rb. Both files hardcode the full expected diff hash per revision, and the 16 new keys this PR adds (added/changed/removed_form_references, _form_qualifiers, _sense_references, _sense_qualifiers, plus added/changed_language, added/changed_lexical_category) aren't reflected in those hashes. lexeme_spec.rb was updated; these two weren't.

Reproduces locally on the PR branch:

$ bundle exec rspec
46 examples, 5 failures
rspec ./spec/wikidata/diff/property_spec.rb[1:1]
rspec ./spec/wikidata/diff/property_spec.rb[2:1]
rspec ./spec/wikidata/diff/analyzer_spec.rb:61
rspec ./spec/wikidata/diff/analyzer_spec.rb:71
rspec ./spec/wikidata/diff/analyzer_spec.rb:86

Same specs are 0/15 failures on main. Fix is mechanical — add the new keys with => 0 in each per-revision hash.

Lint

Six new RuboCop offenses on changed files; main is clean for the same set.

lib/wikidata/diff/inside_claim_analyzer.rb:70  Style/Next (autocorrectable)
lib/wikidata/diff/inside_claim_analyzer.rb:87  Style/Next (autocorrectable)
lib/wikidata/diff/revision_analyzer.rb         Metrics/ClassLength       [148/124]
lib/wikidata/diff/sense_analyzer.rb            Metrics/ClassLength       [132/124]
lib/wikidata/diff/sense_analyzer.rb            Metrics/AbcSize           [152.2/95]
lib/wikidata/diff/sense_analyzer.rb            Metrics/MethodLength      [130/98]

Either bump the limits in .rubocop_todo.yml to track the new ceiling, or extract a helper from SenseAnalyzer#isolate_senses_differences (the three nearly-identical "copy senseclaims fields" stanzas are a natural extraction point).

Coverage gap on the new form/sense reference & qualifier tracking

Across all 9 revisions in the updated lexeme_spec, every one of added/changed/removed_form_references, _form_qualifiers, _sense_references, _sense_qualifiers is 0. The new delegation paths through InsideClaimAnalyzer from FormAnalyzer and SenseAnalyzer would remain green even if entirely broken — they're verified by shape (the keys exist and sum to zero), not by behavior (any non-zero case).

Adding at least one revision that exercises a non-zero form-reference or sense-qualifier delta would pin the wiring. A direct unit test on InsideClaimAnalyzer with constructed fixtures would also work and avoids needing a real Wikidata revision.

Worth surfacing in the PR description

Beyond the FormAnalyzer#changed_forms overcount the description does call out, the PR also fixes two adjacent bugs in SenseAnalyzer that aren't mentioned:

  1. Changed senses had their inside-claim deltas miscomputed. On main, the elsif current_claim != parent_claim branch called InsideClaimAnalyzer.isolate_inside_claim_differences(current_claim, nil) — passing nil for the parent meant every claim under a changed sense was treated as added. The PR switches this to (current_sense, parent_sense), which correctly computes the delta.
  2. Removed senses skipped their inside-claim removals entirely. On main, the removed-sense block called isolate_inside_claim_differences(current_claim, nil) where current_claim was already nil (guarded by next unless current_claim.nil?), so both args were nil and the call short-circuited. The PR switches this to (nil, parent_sense), which correctly tallies removals.

These look intentional and correct, but they're not in the description and aren't isolated by their own test. Worth either noting them in the PR description, or adding a focused unit test pinning the new behavior so it can't silently regress again.

Minor

LanguageAnalyzer doesn't handle the case where language or lexicalCategory is cleared (parent set, current nil). Probably fine given the lexeme data model requires both, but a one-line comment documenting the assumption would save the next reader the same investigation.


Drafted in a Claude Code session (~25 min, ~12 interactions). Sage directed the review phases and reviewed this comment before posting. Verified locally: the 5 spec failures are PR-introduced (clean on origin/main, fail on PR branch) and the 6 RuboCop offenses are PR-introduced (clean on origin/main for the same files). The changed_forms 7→2 arithmetic was checked against the spec revisions. The two SenseAnalyzer silent fixes were identified by diffing against origin/main:lib/wikidata/diff/sense_analyzer.rb but were not independently exercised by a constructed test. Other findings were spot-checked.

(Comment written by Claude Code.)

…me stat tracking

- Added 16 new keys to property_spec.rb and analyzer_spec.rb to match the new diff shape
- Extracted accumulate_senseclaims private helper in SenseAnalyzer to eliminate four repeated 9-line blocks  resolves ClassLength, AbcSize, and MethodLength offenses without bumping limits
- Autocorrected two Style/Next offenses in InsideClaimAnalyzer
- Added inside_claim_spec.rb with fixture-based unit tests covering non-zero reference and qualifier paths
- Added sense_analyzer_spec.rb pinning two SenseAnalyzer bug fixes: changed senses were passing nil as parent (overcounting claims as added), and removed senses were passing (nil, nil) (silently skipping all removals)
- Bump  rubocop Metrics/ClassLength to 148, this is genuinely needed since this PR adds new constants to the FORM_TYPES, SENSE_TYPES constants.
@Formasitchijoh
Copy link
Copy Markdown
Contributor Author

@ragesoss @claude I made the changes requested

Copy link
Copy Markdown
Member

@ragesoss ragesoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review note (Claude Code–assisted; same caveats as before — treat as observations, not authoritative).

All five items from the prior review look addressed:

Item Resolution
Spec regressions in property_spec / analyzer_spec New keys added to every per-revision hash
RuboCop offenses on changed files Style/Next fixed in inside_claim_analyzer.rb; SenseAnalyzer refactored with extracted empty_senseclaim_totals / accumulate_senseclaims helpers, which drops AbcSize and MethodLength below the originally configured thresholds; Metrics/ClassLength max bumped 124→148 in .rubocop_todo.yml to track RevisionAnalyzer's new ceiling
Coverage gap on form/sense references & qualifiers New spec/wikidata/diff/inside_claim_spec.rb adds 7 unit tests against constructed fixtures, covering added/removed/changed for both
Undocumented SenseAnalyzer silent fixes New spec/wikidata/diff/sense_analyzer_spec.rb adds two specs that explicitly pin both behaviors, with comments referencing the prior behavior
LanguageAnalyzer cleared-language assumption One-line comment documenting the assumption added

Verified on the updated branch:

$ bundle exec rspec
55 examples, 0 failures        (was 46 / 5)

$ bundle exec rubocop <7 touched lib files>
7 files inspected, no offenses detected

The SenseAnalyzer refactor preserves the exact (current, parent) arguments passed to InsideClaimAnalyzer at each call site, so the silent fixes from the prior commit remain in effect — checked against the previous commit's behavior.

No new issues found in the delta. One unrelated nit, non-blocking: FormAnalyzer still has the same shape of duplicated added_/removed_/changed_ stanzas that SenseAnalyzer just factored out. Not introduced by this PR and not flagged by RuboCop, so it's a consistency observation rather than a request.


Drafted in a Claude Code session (~10 min, ~6 interactions for the re-review). Sage directed the workflow and reviewed this comment before posting. Verified locally on the updated PR branch: full spec suite passes (55/0), RuboCop is clean on the seven touched lib files, and the previously-failing property_spec / analyzer_spec examples now pass. The SenseAnalyzer refactor was diffed against the prior PR commit to confirm the (current, parent) arguments to InsideClaimAnalyzer are preserved at every call site. The new spec files were read but not exhaustively exercised beyond running them.

(Comment written by Claude Code.)

@ragesoss ragesoss merged commit 21a2914 into WikiEducationFoundation:main May 22, 2026
1 check passed
@ragesoss
Copy link
Copy Markdown
Member

Thanks @Formasitchijoh !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants