Extend Normalized Diffing to Package, URL, and PackageURL Operations #166

@sanchitram1

Background

Following the successful implementation of standardized dependency diffing in #126 (see PR #3 on fork), the same normalization pattern can be extended to diff_pkg, diff_url, and diff_pkg_url.

The current pattern established for dependencies:

PM-specific data → normalize_*() → NormalizedPackage → diff_dependencies() → results

This issue proposes extending NormalizedPackage to include all package data, enabling a single normalization step that feeds all diff operations.

Current State

Each PM's diff.py contains ~80-100 lines of nearly identical logic:

  • diff_pkg: Check if package exists in cache → return (pkg_id, new_pkg | None, update_payload | None)
  • diff_url: Resolve URLs against cache/new_urls → return dict[UUID, UUID] (url_type_id → url_id)
  • diff_pkg_url: Link packages to URLs → return (new_links, updates)

The logic is 90%+ identical across crates, homebrew, debian, and pkgx.
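To make the diff_pkg contract above concrete, here is a minimal sketch of the "check cache, then return (pkg_id, new_pkg | None, update_payload | None)" shape. The Package and Cache classes, the package_map attribute, and the choice of readme as the field that triggers an update are all assumptions for illustration, not the project's actual models.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class Package:  # hypothetical stand-in for the real Package model
    id: UUID
    import_id: str
    name: str
    readme: Optional[str]
    package_manager_id: UUID
    created_at: datetime
    updated_at: datetime


@dataclass
class Cache:  # hypothetical: import_id -> already-known Package
    package_map: dict[str, Package] = field(default_factory=dict)


def diff_pkg_sketch(
    import_id: str,
    name: str,
    readme: Optional[str],
    cache: Cache,
    pm_id: UUID,
    now: datetime,
) -> tuple[UUID, Optional[Package], Optional[dict]]:
    existing = cache.package_map.get(import_id)
    if existing is None:
        # Unknown package: create it, nothing to update.
        new_pkg = Package(uuid4(), import_id, name, readme, pm_id, now, now)
        return new_pkg.id, new_pkg, None
    if existing.readme != readme:
        # Known package whose readme changed (assumed update trigger).
        payload = {"id": existing.id, "readme": readme, "updated_at": now}
        return existing.id, None, payload
    # Known package, nothing changed.
    return existing.id, None, None
```

Because only the inputs (import_id, name, readme) differ per package manager, a function of this shape is a plausible candidate for sharing.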

Proposed Approach

Extend NormalizedPackage (single dataclass)

Instead of creating separate dataclasses for each operation, extend the existing NormalizedPackage to hold all normalized data:

from dataclasses import dataclass
from uuid import UUID

@dataclass(frozen=True)
class ParsedURL:
    url: str
    url_type_id: UUID

@dataclass
class NormalizedPackage:
    # Package identification
    identifier: str          # import_id
    derived_id: str
    name: str
    readme: str | None
    
    # URLs for this package
    urls: list[ParsedURL]
    
    # Dependencies (already implemented in #126)
    dependencies: list[ParsedDependency]

Shared diff functions in core/diff.py

def diff_package(
    normalized: NormalizedPackage,
    cache: Cache,
    pm_id: UUID,
    now: datetime,
) -> tuple[UUID, Package | None, dict | None]:
    """Shared package diffing logic."""
    ...

def diff_urls(
    urls: list[ParsedURL],
    cache: Cache,
    new_urls: dict[URLKey, URL],
    now: datetime,
) -> dict[UUID, UUID]:
    """Shared URL resolution logic."""
    ...

def diff_package_urls(
    pkg_id: UUID,
    resolved_urls: dict[UUID, UUID],
    cache: Cache,
    now: datetime,
) -> tuple[list[PackageURL], list[dict]]:
    """Shared package-URL linking logic."""
    ...
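As one possible reading of the diff_urls signature, the sketch below resolves each ParsedURL against the cache first, then against new_urls, and creates a new URL only when neither has it. URLKey as a (url, url_type_id) tuple, the URL fields, and the Cache.url_map attribute are assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import NamedTuple
from uuid import UUID, uuid4


class URLKey(NamedTuple):  # assumed key shape: (url, url_type_id)
    url: str
    url_type_id: UUID


@dataclass
class URL:  # hypothetical stand-in for the real URL model
    id: UUID
    url: str
    url_type_id: UUID
    created_at: datetime
    updated_at: datetime


@dataclass(frozen=True)
class ParsedURL:
    url: str
    url_type_id: UUID


@dataclass
class Cache:  # hypothetical: URLs already stored in the database
    url_map: dict[URLKey, URL] = field(default_factory=dict)


def diff_urls_sketch(
    urls: list[ParsedURL],
    cache: Cache,
    new_urls: dict[URLKey, URL],
    now: datetime,
) -> dict[UUID, UUID]:
    resolved: dict[UUID, UUID] = {}  # url_type_id -> url_id
    for parsed in urls:
        key = URLKey(parsed.url, parsed.url_type_id)
        found = cache.url_map.get(key)
        if found is None:
            # May have been created earlier in this run for another package.
            found = new_urls.get(key)
        if found is None:
            found = URL(uuid4(), parsed.url, parsed.url_type_id, now, now)
            new_urls[key] = found  # deliberate side effect on the caller's dict
        resolved[parsed.url_type_id] = found.id
    return resolved
```

Since the return value is keyed by url_type_id, two URLs of the same type on one package would collapse to the last one seen in this sketch; whether the existing per-PM code behaves that way is worth confirming before consolidating.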

PM normalizers become complete

Each PM's normalizer.py provides a single function that produces a complete NormalizedPackage:

def normalize_crates_package(crate: Crate, config: Config) -> NormalizedPackage:
    """Convert Crate to complete NormalizedPackage with all fields."""
    ...
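A rough sketch of what a complete normalizer might look like follows. The Crate fields (id, name, readme, homepage, repository), the URLTypes holder used in place of the full Config, and the "crates/<name>" derived_id format are assumptions for illustration only; dependencies are left empty because that part is covered by #126.

```python
from dataclasses import dataclass, field
from typing import Optional
from uuid import UUID, uuid4


@dataclass(frozen=True)
class ParsedURL:
    url: str
    url_type_id: UUID


@dataclass
class NormalizedPackage:
    identifier: str
    derived_id: str
    name: str
    readme: Optional[str]
    urls: list[ParsedURL]
    dependencies: list = field(default_factory=list)  # see #126


@dataclass
class Crate:  # hypothetical subset of the parsed crate record
    id: int
    name: str
    readme: Optional[str] = None
    homepage: Optional[str] = None
    repository: Optional[str] = None


@dataclass
class URLTypes:  # hypothetical: stands in for the URL type IDs held by Config
    homepage: UUID
    repository: UUID


def normalize_crates_package_sketch(crate: Crate, url_types: URLTypes) -> NormalizedPackage:
    # Pair each optional URL with its type and keep only the ones present.
    candidates = [
        (crate.homepage, url_types.homepage),
        (crate.repository, url_types.repository),
    ]
    urls = [ParsedURL(url, type_id) for url, type_id in candidates if url]
    return NormalizedPackage(
        identifier=str(crate.id),
        derived_id=f"crates/{crate.name}",  # assumed format
        name=crate.name,
        readme=crate.readme,
        urls=urls,
    )
```

PM-specific quirks in URL generation would live entirely inside a function like this, so the shared diff functions never see them.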

PM diff.py after refactor

def diff_pkg(self, pkg: Crate) -> tuple[UUID, Package | None, dict | None]:
    normalized = normalize_crates_package(pkg, self.config)
    return diff_package(normalized, self.caches, self.config.pm_config.pm_id, self.now)

def diff_url(self, pkg: Crate, new_urls: dict[URLKey, URL]) -> dict[UUID, UUID]:
    normalized = normalize_crates_package(pkg, self.config)
    return diff_urls(normalized.urls, self.caches, new_urls, self.now)

def diff_pkg_url(self, pkg_id: UUID, resolved_urls: dict[UUID, UUID]) -> tuple[...]:
    return diff_package_urls(pkg_id, resolved_urls, self.caches, self.now)

Work Items

Core Infrastructure

  • Add ParsedURL dataclass to core/diff.py
  • Extend NormalizedPackage with derived_id, name, readme, urls fields
  • Implement diff_package() in core/diff.py
  • Implement diff_urls() in core/diff.py
  • Implement diff_package_urls() in core/diff.py
  • Add unit tests for shared functions

Package Manager Refactors

  • Update crates normalizer.py to produce complete NormalizedPackage
  • Update crates diff.py to use shared functions
  • Update homebrew normalizer.py to produce complete NormalizedPackage
  • Update homebrew diff.py to use shared functions
  • Update pkgx normalizer.py to produce complete NormalizedPackage
  • Update pkgx diff.py to use shared functions
  • Update debian normalizer.py to produce complete NormalizedPackage
  • Update debian diff.py to use shared functions

Acceptance Criteria

  • All existing tests pass
  • Each PM's diff_pkg, diff_url, diff_pkg_url reduced to ~3-5 lines each
  • Shared logic has comprehensive unit tests
  • No behavioral changes (same output for same input)
  • NormalizedPackage is the single source of truth for all diff operations

Notes

  • diff_pkg_url is already nearly identical across all PMs—easiest to consolidate first
  • diff_url mutates the new_urls dict passed in; this side effect should be preserved
  • Some PMs have slight variations in URL generation (e.g., debian's _generate_chai_urls)—normalizers handle this
  • Consider caching the normalized package if multiple diff operations are called sequentially
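The caching idea in the last note could be sketched as a small per-run memo on the diff object, so that diff_pkg and diff_url normalize a given package only once. The class layout, the _get_normalized helper, and keying by the package's id are hypothetical choices, and the Crate and NormalizedPackage types are trimmed down for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Crate:  # trimmed, hypothetical
    id: int
    name: str


@dataclass
class NormalizedPackage:  # trimmed to two fields for the example
    identifier: str
    name: str


def normalize_crate(crate: Crate) -> NormalizedPackage:
    return NormalizedPackage(identifier=str(crate.id), name=crate.name)


class Diff:
    def __init__(self, normalize: Callable[[Crate], NormalizedPackage]):
        self._normalize = normalize
        # identifier -> NormalizedPackage, lives for one diff run
        self._normalized: dict[str, NormalizedPackage] = {}

    def _get_normalized(self, pkg: Crate) -> NormalizedPackage:
        key = str(pkg.id)
        cached = self._normalized.get(key)
        if cached is None:
            cached = self._normalize(pkg)
            self._normalized[key] = cached
        return cached

    def diff_pkg(self, pkg: Crate) -> str:
        return self._get_normalized(pkg).name  # placeholder for diff_package(...)

    def diff_url(self, pkg: Crate) -> str:
        return self._get_normalized(pkg).identifier  # placeholder for diff_urls(...)
```

A memo like this grows with the number of packages processed in a run, so for large registries it may be preferable to normalize once in the caller and pass the NormalizedPackage into each diff method instead.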
