Fix: skip-existing on PyPI publish (recurring 400 File already exists) #17
Closed
NameetP wants to merge 10 commits into
…packaging

The 0.1.0 release, published 2026-03-30, shipped the reader as a flat top-level module (llama_index_readers_pdfmux), so the conventional `from llama_index.readers.pdfmux import PDFMuxReader` import failed: the package was a broken LlamaIndex integration for 2+ months. 0.1.1 ships the correct llama_index/readers/pdfmux/ namespace package that LlamaIndex + LlamaHub expect, conforms PDFMuxReader to BaseReader, adds extra_info metadata merging, tightens dep pins, and adds tests (4 passing). Published to PyPI; LlamaIndex no longer accepts monorepo integration PRs (auto-closed), so this is maintained here + published independently. See PUBLISH.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lama-index-readers-pdfmux

The README tagline claimed "LangChain + LlamaIndex loaders" but never gave the package names, install commands, or usage, so a RAG dev had no path from "it exists" to "pip install it". Both packages are live on PyPI (langchain-pdfmux 0.2.0, llama-index-readers-pdfmux 0.1.1); this documents the install + usage + the shared confidence-in-metadata contract that lets users filter low-confidence chunks before indexing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds PageDecision + RepairAttempt types and threads a per-page decision record (audit class/score -> budget verdict -> cascade outcomes incl. accepted/rejected repair attempts) through _multipass_extract -> process -> JSON output under an additive 'decision_trace' key. Backward-compatible additive schema bump 1.1.0 -> 1.2.0. Build #1 of the patent enabling builds (Appendix D). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Build #3 of the patent-enabling plan (claim family F: tiered provenance). Carries the recognition tier that produced each page's final text, plus the sub-page geometry (region bboxes) for the one tier that has it, end to end.

- types.py: PageResult gains `provenance_tier` ("native"|"region"|"page"|"llm") and `regions` (the OCR'd WeakRegions). PageDecision gains `provenance_tier` and `region_bboxes`. Default tier is "page": a bare PageResult under-claims rather than asserting spurious glyph geometry (honest tiering is the point).
- regions.py: region_ocr_page now returns the WeakRegions it recovered text from, so the caller can attach their bboxes as "region"-tier provenance.
- pipeline.py: _multipass_extract tags each recovered page native/region/page/llm and threads regions + bboxes into the PageResult and PageDecision; native pages carry no region geometry. Both process() reconstruction points (image-table OCR, Arabic bidi) preserve the new fields.
- json_fmt.py: schema 1.2.0 -> 1.3.0; decision_trace entries now carry provenance_tier + region_bboxes (additive).
- Also removes two latent lint issues from the un-pushed decision-trace base (quoted annotations, an over-long comment) so the stack is CI-green.

Tests: +5 (TestTieredProvenance, deterministic, no OCR engine needed) + decision-trace assertions. Full suite 689 passed / 3 skipped. Behavior unchanged on clean digital PDFs (all pages "native", no escalation).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Build #2 of the patent-enabling plan (claim family 4: the strongest new, narrowly-defensible island). Replaces the inconsistent accept-a-repair logic (char-length in three pipeline paths, raw confidence in agentic) with ONE calibrated gate, and retains every rejected candidate on the decision trace.

audit.py: new `accept_repair(original, candidate, *, additive=...)`, the §5.8 guard, returning (accepted, score_before, score_after, reason):

1. Additive-patch-only for trusted native spans: a page whose audit score clears NATIVE_TRUST_THRESHOLD (0.80, env PDFMUX_NATIVE_TRUST) may only be augmented (region OCR), never wholesale-replaced by full OCR / LLM.
2. Hard-fail signals (`_repair_hard_fail`): reject regardless of score delta if the candidate introduces mojibake, collapses the alphabetic ratio, suspiciously shortens (>50%), or loses headings/tables.
3. Calibrated audit-delta gate: full replacement must strictly beat the original by PDFMUX_REPAIR_MARGIN (default 0.0); additive patches need only be non-decreasing. Both arms are monotonic: quality never drops.

Plus `repair_score_delta`, the single calibrated signal both paths share.

pipeline.py: `_multipass_extract` routes region OCR (additive), full-page OCR, and the LLM through `accept_repair`, and records EVERY attempt (accepted AND rejected) into the page's PageDecision. A page that stays native still carries its rejected attempts. Rejected-patch retention is the cleanest trace differentiator and was the unclaimed white space the plan identified.

agentic.py: the `:123` raw-confidence gate now uses `accept_repair` too, so a longer-but-worse re-extraction can no longer win on length/confidence alone.

Tests: +15 (tests/test_repair_gate.py: gate semantics, hard-fail signals, the margin knob, and deterministic accepted/rejected integration via a monkeypatched region-OCR). Updated test_agentic.py fixtures to realistic text so the calibrated gate is genuinely exercised. Full suite 704 passed / 3 skipped.

No quality regression: eval/run_eval + eval/calibrate are byte-identical before and after. The strict gate holds at 0.75 / precision 1.000 / recall 0.893.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
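The three-step gate described in this commit can be sketched roughly as follows. This is only an illustration of the ordering of the checks, not the code in audit.py: `toy_score`, `hard_fail`, and `accept_repair_sketch` are hypothetical names, the scoring function is a stand-in for the real audit score, only two of the listed hard-fail signals are modelled, and treating "clears the threshold" as `>=` is an assumption.

```python
from typing import Callable, Optional, Tuple


def toy_score(text: str) -> float:
    """Hypothetical stand-in for the audit score: share of letters and spaces."""
    if not text:
        return 0.0
    good = sum(ch.isalpha() or ch.isspace() for ch in text)
    return good / len(text)


def hard_fail(original: str, candidate: str) -> Optional[str]:
    """Simplified hard-fail signals (mojibake and >50% shortening only)."""
    if "\ufffd" in candidate and "\ufffd" not in original:
        return "introduces mojibake"
    if len(candidate) < 0.5 * len(original):
        return "suspiciously shortened"
    return None


def accept_repair_sketch(
    original: str,
    candidate: str,
    *,
    additive: bool,
    score: Callable[[str], float] = toy_score,
    trust: float = 0.80,   # mirrors the NATIVE_TRUST_THRESHOLD default in the commit
    margin: float = 0.0,   # mirrors the PDFMUX_REPAIR_MARGIN default in the commit
) -> Tuple[bool, float, float, str]:
    before, after = score(original), score(candidate)

    # 1. A trusted native page may only be augmented, never replaced wholesale.
    if not additive and before >= trust:
        return (False, before, after, "trusted native page: additive patches only")

    # 2. Hard-fail signals reject regardless of the score delta.
    reason = hard_fail(original, candidate)
    if reason is not None:
        return (False, before, after, reason)

    # 3. Delta gate: replacements must strictly beat the original by the margin;
    #    additive patches only need to be non-decreasing.
    ok = after >= before if additive else after > before + margin
    return (ok, before, after, "accepted" if ok else "no audit-score improvement")
```

Returning the scores and a reason even on rejection is what would let a caller keep rejected attempts on the decision trace, as the commit describes.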
…mbargoed)

Documents decision trace (1.2.0), tiered provenance (1.3.0), and the monotonic repair guard under an explicit "HELD FOR PATENT FILING — DO NOT PUBLISH" header, plus the two new env vars. Local-only, like the code it describes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The flag was parsed but never forwarded — extract_json takes no use_cache arg — so back-to-back eval runs silently returned stale cached scores. Set the env toggle before importing pdfmux so --no-cache actually disables the result cache. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ema 1.4.0)

Build #4 of the patent-enabling plan: Part A (policy-as-data) + Part B (closed-loop calibration). Strengthens claim families B/C/G and enables the task-conditioned + calibrated dependent claims.

Part A: versioned policy object (policy.py):
- One frozen `Policy` holds every extraction tunable: audit thresholds + score_page penalties/bands, OCR-budget params, Arabic + table thresholds, strict gate, and the repair margin/trust/hard-fail tolerances. Carries a `policy_id` (pdfmux-policy-v1.7), emitted in JSON output for reproducibility.
- `load_policy()` folds PDFMUX_* env overrides in at load time and suffixes the policy_id with a content hash when any value changes, so a tuned run never masquerades as canonical.
- audit.py / detect.py / pipeline.py now read from the policy; the old module constants stay as backwards-compatible aliases. Default policy reproduces the historical constants exactly, so the change is behavior-neutral (eval byte-identical).

Part B: runtime calibration loop (calibration.py + `pdfmux calibrate`):
- Stdlib isotonic (PAVA) + Platt fitters, Expected Calibration Error, and `fit_calibration` (annotates ECE before/after). The map is monotonic: a higher raw score never yields a lower probability.
- `pdfmux calibrate <labelled-dir> --method isotonic|platt --target ... --out` scores the labelled PDFs on their RAW confidence, fits the map, prints a reliability table + ECE, and writes a versioned policy file.
- Runtime reloads it (`load_policy_file`, PDFMUX_POLICY_FILE) and applies the map so `confidence` is a calibrated probability. Closed loop: calibrate → write policy → reload. No-op (identity) until a policy is written, so default behavior is unchanged. eval/calibrate.py now reports the same ECE.
- On the 50-fixture set an isotonic fit cuts in-sample ECE 0.133 → 0.000.

Patent: claim the closed loop over the audit features, never the math.

JSON schema 1.3.0 → 1.4.0 (additive: top-level `policy_id`).

Tests: +30 (test_policy.py, test_calibration.py incl. an end-to-end calibrate→reload→apply case). Full suite 734 passed / 3 skipped; ruff + format clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
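For readers unfamiliar with the Part B terms, here is a minimal stdlib sketch of an isotonic (PAVA) fit and an Expected Calibration Error computation. It is not the code in calibration.py: the function names `pava_fit`, `apply_map`, and `ece`, the step-function representation, and the equal-width 10-bin ECE are all assumptions made for illustration, and the data is made up.

```python
from typing import List, Sequence, Tuple


def pava_fit(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Pool-adjacent-violators: returns (upper_score, probability) steps,
    with probabilities non-decreasing in the score."""
    blocks: List[List[float]] = []  # each block: [label_sum, count, upper_score]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1.0, s])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def apply_map(steps: List[Tuple[float, float]], x: float) -> float:
    """Map a raw score to the probability of the first step that covers it."""
    for upper, prob in steps:
        if x <= upper:
            return prob
    return steps[-1][1]


def ece(probs: Sequence[float], labels: Sequence[int], n_bins: int = 10) -> float:
    """Expected Calibration Error over equal-width confidence bins."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    err = 0.0
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            err += len(b) / len(probs) * abs(conf - acc)
    return err


# Made-up raw confidences and correctness labels, for illustration only.
raw = [0.2, 0.4, 0.5, 0.7, 0.9, 0.95]
correct = [0, 0, 1, 0, 1, 1]
steps = pava_fit(raw, correct)
calibrated = [apply_map(steps, s) for s in raw]
```

Because the fit and the evaluation here use the same six points, the calibrated ECE is an in-sample figure; a held-out set would be needed to say anything about generalization.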
…on_trace

Integrates the cache-key fix (from the concurrent nifty-colden session, commits 93c508e + f40b7a1) rebased onto the build-#4 stack (schema 1.4.0).

- json_fmt.py: hoist the schema version into a single `SCHEMA_VERSION` constant.
- result_cache.py: fold SCHEMA_VERSION into the cache key (filename gains an `__sv<hash>` segment) so a schema bump (1.2.0 → 1.3.0 → 1.4.0) invalidates every prior entry instead of serving a warm result that carries the old schema_version or lacks newly-added fields (decision_trace, provenance_tier, policy_id). Also preserves decision_trace on cache hits.

This closes the stale-cache footgun that masked an eval comparison earlier in this build (back-to-back runs returned cached scores across a code change).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
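The idea of folding the schema version into the cache key can be illustrated with a short sketch. Only the `SCHEMA_VERSION` constant and the `__sv<hash>` filename segment come from the commit message; the function name `cache_filename`, the other key components, and the hash lengths are hypothetical.

```python
import hashlib

SCHEMA_VERSION = "1.4.0"  # single source of truth, as in the commit


def _short_hash(data: bytes, length: int) -> str:
    return hashlib.sha256(data).hexdigest()[:length]


def cache_filename(pdf_bytes: bytes, options: str, schema_version: str = SCHEMA_VERSION) -> str:
    """Hypothetical cache filename: content hash, options hash, schema-version hash."""
    content = _short_hash(pdf_bytes, 16)
    opts = _short_hash(options.encode("utf-8"), 8)
    sv = _short_hash(schema_version.encode("utf-8"), 8)
    # The __sv segment changes whenever the schema version changes, so entries
    # written under an older schema are simply never looked up again.
    return f"{content}__{opts}__sv{sv}.json"


old_name = cache_filename(b"%PDF-1.7 example", "quality=standard", "1.3.0")
new_name = cache_filename(b"%PDF-1.7 example", "quality=standard", "1.4.0")
```

With this layout a schema bump needs no explicit cache purge: old files become unreachable rather than being served with a stale `schema_version`.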
Re-running the Publish to PyPI workflow on an already-published version fails with 400 File already exists (seen 2026-06-15, 2026-06-16). Adding skip-existing makes re-runs idempotent: PyPI's rejections of pre-existing files are treated as a no-op instead of a hard failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Closing: this branch was inadvertently cut from a local main carrying embargoed (pre-patent-filing) commits, exposing them publicly. Deleting the branch and reopening a clean fix branched off origin/main. See replacement PR.
Problem
The Publish to PyPI workflow (`.github/workflows/publish.yml`) fails with `400 File already exists` whenever it re-runs against a package version that is already on PyPI. Observed on 2026-06-15 and 2026-06-16.

This happens because PyPI rejects uploads of artifacts that already exist (immutable releases). A re-run (a manual retry, a re-published GitHub release, or any workflow restart on the same version) hits this hard failure.
Fix
Add `skip-existing: true` to the `pypa/gh-action-pypi-publish` step. With this flag, the action treats already-present files as a no-op instead of erroring, making re-runs idempotent. Net-new files still upload normally.

This is the official, safe remedy recommended by the pypa action for exactly this scenario. No source code or build logic touched; CI/CD only.
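The behavioral difference the flag makes can be shown with a toy model. This is not how the action or PyPI is implemented: `publish`, `FileAlreadyExists`, the in-memory `index` set, and the artifact names are all hypothetical, chosen only to show why a re-run fails without the flag and becomes a no-op with it.

```python
from typing import List, Set


class FileAlreadyExists(Exception):
    """Stands in for the index's '400 File already exists' response."""


def publish(index: Set[str], artifacts: List[str], *, skip_existing: bool = False) -> List[str]:
    """Upload artifacts to a toy index and return the names actually uploaded."""
    uploaded = []
    for name in artifacts:
        if name in index:
            if skip_existing:
                continue  # already published: treat as a no-op
            raise FileAlreadyExists(f"400 File already exists: {name}")
        index.add(name)
        uploaded.append(name)
    return uploaded


index: Set[str] = set()
dists = ["example-0.1.1.tar.gz", "example-0.1.1-py3-none-any.whl"]  # hypothetical names

first_run = publish(index, dists, skip_existing=True)   # uploads both files
second_run = publish(index, dists, skip_existing=True)  # uploads nothing, no error
```

Running the same list again with `skip_existing=False` raises `FileAlreadyExists`, which corresponds to the failure observed before this change.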
🤖 Generated with Claude Code