Fix: skip-existing on PyPI publish (recurring 400 File already exists) by NameetP · Pull Request #17 · NameetP/pdfmux

NameetP · 2026-06-29T06:43:43Z

Problem

The Publish to PyPI workflow (.github/workflows/publish.yml) fails with 400 File already exists whenever it re-runs against a package version that is already on PyPI. Observed on 2026-06-15 and 2026-06-16.

This happens because PyPI rejects uploads of artifacts that already exist (immutable releases). A re-run — manual retry, a re-published GitHub release, or any workflow restart on the same version — hits this hard failure.

Fix

Add skip-existing: true to the pypa/gh-action-pypi-publish step. With this flag, the action treats already-present files as a no-op instead of erroring, making re-runs idempotent. Net-new files still upload normally.

      - uses: pypa/gh-action-pypi-publish@release/v1
        with:
          skip-existing: true

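skip-existing is the option the pypa action documents for tolerating files that already exist on the index. Its README suggests enabling it sparingly, since it also hides genuinely unintended duplicate uploads. No source code or build logic is touched; the change is CI/CD only.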
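
To make "idempotent re-run" concrete, here is a minimal Python sketch of the behaviour in isolation. It is not the action's implementation: `FakeIndex`, `FileAlreadyExists`, and `publish` are hypothetical names standing in for PyPI and the upload step.

```python
class FileAlreadyExists(Exception):
    """Stands in for PyPI's '400 File already exists' response (hypothetical)."""


class FakeIndex:
    """Toy in-memory package index whose files are immutable (hypothetical)."""

    def __init__(self):
        self.files = set()

    def upload(self, filename):
        if filename in self.files:
            raise FileAlreadyExists(filename)
        self.files.add(filename)


def publish(index, filenames, skip_existing=False):
    """Upload each file and return the names that were actually uploaded."""
    uploaded = []
    for name in filenames:
        try:
            index.upload(name)
        except FileAlreadyExists:
            if not skip_existing:
                raise  # default behaviour: a re-run fails hard
            continue  # skip_existing: an already-present file is a no-op
        uploaded.append(name)
    return uploaded
```

With `skip_existing=True`, a second call with the same file list returns an empty list instead of raising, while any file not yet on the index still uploads.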

🤖 Generated with Claude Code

…packaging The 0.1.0 release, published 2026-03-30, shipped the reader as a flat top-level module (llama_index_readers_pdfmux), so the conventional `from llama_index.readers.pdfmux import PDFMuxReader` import failed and the package was a broken LlamaIndex integration for 2+ months. 0.1.1 ships the correct llama_index/readers/pdfmux/ namespace package that LlamaIndex and LlamaHub expect, conforms PDFMuxReader to BaseReader, adds extra_info metadata merging, tightens dependency pins, and adds tests (4 passing). It is published to PyPI; because LlamaIndex no longer accepts monorepo integration PRs (they are auto-closed), the package is maintained here and published independently. See PUBLISH.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…lama-index-readers-pdfmux The README tagline claimed "LangChain + LlamaIndex loaders" but never gave the package names, install commands, or usage — so a RAG dev had no path from "it exists" to "pip install it". Both packages are live on PyPI (langchain-pdfmux 0.2.0, llama-index-readers-pdfmux 0.1.1); this documents the install + usage + the shared confidence-in-metadata contract that lets users filter low-confidence chunks before indexing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
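
As an illustration of filtering on confidence carried in metadata, here is a minimal sketch that avoids depending on either loader package. The `Chunk` class, the `"confidence"` metadata key, and the 0.8 threshold are assumptions for the example, not the documented contract.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a loader's document object (hypothetical)."""

    text: str
    metadata: dict = field(default_factory=dict)


def keep_confident(chunks, min_confidence=0.8):
    """Return only chunks whose metadata confidence meets the threshold.

    Chunks with no recorded confidence are treated as 0.0 and dropped.
    The "confidence" key name is an assumption for this sketch.
    """
    return [
        c for c in chunks
        if c.metadata.get("confidence", 0.0) >= min_confidence
    ]
```

Running the filter before indexing keeps low-confidence extractions out of the vector store rather than having to remove them afterwards.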

Adds PageDecision + RepairAttempt types and threads a per-page decision record (audit class/score -> budget verdict -> cascade outcomes incl. accepted/rejected repair attempts) through _multipass_extract -> process -> JSON output under an additive 'decision_trace' key. Backward-compatible additive schema bump 1.1.0 -> 1.2.0. Build #1 of the patent enabling builds (Appendix D). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Build #3 of the patent-enabling plan (claim family F: tiered provenance). Carries the recognition tier that produced each page's final text, plus the sub-page geometry (region bboxes) for the one tier that has it, end to end.

- types.py: PageResult gains `provenance_tier` ("native"|"region"|"page"|"llm") and `regions` (the OCR'd WeakRegions). PageDecision gains `provenance_tier` and `region_bboxes`. Default tier is "page": a bare PageResult under-claims rather than asserting spurious glyph geometry (honest tiering is the point).
- regions.py: region_ocr_page now returns the WeakRegions it recovered text from, so the caller can attach their bboxes as "region"-tier provenance.
- pipeline.py: _multipass_extract tags each recovered page native/region/page/llm and threads regions + bboxes into the PageResult and PageDecision; native pages carry no region geometry. Both process() reconstruction points (image-table OCR, Arabic bidi) preserve the new fields.
- json_fmt.py: schema 1.2.0 -> 1.3.0; decision_trace entries now carry provenance_tier + region_bboxes (additive).
- Also removes two latent lint issues from the un-pushed decision-trace base (quoted annotations, an over-long comment) so the stack is CI-green.

Tests: +5 (TestTieredProvenance, deterministic, no OCR engine needed) + decision-trace assertions. Full suite 689 passed / 3 skipped. Behavior unchanged on clean digital PDFs (all pages "native", no escalation).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
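
The tiering rule described above (default to "page", geometry only on the "region" tier) can be sketched as a small dataclass. `PageResultSketch` and its fields other than `provenance_tier` and `region_bboxes` are hypothetical and do not reproduce the real types.py.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout
VALID_TIERS = ("native", "region", "page", "llm")


@dataclass
class PageResultSketch:
    """Illustrative only; not the project's PageResult."""

    page_num: int
    text: str
    # A bare result defaults to "page" so it never over-claims geometry.
    provenance_tier: str = "page"
    region_bboxes: List[BBox] = field(default_factory=list)

    def __post_init__(self):
        if self.provenance_tier not in VALID_TIERS:
            raise ValueError(f"unknown tier: {self.provenance_tier!r}")
        # Only the "region" tier has sub-page geometry to report.
        if self.region_bboxes and self.provenance_tier != "region":
            raise ValueError("only the 'region' tier may carry bboxes")
```

Rejecting bboxes on any other tier is one way to enforce the "honest tiering" idea at construction time; the actual code may enforce it elsewhere or not at all.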

Build #2 of the patent-enabling plan (claim family 4, the strongest new, narrowly-defensible island). Replaces the inconsistent accept-a-repair logic (char-length in three pipeline paths, raw confidence in agentic) with ONE calibrated gate, and retains every rejected candidate on the decision trace.

audit.py: new `accept_repair(original, candidate, *, additive=...)`, the §5.8 guard, returning (accepted, score_before, score_after, reason):

1. Additive-patch-only for trusted native spans: a page whose audit score clears NATIVE_TRUST_THRESHOLD (0.80, env PDFMUX_NATIVE_TRUST) may only be augmented (region OCR), never wholesale-replaced by full OCR / LLM.
2. Hard-fail signals (`_repair_hard_fail`): reject regardless of score delta if the candidate introduces mojibake, collapses the alphabetic ratio, suspiciously shortens (>50%), or loses headings/tables.
3. Calibrated audit-delta gate: full replacement must strictly beat the original by PDFMUX_REPAIR_MARGIN (default 0.0); additive patches need only be non-decreasing. Both arms are monotonic: quality never drops.

Plus `repair_score_delta`, the single calibrated signal both paths share.

pipeline.py: `_multipass_extract` routes region OCR (additive), full-page OCR, and the LLM through `accept_repair`, and records EVERY attempt (accepted AND rejected) into the page's PageDecision. A page that stays native still carries its rejected attempts; rejected-patch retention is the cleanest trace differentiator and was the unclaimed white space the plan identified.

agentic.py: the `:123` raw-confidence gate now uses `accept_repair` too, so a longer-but-worse re-extraction can no longer win on length/confidence alone.

Tests: +15 (tests/test_repair_gate.py: gate semantics, hard-fail signals, the margin knob, and deterministic accepted/rejected integration via a monkeypatched region-OCR). Updated test_agentic.py fixtures to realistic text so the calibrated gate is genuinely exercised. Full suite 704 passed / 3 skipped.

No quality regression: eval/run_eval + eval/calibrate are byte-identical before and after; strict gate holds at 0.75 / precision 1.000 / recall 0.893.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
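
The three-step gate can be sketched roughly as follows. This is a simplified illustration, not the code in audit.py: `toy_score` is a placeholder for the real audit score, only one hard-fail signal (shortening by more than 50%) is modelled, and `accept_repair_sketch` is a hypothetical name. The 0.80 threshold and 0.0 margin are the defaults quoted above.

```python
NATIVE_TRUST_THRESHOLD = 0.80  # default quoted in the commit message
REPAIR_MARGIN = 0.0            # default quoted in the commit message


def toy_score(text):
    """Placeholder audit score: share of alphabetic or whitespace characters."""
    if not text:
        return 0.0
    good = sum(ch.isalpha() or ch.isspace() for ch in text)
    return good / len(text)


def accept_repair_sketch(original, candidate, *, additive=False, score=toy_score):
    """Return (accepted, score_before, score_after, reason)."""
    before, after = score(original), score(candidate)

    # 1. Trusted native text may only be augmented, never replaced.
    if before >= NATIVE_TRUST_THRESHOLD and not additive:
        return (False, before, after, "trusted-native: additive patches only")

    # 2. Hard fail (one signal only here): candidate lost over half the text.
    if len(candidate) < 0.5 * len(original):
        return (False, before, after, "hard-fail: shortened by more than 50%")

    # 3. Delta gate: replacements must strictly improve by the margin,
    #    additive patches only need to be non-decreasing.
    if additive:
        ok = after >= before
    else:
        ok = after - before > REPAIR_MARGIN
    reason = "accepted" if ok else "rejected: score did not improve"
    return (ok, before, after, reason)
```

Because every branch returns the before and after scores along with a reason, a caller can append the tuple to a per-page trace whether or not the candidate was accepted.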

…mbargoed) Documents decision trace (1.2.0), tiered provenance (1.3.0), and the monotonic repair guard under an explicit "HELD FOR PATENT FILING — DO NOT PUBLISH" header, plus the two new env vars. Local-only, like the code it describes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The flag was parsed but never forwarded — extract_json takes no use_cache arg — so back-to-back eval runs silently returned stale cached scores. Set the env toggle before importing pdfmux so --no-cache actually disables the result cache. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
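
A rough sketch of the ordering this fix relies on: the flag is translated into an environment variable first, and the library is imported only afterwards. The variable name `PDFMUX_EXAMPLE_NO_CACHE` and the helper `apply_cache_flag` are placeholders; the commit message does not name the real toggle.

```python
import argparse
import os

# Placeholder name: the real environment toggle is not given in the source.
CACHE_TOGGLE_ENV = "PDFMUX_EXAMPLE_NO_CACHE"


def apply_cache_flag(argv, env=os.environ):
    """Parse --no-cache and mirror it into the environment mapping."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--no-cache", action="store_true")
    args = parser.parse_args(argv)
    if args.no_cache:
        env[CACHE_TOGGLE_ENV] = "1"
    return args


# In a script, call apply_cache_flag(sys.argv[1:]) BEFORE `import pdfmux`,
# so that any import-time read of the environment sees the toggle.
```

The ordering matters only if the library reads the variable at import time, which is what the commit message implies.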

…ema 1.4.0)

Build #4 of the patent-enabling plan: Part A (policy-as-data) + Part B (closed-loop calibration). Strengthens claim families B/C/G and enables the task-conditioned + calibrated dependent claims.

Part A: versioned policy object (policy.py):

- One frozen `Policy` holds every extraction tunable: audit thresholds + score_page penalties/bands, OCR-budget params, Arabic + table thresholds, strict gate, and the repair margin/trust/hard-fail tolerances. Carries a `policy_id` (pdfmux-policy-v1.7), emitted in JSON output for reproducibility.
- `load_policy()` folds PDFMUX_* env overrides in at load time and suffixes the policy_id with a content hash when any value changes, so a tuned run never masquerades as canonical.
- audit.py / detect.py / pipeline.py now read from the policy; the old module constants stay as backwards-compatible aliases. Default policy reproduces the historical constants exactly, so the change is behavior-neutral (eval byte-identical).

Part B: runtime calibration loop (calibration.py + `pdfmux calibrate`):

- Stdlib isotonic (PAVA) + Platt fitters, Expected Calibration Error, and `fit_calibration` (annotates ECE before/after). The map is monotonic: a higher raw score never yields a lower probability.
- `pdfmux calibrate <labelled-dir> --method isotonic|platt --target ... --out` scores the labelled PDFs on their RAW confidence, fits the map, prints a reliability table + ECE, and writes a versioned policy file.
- Runtime reloads it (`load_policy_file`, PDFMUX_POLICY_FILE) and applies the map so `confidence` is a calibrated probability. Closed loop: calibrate → write policy → reload. No-op (identity) until a policy is written, so default behavior is unchanged. eval/calibrate.py now reports the same ECE.
- On the 50-fixture set an isotonic fit cuts in-sample ECE 0.133 → 0.000.

Patent: claim the closed loop over the audit features, never the math.

JSON schema 1.3.0 → 1.4.0 (additive: top-level `policy_id`).

Tests: +30 (test_policy.py, test_calibration.py incl. an end-to-end calibrate→reload→apply case). Full suite 734 passed / 3 skipped; ruff + format clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
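
For readers unfamiliar with the two building blocks named in Part B, here is a generic stdlib sketch of pool-adjacent-violators (PAVA) isotonic fitting and Expected Calibration Error. It is textbook code written for illustration and does not reproduce calibration.py; the function names and the 10-bin default are assumptions.

```python
def pava(values):
    """Least-squares non-decreasing fit to `values` (pool adjacent violators)."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return fitted


def fit_isotonic(scores, labels):
    """Return (sorted_scores, calibrated_probs) for 0/1 labels."""
    pairs = sorted(zip(scores, labels))
    xs = [s for s, _ in pairs]
    ys = pava([y for _, y in pairs])
    return xs, ys


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            ece += len(b) / total * abs(conf - acc)
    return ece
```

Because PAVA only ever merges neighbouring blocks into their mean, the fitted values are non-decreasing in the raw score, which is the monotonicity property the commit message relies on.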

…on_trace

Integrates the cache-key fix (from the concurrent nifty-colden session, commits 93c508e + f40b7a1) rebased onto the build-#4 stack (schema 1.4.0).

- json_fmt.py: hoist the schema version into a single `SCHEMA_VERSION` constant.
- result_cache.py: fold SCHEMA_VERSION into the cache key (filename gains an `__sv<hash>` segment) so a schema bump (1.2.0 → 1.3.0 → 1.4.0) invalidates every prior entry instead of serving a warm result that carries the old schema_version or lacks newly-added fields (decision_trace, provenance_tier, policy_id). Also preserves decision_trace on cache hits.

This closes the stale-cache footgun that masked an eval comparison earlier in this build (back-to-back runs returned cached scores across a code change).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
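
The cache-key idea can be sketched as below. Only the `__sv<hash>` segment and the `SCHEMA_VERSION` constant come from the commit message; the hash function, the 8-character truncation, the `.json` suffix, and the name `cache_filename` are assumptions made for the example.

```python
import hashlib

SCHEMA_VERSION = "1.4.0"


def cache_filename(content_hash, schema_version=SCHEMA_VERSION):
    """Build a cache filename that changes whenever the schema version does.

    The sha256 digest truncated to 8 hex characters is an assumption;
    the source only says the filename gains an `__sv<hash>` segment.
    """
    sv = hashlib.sha256(schema_version.encode("utf-8")).hexdigest()[:8]
    return f"{content_hash}__sv{sv}.json"
```

Entries written under an older schema are then never looked up again, so they fall out of use without an explicit migration or purge step.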

Re-running the Publish to PyPI workflow on an already-published version fails with 400 File already exists (seen 2026-06-15, 2026-06-16). Adding skip-existing makes re-runs idempotent: PyPI's rejections of pre-existing files are treated as a no-op instead of a hard failure. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

NameetP · 2026-06-29T08:42:23Z

Closing — this branch was inadvertently cut from a local main carrying embargoed (pre-patent-filing) commits, exposing them publicly. Deleting the branch and reopening a clean fix branched off origin/main. See replacement PR.

NameetP and others added 10 commits June 11, 2026 16:33

NameetP closed this Jun 29, 2026

NameetP deleted the fix/pypi-skip-existing branch June 29, 2026 08:42

NameetP mentioned this pull request Jun 29, 2026

Fix: skip-existing on PyPI publish (recurring 400 File already exists) #18

Open

NameetP wants to merge 10 commits into main from fix/pypi-skip-existing
