Fix: skip-existing on PyPI publish (recurring 400 File already exists) #17

Closed
NameetP wants to merge 10 commits into main from fix/pypi-skip-existing

@NameetP NameetP commented Jun 29, 2026

Owner

Problem

The Publish to PyPI workflow (.github/workflows/publish.yml) fails with 400 File already exists whenever it re-runs against a package version that is already on PyPI. Observed on 2026-06-15 and 2026-06-16.

This happens because files on PyPI are immutable: once a distribution file has been uploaded for a version, PyPI rejects any later upload with the same filename. Any re-run on the same version (a manual retry, a re-published GitHub release, or a workflow restart) therefore hits this hard failure.

Fix

Add skip-existing: true to the pypa/gh-action-pypi-publish step. With this flag, the action treats already-present files as a no-op instead of erroring, making re-runs idempotent. Net-new files still upload normally.

      - uses: pypa/gh-action-pypi-publish@release/v1
        with:
          skip-existing: true

skip-existing is a documented input of pypa/gh-action-pypi-publish (disabled by default) that covers this scenario. No source code or build logic is touched; this is a CI/CD-only change.
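
The effect of the flag on a re-run can be modelled in a few lines of Python. This is only a sketch of the behaviour described in this PR, not the action's implementation; `plan_uploads` and its arguments are hypothetical names used for illustration.

```python
def plan_uploads(dist_files, already_on_index, *, skip_existing):
    """Split built files into (to_upload, skipped) for one publish run.

    Hypothetical model: `already_on_index` is the set of filenames the
    index already holds for this version.
    """
    to_upload, skipped = [], []
    for name in dist_files:
        if name in already_on_index:
            if not skip_existing:
                # Without the flag, the first duplicate fails the whole run.
                raise RuntimeError(f"400 File already exists: {name}")
            skipped.append(name)  # duplicate is treated as a no-op
        else:
            to_upload.append(name)  # net-new files still upload
    return to_upload, skipped
```

With `skip_existing=True`, running the plan twice on the same files yields an empty upload list the second time instead of an error, which is the idempotence this change is after.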

🤖 Generated with Claude Code

NameetP and others added 10 commits June 11, 2026 16:33
…packaging

The 0.1.0 published 2026-03-30 shipped the reader as a flat top-level
module (llama_index_readers_pdfmux), so the conventional
`from llama_index.readers.pdfmux import PDFMuxReader` import failed —
the package was a broken LlamaIndex integration for 2+ months.

0.1.1 ships the correct llama_index/readers/pdfmux/ namespace package
that LlamaIndex + LlamaHub expect, conforms PDFMuxReader to BaseReader,
adds extra_info metadata merging, tightens dep pins, and adds tests
(4 passing). Published to PyPI; LlamaIndex no longer accepts monorepo
integration PRs (auto-closed), so this is maintained here + published
independently. See PUBLISH.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lama-index-readers-pdfmux

The README tagline claimed "LangChain + LlamaIndex loaders" but never
gave the package names, install commands, or usage — so a RAG dev had no
path from "it exists" to "pip install it". Both packages are live on PyPI
(langchain-pdfmux 0.2.0, llama-index-readers-pdfmux 0.1.1); this documents
the install + usage + the shared confidence-in-metadata contract that lets
users filter low-confidence chunks before indexing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds PageDecision + RepairAttempt types and threads a per-page decision record (audit class/score -> budget verdict -> cascade outcomes incl. accepted/rejected repair attempts) through _multipass_extract -> process -> JSON output under an additive 'decision_trace' key. Backward-compatible additive schema bump 1.1.0 -> 1.2.0. Build #1 of the patent enabling builds (Appendix D).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Build #3 of the patent-enabling plan (claim family F — tiered provenance).
Carries the recognition tier that produced each page's final text, plus the
sub-page geometry (region bboxes) for the one tier that has it, end to end.

- types.py: PageResult gains `provenance_tier` ("native"|"region"|"page"|"llm")
  and `regions` (the OCR'd WeakRegions). PageDecision gains `provenance_tier`
  and `region_bboxes`. Default tier is "page" — a bare PageResult under-claims
  rather than asserting spurious glyph geometry (honest tiering is the point).
- regions.py: region_ocr_page now returns the WeakRegions it recovered text
  from, so the caller can attach their bboxes as "region"-tier provenance.
- pipeline.py: _multipass_extract tags each recovered page native/region/page/
  llm and threads regions + bboxes into the PageResult and PageDecision;
  native pages carry no region geometry. Both process() reconstruction points
  (image-table OCR, Arabic bidi) preserve the new fields.
- json_fmt.py: schema 1.2.0 -> 1.3.0; decision_trace entries now carry
  provenance_tier + region_bboxes (additive).
- Also removes two latent lint issues from the un-pushed decision-trace base
  (quoted annotations, an over-long comment) so the stack is CI-green.

Tests: +5 (TestTieredProvenance, deterministic, no OCR engine needed) +
decision-trace assertions. Full suite 689 passed / 3 skipped. Behavior
unchanged on clean digital PDFs (all pages "native", no escalation).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Build #2 of the patent-enabling plan (claim family 4 — the strongest new,
narrowly-defensible island). Replaces the inconsistent accept-a-repair logic
(char-length in three pipeline paths, raw confidence in agentic) with ONE
calibrated gate, and retains every rejected candidate on the decision trace.

audit.py — new `accept_repair(original, candidate, *, additive=...)`, the §5.8
guard, returning (accepted, score_before, score_after, reason):
  1. Additive-patch-only for trusted native spans — a page whose audit score
     clears NATIVE_TRUST_THRESHOLD (0.80, env PDFMUX_NATIVE_TRUST) may only be
     augmented (region OCR), never wholesale-replaced by full OCR / LLM.
  2. Hard-fail signals (`_repair_hard_fail`) — reject regardless of score delta
     if the candidate introduces mojibake, collapses the alphabetic ratio,
     suspiciously shortens (>50%), or loses headings/tables.
  3. Calibrated audit-delta gate — full replacement must strictly beat the
     original by PDFMUX_REPAIR_MARGIN (default 0.0); additive patches need only
     be non-decreasing. Both arms are monotonic: quality never drops.
  Plus `repair_score_delta` — the single calibrated signal both paths share.
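
The three rules above can be sketched as a pure function over audit scores. This is a simplified illustration, not the code in audit.py: the real `accept_repair` takes the original and candidate pages, whereas the hypothetical `accept_repair_sketch` below takes precomputed scores and a precomputed `hard_fail` flag. It also assumes that "clears the threshold" means `>=`, and the reason strings are invented.

```python
from typing import NamedTuple

NATIVE_TRUST_THRESHOLD = 0.80  # value quoted in the commit message
REPAIR_MARGIN = 0.0            # default quoted in the commit message


class GateResult(NamedTuple):
    accepted: bool
    score_before: float
    score_after: float
    reason: str


def accept_repair_sketch(score_before, score_after, *, additive,
                         hard_fail=False, margin=REPAIR_MARGIN,
                         native_trust=NATIVE_TRUST_THRESHOLD):
    # Rule 1: a trusted native page may be augmented but never replaced.
    if score_before >= native_trust and not additive:
        return GateResult(False, score_before, score_after, "trusted-native")
    # Rule 2: hard-fail signals reject regardless of the score delta.
    if hard_fail:
        return GateResult(False, score_before, score_after, "hard-fail")
    # Rule 3: additive patches must not decrease the score; full
    # replacements must strictly beat the original by the margin.
    if additive:
        ok = score_after >= score_before
    else:
        ok = score_after > score_before + margin
    reason = "accepted" if ok else "no-improvement"
    return GateResult(ok, score_before, score_after, reason)
```

Because both arms of rule 3 reject any candidate that scores below the original, an accepted repair can never lower the page's audit score under this model.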

pipeline.py — `_multipass_extract` routes region OCR (additive), full-page OCR,
and the LLM through `accept_repair`, and records EVERY attempt (accepted AND
rejected) into the page's PageDecision. A page that stays native still carries
its rejected attempts — rejected-patch retention is the cleanest trace
differentiator and was the unclaimed white space the plan identified.

agentic.py — the `:123` raw-confidence gate now uses `accept_repair` too, so a
longer-but-worse re-extraction can no longer win on length/confidence alone.

Tests: +15 (tests/test_repair_gate.py — gate semantics, hard-fail signals, the
margin knob, and deterministic accepted/rejected integration via a monkeypatched
region-OCR). Updated test_agentic.py fixtures to realistic text so the calibrated
gate is genuinely exercised. Full suite 704 passed / 3 skipped.

No quality regression: eval/run_eval + eval/calibrate are byte-identical before
and after — strict gate holds at 0.75 / precision 1.000 / recall 0.893.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mbargoed)

Documents decision trace (1.2.0), tiered provenance (1.3.0), and the monotonic
repair guard under an explicit "HELD FOR PATENT FILING — DO NOT PUBLISH" header,
plus the two new env vars. Local-only, like the code it describes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The flag was parsed but never forwarded — extract_json takes no use_cache arg —
so back-to-back eval runs silently returned stale cached scores. Set the env
toggle before importing pdfmux so --no-cache actually disables the result cache.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ema 1.4.0)

Build #4 of the patent-enabling plan — Part A (policy-as-data) + Part B
(closed-loop calibration). Strengthens claim families B/C/G and enables the
task-conditioned + calibrated dependent claims.

Part A — versioned policy object (policy.py):
- One frozen `Policy` holds every extraction tunable: audit thresholds +
  score_page penalties/bands, OCR-budget params, Arabic + table thresholds,
  strict gate, and the repair margin/trust/hard-fail tolerances. Carries a
  `policy_id` (pdfmux-policy-v1.7), emitted in JSON output for reproducibility.
- `load_policy()` folds PDFMUX_* env overrides in at load time and suffixes the
  policy_id with a content hash when any value changes — a tuned run never
  masquerades as canonical.
- audit.py / detect.py / pipeline.py now read from the policy; the old module
  constants stay as backwards-compatible aliases. Default policy reproduces the
  historical constants exactly — behavior-neutral (eval byte-identical).
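
A minimal sketch of the "suffix the policy_id with a content hash" idea, assuming a two-field policy. `PolicySketch`, `load_policy_sketch`, the `+<digest>` suffix format, and the choice of SHA-256 are illustrative assumptions; only the env var names, the default values, and the base `policy_id` come from the commit messages.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class PolicySketch:
    native_trust: float = 0.80
    repair_margin: float = 0.0
    policy_id: str = "pdfmux-policy-v1.7"


# Env var -> field mapping (names from the commit messages).
ENV_FIELDS = {
    "PDFMUX_NATIVE_TRUST": "native_trust",
    "PDFMUX_REPAIR_MARGIN": "repair_margin",
}


def load_policy_sketch(env):
    base = PolicySketch()
    overrides = {}
    for var, field in ENV_FIELDS.items():
        if var in env:
            value = float(env[var])
            if value != getattr(base, field):
                overrides[field] = value
    if not overrides:
        return base  # canonical policy keeps the canonical id
    tuned = replace(base, **overrides)
    payload = asdict(tuned)
    payload.pop("policy_id")  # hash the tunables only, not the id itself
    blob = json.dumps(payload, sort_keys=True).encode()
    digest = hashlib.sha256(blob).hexdigest()[:8]
    return replace(tuned, policy_id=f"{base.policy_id}+{digest}")
```

Calling `load_policy_sketch(os.environ)` would fold real overrides in; an override that merely restates the default leaves the id untouched, so only a genuinely tuned run gets a distinct id.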

Part B — runtime calibration loop (calibration.py + `pdfmux calibrate`):
- Stdlib isotonic (PAVA) + Platt fitters, Expected Calibration Error, and
  `fit_calibration` (annotates ECE before/after). The map is monotonic — a
  higher raw score never yields a lower probability.
- `pdfmux calibrate <labelled-dir> --method isotonic|platt --target ... --out`
  scores the labelled PDFs on their RAW confidence, fits the map, prints a
  reliability table + ECE, and writes a versioned policy file.
- Runtime reloads it (`load_policy_file`, PDFMUX_POLICY_FILE) and applies the
  map so `confidence` is a calibrated probability. Closed loop: calibrate →
  write policy → reload. No-op (identity) until a policy is written, so default
  behavior is unchanged. eval/calibrate.py now reports the same ECE.
- On the 50-fixture set an isotonic fit cuts in-sample ECE 0.133 → 0.000.
  Patent: claim the closed loop over the audit features, never the math.
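
For readers unfamiliar with the two building blocks named in Part B, here is a stdlib-only sketch of pool-adjacent-violators (PAVA) isotonic fitting and Expected Calibration Error. It is not the code in calibration.py: the function names, the step-function lookup, the 10-bin ECE, and the lack of special handling for tied scores are all assumptions made for illustration.

```python
from bisect import bisect_right


def pava(values):
    """Non-decreasing least-squares fit to `values` (pool adjacent violators)."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return fitted


def fit_isotonic(scores, labels):
    """Return (xs, ys): sorted raw scores and their calibrated values."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]
    ys = pava([labels[i] for i in order])
    return xs, ys


def apply_map(xs, ys, score):
    """Step-function lookup: value of the last fitted point at or below score."""
    i = bisect_right(xs, score) - 1
    return ys[max(i, 0)]


def expected_calibration_error(probs, labels, bins=10):
    sums = [[0.0, 0.0] for _ in range(bins)]  # per bin: [sum of probs, sum of labels]
    for p, y in zip(probs, labels):
        b = min(int(p * bins), bins - 1)
        sums[b][0] += p
        sums[b][1] += y
    # Equals sum over bins of (n_b / N) * |mean prob - mean label|.
    return sum(abs(sp - sy) for sp, sy in sums) / len(probs)
```

Since the fitted values are non-decreasing in the sorted scores, the lookup is monotonic: a higher raw score never maps to a lower calibrated value. An in-sample ECE near zero after an isotonic fit is expected, because the fit is evaluated on the same points it was trained on.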

JSON schema 1.3.0 → 1.4.0 (additive: top-level `policy_id`). Tests: +30
(test_policy.py, test_calibration.py incl. an end-to-end calibrate→reload→apply
case). Full suite 734 passed / 3 skipped; ruff + format clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…on_trace

Integrates the cache-key fix (from the concurrent nifty-colden session,
commits 93c508e + f40b7a1) rebased onto the build-#4 stack (schema 1.4.0).

- json_fmt.py: hoist the schema version into a single `SCHEMA_VERSION` constant.
- result_cache.py: fold SCHEMA_VERSION into the cache key (filename gains an
  `__sv<hash>` segment) so a schema bump (1.2.0 → 1.3.0 → 1.4.0) invalidates
  every prior entry instead of serving a warm result that carries the old
  schema_version or lacks newly-added fields (decision_trace, provenance_tier,
  policy_id). Also preserves decision_trace on cache hits.
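
A sketch of how folding the schema version into the cache filename invalidates old entries. Only the `SCHEMA_VERSION` constant and the `__sv<hash>` segment come from the commit message; `cache_filename`, the content-hash prefix, the `.json` suffix, and the 8-character SHA-256 digest are hypothetical.

```python
import hashlib

SCHEMA_VERSION = "1.4.0"


def cache_filename(content_hash, schema_version=SCHEMA_VERSION):
    """Build a cache filename that changes whenever the schema version does."""
    sv = hashlib.sha256(schema_version.encode()).hexdigest()[:8]
    return f"{content_hash}__sv{sv}.json"
```

An entry written under schema 1.3.0 lives at a different filename than the one a 1.4.0 lookup computes, so the lookup misses and the document is re-extracted rather than served from a stale result.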

This closes the stale-cache footgun that masked an eval comparison earlier in
this build (back-to-back runs returned cached scores across a code change).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re-running the Publish to PyPI workflow on an already-published version
fails with 400 File already exists (seen 2026-06-15, 2026-06-16). Adding
skip-existing makes re-runs idempotent: PyPI rejections of pre-existing
files are treated as a no-op instead of a hard failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

NameetP commented Jun 29, 2026

Owner Author

Closing — this branch was inadvertently cut from a local main carrying embargoed (pre-patent-filing) commits, exposing them publicly. Deleting the branch and reopening a clean fix branched off origin/main. See replacement PR.
