Tags: notoriouslab/doc-cleaner

v1.2.0

feat: add DXF, PPTX, PPT, DOC parsers + security hardening (v1.2.0)

New format support: DXF (ezdxf), PPTX (python-pptx), PPT/DOC (macOS textutil).
YAML frontmatter unified to camelCase (source_path → sourcePath, BREAKING).
Security: fix YAML newline injection, add entity/zip-bomb/timeout guards.
Deduplicate textutil logic into shared parsers/_textutil.py.
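
The newline-injection fix noted above can be illustrated with a minimal sketch. It assumes frontmatter values are written as double-quoted YAML scalars; `yaml_quote` and `render_frontmatter` are hypothetical names for illustration, not the project's actual functions, and other control characters are not handled here.

```python
def yaml_quote(value: str) -> str:
    """Render value as a double-quoted YAML scalar (hypothetical helper)."""
    escaped = (
        value.replace("\\", "\\\\")  # escape backslashes first
        .replace('"', '\\"')         # then double quotes
        .replace("\r", "\\r")        # raw CR/LF would otherwise end the scalar
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def render_frontmatter(fields: dict[str, str]) -> str:
    """Build a frontmatter block with one `key: "value"` line per field."""
    lines = ["---"]
    for key, value in fields.items():
        lines.append(f"{key}: {yaml_quote(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

With this quoting, a value such as `"a.pdf\nadmin: true"` stays on its own `sourcePath` line instead of starting a new key.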

v1.1.0

security: red team fixes (glob escape, ReDoS prevention, pattern validation, ODL density check)

- HIGH: glob.escape() prevents special chars in filenames from deleting wrong dirs
- HIGH: strip_patterns uses compile+search instead of regex concat (ReDoS prevention)
- MEDIUM: validate_patterns() now covers ad_strip_patterns at startup
- MEDIUM: ODL classifier adds per-page density check (density < 20 falls back to fitz)
- MEDIUM: cutoff join uses non-capturing groups to avoid group numbering conflicts
- Simplify odl_available(), extract_text_odl cleanup, classifier ODL branch
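
The two HIGH items above might look roughly like the sketch below. The function names `find_artifacts` and `should_strip`, and the body of `validate_patterns`, are assumptions made for illustration; only `glob.escape()` and the compile-then-search approach come from the notes.

```python
from __future__ import annotations

import glob
import os
import re


def find_artifacts(base_dir: str, stem: str) -> list[str]:
    """Return paths in base_dir whose names start with the literal stem."""
    # glob.escape() neutralises *, ? and [ so a filename such as
    # "report[1]" cannot widen the match to unrelated directories.
    pattern = os.path.join(glob.escape(base_dir), glob.escape(stem) + "*")
    return sorted(glob.glob(pattern))


def validate_patterns(patterns: list[str]) -> list[re.Pattern[str]]:
    """Compile every pattern up front so bad config fails at startup."""
    compiled = []
    for raw in patterns:
        try:
            compiled.append(re.compile(raw))
        except re.error as exc:
            raise ValueError(f"invalid pattern {raw!r}: {exc}") from exc
    return compiled


def should_strip(line: str, compiled: list[re.Pattern[str]]) -> bool:
    """Search each compiled pattern separately instead of one joined regex."""
    return any(p.search(line) for p in compiled)
```

Without the escape, a stem of `report[1]` would be read as a character class and match `report1_images` rather than the intended `report[1]_images`.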

v1.0.3

fix: DOCX header heuristic, XLSX wide table truncation, code cleanup (v1.0.3)

- DOCX: fix header detection misclassifying date/amount rows (e.g. "2024-01-15",
  "1,234.56") as non-header. Now defaults to has_header=True unless ALL cells
  are empty or plain integers — safer for CJK financial tables.
- XLSX: add safety truncation for extremely wide tables where even 1 row
  exceeds the per-sheet char budget after binary search.
- Move `import time` from retry loop to file top (code smell fix).
- Remove extra blank line in pdf.py.
- Bump version to 1.0.3.

Addresses findings from expert code review (賈詡 audit).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
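
The DOCX header rule described in the v1.0.3 notes can be sketched as follows. The function name and signature are assumptions, and the treatment of "plain integer" as a digits-only string is one reading of the note, not the project's confirmed logic.

```python
import re

# A "plain integer" is taken here to mean a cell made only of digits.
_PLAIN_INT = re.compile(r"\d+")


def has_header(first_row: list[str]) -> bool:
    """Assume a header unless every cell is empty or a plain integer."""
    cells = [cell.strip() for cell in first_row]
    return not all(cell == "" or _PLAIN_INT.fullmatch(cell) for cell in cells)
```

Under this rule a first row of `["2024-01-15", "1,234.56"]` is still treated as a header, while `["1", "2", ""]` is not.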

v1.0.1

fix(security): harden P2–P4 issues from hacker audit

- P2: Ollama host whitelist — only localhost/127.0.0.1/::1 allowed (SSRF prevention)
- P3: YAML frontmatter tags escape double quotes (injection prevention)
- P3: --password CLI capped at 1024 chars
- P4: collect_files() uses os.path.realpath() + symlink escape check
- P4: DOCX textutil fallback uses TemporaryDirectory context manager (TOCTOU fix)
- Bump version to 1.0.1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
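
The P2 host whitelist and the P4 symlink escape check might be approached along these lines. This is a sketch under assumptions: `is_allowed_ollama_url` and `is_inside` are hypothetical names, and the real `collect_files()` may structure the check differently.

```python
import os
from urllib.parse import urlparse

# Hosts named in the P2 note; anything else is rejected.
ALLOWED_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_allowed_ollama_url(url: str) -> bool:
    """Accept only http(s) URLs whose host is a loopback name or address."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    # .hostname drops userinfo, port and IPv6 brackets, and lowercases.
    return parsed.hostname in ALLOWED_HOSTS


def is_inside(root: str, candidate: str) -> bool:
    """True if candidate still resolves under root after following symlinks."""
    root_real = os.path.realpath(root)
    cand_real = os.path.realpath(candidate)
    return os.path.commonpath([root_real, cand_real]) == root_real
```

Comparing the parsed hostname rather than the raw string means inputs such as `http://localhost@evil.example` or `http://127.0.0.1.evil.example` are rejected.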