Tags: notoriouslab/doc-cleaner
feat: add DXF, PPTX, PPT, DOC parsers + security hardening (v1.2.0)

New format support: DXF (ezdxf), PPTX (python-pptx), PPT/DOC (macOS textutil).
YAML frontmatter unified to camelCase (source_path → sourcePath, BREAKING).
Security: fix YAML newline injection, add entity/zip-bomb/timeout guards.
Deduplicate textutil logic into shared parsers/_textutil.py.
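To illustrate the YAML newline injection fix named in the v1.2.0 entry, here is a minimal sketch of one way to keep a frontmatter value on a single line. The helper names `yaml_quote` and `frontmatter` are hypothetical and not taken from the repository; the actual fix may work differently.

```python
def yaml_quote(value: str) -> str:
    """Render value as a single-line, double-quoted YAML scalar.

    Hypothetical helper: escapes backslashes, double quotes, and line
    breaks so the value cannot start a new frontmatter key.
    """
    escaped = (
        value.replace("\\", "\\\\")  # escape backslashes first
        .replace('"', '\\"')
        .replace("\r", "\\r")
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def frontmatter(source_path: str) -> str:
    """Build a frontmatter block using the camelCase key from v1.2.0."""
    return "---\nsourcePath: " + yaml_quote(source_path) + "\n---\n"
```

With this shape, a file name such as `"x\nevil: 1"` stays inside the quoted `sourcePath` value instead of adding an `evil` key.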
security: red team fixes - glob escape, ReDoS prevention, pattern validation, ODL density check

- HIGH: glob.escape() prevents special chars in filenames from deleting wrong dirs
- HIGH: strip_patterns uses compile+search instead of regex concat (ReDoS prevention)
- MEDIUM: validate_patterns() now covers ad_strip_patterns at startup
- MEDIUM: ODL classifier adds per-page density check (density < 20 falls back to fitz)
- MEDIUM: cutoff join uses non-capturing groups to avoid group numbering conflicts
- Simplify odl_available(), extract_text_odl cleanup, classifier ODL branch
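The first three items of the red team entry can be sketched roughly as follows. `glob.escape()` and `re.compile()` are standard library calls; the function names `find_output_dirs`, `compile_patterns`, and `matches_any`, and the `_*` directory suffix, are assumptions made for illustration and do not come from the repository. The sketch only shows the compile-then-search shape; it does not by itself make an individual pattern safe from backtracking.

```python
import glob
import os
import re
import tempfile


def find_output_dirs(base_dir: str, stem: str) -> list:
    """Find directories named '<stem>_*' under base_dir.

    glob.escape() neutralizes '*', '?' and '[' in the file-derived stem,
    so 'report[1]' cannot match an unrelated 'report1_out' directory.
    """
    pattern = os.path.join(glob.escape(base_dir), glob.escape(stem) + "_*")
    return sorted(glob.glob(pattern))


def compile_patterns(patterns: list) -> list:
    """Compile each pattern on its own and fail early on invalid ones."""
    compiled = []
    for raw in patterns:
        try:
            compiled.append(re.compile(raw))
        except re.error as exc:
            raise ValueError(f"invalid pattern {raw!r}: {exc}") from exc
    return compiled


def matches_any(line: str, compiled: list) -> bool:
    """Search patterns one at a time instead of joining them with '|'."""
    return any(rx.search(line) for rx in compiled)
```

Compiling each pattern separately also avoids the group numbering conflicts that can appear when several user patterns are concatenated into one expression.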
fix: DOCX header heuristic, XLSX wide table truncation, code cleanup (v1.0.3)

- DOCX: fix header detection misclassifying date/amount rows (e.g. "2024-01-15", "1,234.56") as non-header. Now defaults to has_header=True unless ALL cells are empty or plain integers, which is safer for CJK financial tables.
- XLSX: add safety truncation for extremely wide tables where even 1 row exceeds the per-sheet char budget after binary search.
- Move `import time` from retry loop to file top (code smell fix).
- Remove extra blank line in pdf.py.
- Bump version to 1.0.3.

Addresses findings from expert code review (賈詡 audit).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
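The DOCX header rule in the v1.0.3 entry (default to a header unless every cell is empty or a plain integer) can be written as a short predicate. This is a sketch under assumptions: the name `guess_has_header` is hypothetical, and "plain integer" is read here as an unsigned run of ASCII digits, which may differ from the project's definition.

```python
def guess_has_header(first_row: list) -> bool:
    """Return True unless ALL cells are empty or plain integers.

    Assumption: a "plain integer" is an unsigned run of ASCII digits,
    so dates ("2024-01-15") and amounts ("1,234.56") still count as
    header-like text.
    """
    cells = [cell.strip() for cell in first_row]
    return not all(
        cell == "" or (cell.isascii() and cell.isdigit()) for cell in cells
    )
```

Under this reading, a first row of `["2024-01-15", "1,234.56"]` or `["日期", "金額"]` is treated as a header, while `["1", "2", ""]` is not.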
fix(security): harden P2–P4 issues from hacker audit

- P2: Ollama host whitelist: only localhost/127.0.0.1/::1 allowed (SSRF prevention)
- P3: YAML frontmatter tags escape double quotes (injection prevention)
- P3: --password CLI capped at 1024 chars
- P4: collect_files() uses os.path.realpath() + symlink escape check
- P4: DOCX textutil fallback uses TemporaryDirectory context manager (TOCTOU fix)
- Bump version to 1.0.1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
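Two of the hardening items in the v1.0.1 entry, the Ollama host whitelist and the realpath-based symlink check, might look roughly like the sketch below. `urllib.parse.urlparse` and `os.path.realpath` are standard library calls; the names `ALLOWED_HOSTS`, `is_allowed_ollama_url`, and `is_inside` are hypothetical and the repository's `collect_files()` may implement the check differently.

```python
import os
import tempfile
from urllib.parse import urlparse

# Hosts named in the commit message; anything else is rejected.
ALLOWED_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_allowed_ollama_url(url: str) -> bool:
    """Accept only URLs whose host is a loopback name or address."""
    host = urlparse(url).hostname  # lowercased, IPv6 brackets stripped
    return host in ALLOWED_HOSTS


def is_inside(root: str, candidate: str) -> bool:
    """Return True if candidate resolves to a path under root.

    Both paths are resolved with os.path.realpath() first, so a symlink
    inside root that points elsewhere is detected as an escape.
    """
    root_real = os.path.realpath(root)
    cand_real = os.path.realpath(candidate)
    try:
        return os.path.commonpath([root_real, cand_real]) == root_real
    except ValueError:  # e.g. paths on different drives on Windows
        return False
```

A URL without a scheme, such as `localhost:11434`, yields no hostname from `urlparse` and is rejected by this sketch, which errs on the strict side.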