Skip to content

fix(parse): prevent Zip Slip path traversal in _extract_zip (CWE-22)#89

Merged
MaojiaSheng merged 1 commit intovolcengine:mainfrom
ze-mu-zhou:fix/zip-slip-path-traversal
Feb 7, 2026
Merged

fix(parse): prevent Zip Slip path traversal in _extract_zip (CWE-22)#89
MaojiaSheng merged 1 commit intovolcengine:mainfrom
ze-mu-zhou:fix/zip-slip-path-traversal

Conversation

@ze-mu-zhou
Copy link
Contributor

Description

CodeRepositoryParser._extract_zip() calls zipfile.extractall() without validating member paths. An attacker can craft a malicious zip containing entries like ../../evil.txt, /etc/passwd, C:\evil.txt, ..\..\evil.txt, \server\share\evil.txt, or symlink entries to write files outside the target directory during extraction (Zip Slip, CWE-22).

This PR replaces extractall() with per-member extraction and implements 5-layer defense-in-depth:

  1. Skip directory entries (info.is_dir() + stat.S_ISDIR) — directories are created implicitly when extracting their children
  2. Skip symlink entries (stat.S_ISLNK) — prevents symlink-based escapes to external paths
  3. Pre-extraction rejection of raw filename — raise ValueError if raw path contains .. components, is absolute, or has a drive letter
  4. Post-normalization verification — normalize path using the same algorithm as zipfile._extract_member, then verify with Path.is_relative_to()
  5. Post-extraction verification — verify the actual path written to disk, clean up and raise ValueError if it escapes

Related Issue

No existing issue. Vulnerability discovered via code audit.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Replace zip_ref.extractall(target_dir) with per-member safe extraction and 5-layer path validation (code.py +50/-2)
  • Add import stat and from pathlib import PurePosixPath
  • Add tests/misc/test_extract_zip.py (11 regression tests)
  • Update tests/README.md with missing entries for the misc/ test directory

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

11 test cases covering the following scenarios:

Category Test Case Expected Behavior
Normal extraction Multi-file zip, stem name return No regression, files extracted correctly
../ traversal ../../evil.txt raise ValueError, no file escape
Absolute path /etc/passwd raise ValueError, no file escape
Nested traversal foo/../../evil.txt raise ValueError, no file escape
Windows drive path C:\evil.txt raise ValueError, no file escape
Backslash traversal ..\..\evil.txt raise ValueError, no file escape
UNC path \server\share\evil.txt raise ValueError, no file escape
Symlink entry external_attr set to S_IFLNK Silently skipped, not extracted
.. normalization ./.. (contains .., normalizes to empty) raise ValueError (fail-fast, .. rejected before normalization)
Directory entry mydir/ Silently skipped, child files extracted normally

All attack vectors verified as successfully blocked on Windows + Python 3.13.

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

Behavioral impact: No change for legitimate zip files. extractall() is replaced with equivalent per-member extract() calls (internally, extractall is just a loop over extract). The only new behavior is raising ValueError on malicious entries.

Performance impact: Negligible. All added checks are in-memory string operations (split, replace, is_relative_to) with no additional I/O.

Fail-fast design decision: The implementation raises ValueError immediately upon detecting a malicious entry, rather than skipping it and continuing. A zip containing malicious entries is considered untrustworthy as a whole — continuing to extract remaining members carries risk. Fail-fast ensures the caller is notified and can handle the exception.

Python compatibility: The project requires Python 3.9+, and zipfile's built-in ../ sanitization was only added in Python 3.12+. On Python 3.9–3.11, extractall() performs no path traversal protection at all, making this fix necessary. Even on Python 3.12+, this fix provides independent defense covering attack vectors that zipfile does not handle (symlink entries, absolute paths, drive letter paths).

Replace unsafe zipfile.extractall() with per-member extraction
and 5-layer defense-in-depth validation.
@ze-mu-zhou ze-mu-zhou marked this pull request as ready for review February 7, 2026 05:41
@MaojiaSheng
Copy link
Collaborator

good job

@MaojiaSheng MaojiaSheng merged commit 0b6f864 into volcengine:main Feb 7, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants