-
-
Notifications
You must be signed in to change notification settings - Fork 116
Add secure zip import with folder structure preservation #765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Implements a safe, performant way to import zip files to a corpus while preserving the folder structure from the zip. Key features: Security: - Path traversal protection (rejects .., drive letters, absolute paths) - Zip bomb detection (compression ratio monitoring, size limits) - Symlink rejection - Resource limits (max files, folders, depth - all configurable) - Hidden file filtering (.DS_Store, __MACOSX, Thumbs.db, etc.) API: - GraphQL mutation: importZipToCorpus with rate limiting - Celery task: import_zip_with_folder_structure for async processing - Three-phase processing: validation, folder creation, document import Files added: - opencontractserver/constants/zip_import.py - opencontractserver/utils/zip_security.py - opencontractserver/tests/test_zip_security.py (49 tests) - opencontractserver/tests/test_zip_import_integration.py (17 tests) - docs/features/zip_import_with_folders_design.md Files modified: - opencontractserver/corpuses/folder_service.py (create_folder_structure_from_paths) - opencontractserver/tasks/import_tasks.py (import_zip_with_folder_structure) - config/graphql/mutations.py (ImportZipToCorpus)
Code Review: Secure Zip Import with Folder Structure PreservationSummaryThis PR implements a robust, security-focused zip import feature that preserves folder structure. The implementation is very well done with comprehensive security measures, excellent test coverage (66 total tests), and thorough documentation. Overall, this is high-quality code that follows the project's conventions. ✅ Strengths1. Exceptional Security Implementation
2. Excellent Code Organization
3. Comprehensive Testing
4. Good Performance Design
5. User Experience
🔍 Issues FoundCritical: SQL Injection Vulnerability via GenericScalarLocation: custom_meta = GenericScalar(
required=False,
description="Optional metadata to apply to all documents",
)Problem: The
Impact: High - User-controlled JSON stored in database and potentially rendered in frontend Recommendation:
High: Missing Permission Check in Folder CreationLocation: # Permission check
if not cls.check_corpus_write_permission(user, corpus):
return (
{},
0,
0,
"Permission denied: You do not have write access to this corpus",
)Problem: The Impact: Medium - Permission bypass for folder placement Recommendation: Add permission check for target_folder: if target_folder and not cls.check_folder_write_permission(user, target_folder):
return ({}, 0, 0, "Permission denied: No access to target folder")Medium: Race Condition in Quota CheckingLocation: Problem: The quota is checked twice - once at the start and again during processing - but both use non-atomic Impact: Medium - Users could exceed document limits through race condition Recommendation: Use Medium: Incomplete Error HandlingLocation: Problem: Individual document processing errors are caught generically but don't provide file-specific context:
Recommendation:
Low: Hidden File Detection Could Be StricterLocation: Problem: Hidden file detection relies on simple string matching: if filename.startswith("."):
return TrueThis won't catch:
Recommendation: Check all path components, not just filename: path_parts = path.split("/")
if any(part.startswith(".") for part in path_parts):
return TrueLow: Memory Concerns for Large FilesLocation: with import_zip.open(entry.original_path) as file_handle:
file_bytes = file_handle.read()Problem: Files up to 100MB are read entirely into memory. With batches of 50 files, this could theoretically consume 5GB of memory if many large files are processed. Impact: Low - Could cause memory issues on resource-constrained servers Recommendation: Consider streaming large files directly to Django's file storage or add a warning in documentation about memory requirements. 🔧 Code Quality Suggestions1. Add Type Hints to Celery Task Returndef import_zip_with_folder_structure(...) -> dict:Could be more specific: from typing import TypedDict
class ImportResult(TypedDict):
job_id: str
success: bool
completed: bool
# ... rest of fields
def import_zip_with_folder_structure(...) -> ImportResult:2. Magic Number in Compression RatioLocation: if entry.compress_size > 0 and ratio > ZIP_MAX_COMPRESSION_RATIO:The ratio calculation could overflow for very small compressed sizes. Consider adding minimum compressed size threshold. 3. Improve Transaction GranularityThe entire folder creation is in one transaction (good for atomicity) but if it fails halfway through a deep structure, there's no way to resume. Consider:
4. Missing TestsAccording to the PR description, manual testing is incomplete:
Recommend adding:
📊 Performance Considerations
🔐 Security AssessmentOverall security posture is very strong. The implementation follows OWASP best practices: ✅ Input Validation: Comprehensive path validation 📝 DocumentationThe design document (
Suggestion: Add API documentation to ✨ Best Practices Followed
🎯 Recommendations SummaryMust Fix Before Merge:
Should Fix:
Nice to Have:
Overall AssessmentThis is excellent work with thoughtful security design and comprehensive testing. The code quality is high and follows the project's conventions well. With the security issues addressed (custom_meta validation and target_folder permissions), this will be a robust addition to the platform. Recommendation: Request changes for the security issues, then approve. The implementation quality is very strong overall. Review completed by Claude Code |
When importing a zip file with documents that already exist at the same path in the corpus, the new document becomes a new version (v2, v3, etc.) rather than being skipped or overwriting the existing document. Changes: - Pass full document path to add_document() for versioning support - Track upversioned documents in results (files_upversioned, upversioned_paths) - Add 2 new upversioning tests (19 integration tests total) - Fix unused imports in test files
PR Review: Secure Zip Import with Folder Structure PreservationOverall AssessmentThis is a well-architected, security-focused feature with comprehensive test coverage. The implementation follows best practices for handling untrusted file uploads and demonstrates excellent defensive programming. The code quality is high and aligns well with the repository's standards. Strengths1. Excellent Security Design ✅
2. Robust Error Handling ✅
3. Performance Considerations ✅
4. Test Coverage ✅
5. Documentation ✅
Issues & Recommendations🔴 Critical: Accidental Binary File CommitFile: This binary SQLite database file should not be committed to the repository. It contains IPython session history and changes on every development session. Fix Required: git rm --cached .ipython/profile_default/history.sqlite
echo '.ipython/' >> .gitignore
git commit --amend🟡 Medium: Potential Race Condition in Folder CreationFile: # Check if folder with same name exists under this parent
existing = CorpusFolder.objects.filter(
corpus=corpus,
parent=parent,
name=folder_name,
).first()Issue: Between checking for folder existence and creating a new one, another process could create the same folder, leading to a database constraint violation. Recommendation: Use existing, created = CorpusFolder.objects.get_or_create(
corpus=corpus,
parent=parent,
name=folder_name,
defaults={'creator': user}
)
if not created:
reused_count += 1
logger.debug(f"Reusing existing folder by name: {path}")
else:
created_count += 1
logger.debug(f"Created new folder: {path} (id={existing.id})")
folder_map[path] = existing
existing_by_path[path] = existing🟡 Medium: Document Upversioning Logic May Need ValidationFile: The code relies on Questions:
Recommendation: Consider adding error handling around upversioning: try:
added_doc, status, doc_path = corpus_obj.add_document(
document=document,
user=user_obj,
folder=doc_folder,
path=doc_path_str,
)
except PermissionError as e:
results["files_errored"] += 1
results["errors"].append(
f"Permission error for {entry.sanitized_path}: {str(e)}"
)
continue🟢 Minor: Inconsistent Variable NamingFile: base64_img_bytes = base64_file_string.encode("utf-8")Issue: Variable named Recommendation: Rename for clarity: base64_zip_bytes = base64_file_string.encode("utf-8")
decoded_file_data = base64.decodebytes(base64_zip_bytes)🟢 Minor: Magic Number in Compression Ratio LoggingFile: if ratio > ZIP_MAX_COMPRESSION_RATIO:
logger.warning(
f"High compression ratio ({ratio:.1f}:1) for file: "
f"{sanitized_path}"
)
# We don't reject here, but log for monitoringObservation: The code logs but doesn't reject high compression ratios. The comment says "The file will be validated during streaming extraction" but I don't see streaming extraction validation in the actual document processing phase. Recommendation: Either:
🟢 Minor: Frontend Test Change Seems UnrelatedFile: This change appears to be a test fix unrelated to zip import functionality. While not problematic, it would be better in a separate PR for cleaner history. Recommendation: Consider moving to a separate PR or document why it's included (e.g., "Fixed flaky test discovered during development"). Code Quality NotesExcellent Patterns Observed:
Follows Repository Standards:✅ No magic numbers (uses constants) Security ValidationI reviewed the security measures against OWASP guidelines for file upload handling:
One consideration: The code doesn't validate that Performance AnalysisExpected Performance:
Potential Bottlenecks:
Optimization Opportunities (for future):
Test Coverage AssessmentWhat's Well Tested:
What Could Use More Tests:
Recommendations SummaryMust Fix Before Merge:
Should Fix:
Nice to Have:
ConclusionThis is high-quality, production-ready code with only one critical issue (the binary file commit). The security implementation is exemplary, and the feature design is well thought out. Once the Recommendation: Approve with requested changes Great work on this feature! The security-first approach and comprehensive testing demonstrate excellent engineering practices. |
- Remove .ipython/history.sqlite from tracking and add to .gitignore - Fix race condition in folder creation using get_or_create() - Add PermissionError handling around document upversioning - Rename base64_img_bytes to base64_zip_bytes for clarity - Update compression ratio comment to explain monitoring-only approach
PR Review: Secure Zip Import with Folder Structure PreservationSummaryThis is an excellent, production-ready implementation of a secure zip import feature. The code demonstrates strong security awareness, clean architecture, and comprehensive test coverage. The implementation follows the project's conventions and best practices as outlined in CLAUDE.md. ✅ Strengths1. Outstanding Security ImplementationThe security utilities in
Notable highlight: The compression ratio monitoring approach (lines 449-460) is excellent - it logs warnings without blocking legitimate highly-compressible files, since the total/individual size limits provide sufficient protection. 2. Clean Three-Phase Processing ArchitectureThe Celery task (
This allows for graceful partial success and comprehensive error reporting. 3. Excellent Test Coverage
4. Comprehensive DocumentationThe design document (
5. Proper CHANGELOG MaintenanceCHANGELOG.md follows Keep a Changelog format with file locations and line numbers - exactly as required by CLAUDE.md. 6. Good Permission Handling
🔍 Issues FoundCRITICAL: Race Condition in Folder Creation ✅ FIXED
UPDATE: This was fixed in commit Minor Issues
💡 Suggestions (Non-Blocking)1. Consider Adding Metrics/MonitoringFor a feature this security-critical, consider adding metrics: # Example in import_tasks.py
from django.core.cache import cache
cache.incr('zip_import.total_attempts')
cache.incr('zip_import.validation_failures')
cache.incr('zip_import.zip_bomb_detections')2. Consider Adding Admin ControlsAdd Django admin interface for:
3. Document Upversioning BehaviorThe upversioning feature (lines 812-819, 864-887) is excellent but could use more documentation:
4. Consider Progress ReportingWith potential for 1000 files, users might want progress updates:
🔒 Security AssessmentOverall: STRONG ✅ The implementation demonstrates defense-in-depth:
No security vulnerabilities identified. 📊 Code Quality Assessment
✅ Pre-Merge Checklist
🎯 RecommendationAPPROVE with minor suggestions This PR is ready to merge. The implementation is secure, well-tested, and follows all project conventions. The minor issues noted above are non-blocking and can be addressed in follow-up PRs if desired. The code quality is exceptional, particularly the security implementation and test coverage. This sets a great example for future features. Excellent work! 🚀 |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
…th-folder-structure # Conflicts: # CHANGELOG.md # frontend/src/hooks/__tests__/useAgentChat.reconnection.test.ts
… support - Add BulkImportModal component for folder-aware ZIP imports - Implement metadata file parser (CSV/JSON) for document descriptions and custom metadata - Implement relationship file parser for document-to-document relationships - Extend import_tasks.py with folder structure preservation and file parsing - Add comprehensive security tests for ZIP extraction - Add integration tests for ZIP import with metadata and relationships
Pull Request Review: Secure Zip Import with Folder Structure PreservationSummaryThis is an excellent, production-ready implementation of bulk ZIP import functionality with comprehensive security measures, thorough testing, and well-documented architecture. The PR demonstrates strong engineering practices across the stack. ✅ Strengths1. Security Implementation (Outstanding)The security model in
File: The validation-first approach (Phase 1 validates entire ZIP before extraction) is the correct security pattern. 2. Test Coverage (Excellent)
3. Architecture & Code Quality
4. Documentation (Comprehensive)
5. Error Handling
🔍 Code Quality ObservationsPositive Patterns
Frontend Quality
🐛 Potential Issues & Suggestions1. Backend Tests Not VerifiedPriority: High Recommendation: # Run security tests
docker compose -f test.yml run django pytest opencontractserver/tests/test_zip_security.py -n 4 --dist loadscope
# Run integration tests
docker compose -f test.yml run django pytest opencontractserver/tests/test_zip_import_integration.py -n 4 --dist loadscope2. Error Handling: Transaction RollbackPriority: Medium Issue: If folder creation succeeds but document processing fails partway through, folders remain created but empty. Current behavior: Expected and acceptable (folders are harmless), but should be documented. Consider: Adding a warning in results if folders were created but no documents processed successfully. 3. Memory Consideration for Large ZIPsPriority: Low Current approach: Extracts each file into memory ( Consideration: With 500MB total limit and 100MB per-file limit, max memory footprint is bounded at 100MB per document. This is acceptable, but worth monitoring in production. Note: Current implementation is safe given the enforced size limits. 4. Relationship Path Normalization Edge CasePriority: Low Observation: Impact: Relationships CSV must not include leading Verification needed: Check if this is intentional. If so, document clearly in Example: # In relationship file parser
normalized = "/contracts/doc.pdf" # Has leading /
# In zip manifest
entry.sanitized_path = "contracts/doc.pdf" # No leading /
# In import task
document_path_map[entry.sanitized_path] = doc # Key has no leading /Recommendation: Ensure consistency between path formats in 5. Missing Relationship Type HandlingPriority: Medium Code: relationship_data = {}
if rel.notes:
relationship_data["note"] = rel.notes
DocumentRelationship.objects.create(
# ...
relationship_type=rel.relationship_type, # "RELATIONSHIP" or "NOTES"
data=relationship_data if relationship_data else None,
# ...
)Question: What are valid values for Recommendation: Verify 6. Frontend Component TestingPriority: High Recommendation: Add component tests following project patterns: # Test file location
frontend/tests/components/widgets/modals/BulkImportModal.test.tsx
# Run with required flag to prevent hanging
cd frontend && yarn test:ct --reporter=list -g "BulkImportModal"Test cases:
7. Permission Check TimingPriority: Low Current: Permission check happens in mutation before async task launch. Consider: Task could theoretically run after user loses permission (race condition between check and execution). Impact: Very low probability, but worth noting for audit. Mitigation: Task could re-verify permission at start (defensive programming). 8. Usage-Capped User Quota CheckPriority: Medium Issue: Quota check happens at task start, but processing continues until quota exceeded during batching. Scenario: User has 10 documents remaining in quota but uploads ZIP with 100 documents. First 10 succeed, rest fail. Current behavior: Not clearly defined in code. Recommendation: Clarify expected behavior in docs and implement consistent handling:
🎯 Best Practices Followed
🔒 Security ReviewValidated Protections
Recommended Security Enhancements
📋 Pre-Merge Checklist
🚀 Recommended Next StepsBefore Merge
Post-Merge
📊 Overall AssessmentQuality Score: 9.5/10 This is production-ready code with excellent security design, comprehensive testing, and thorough documentation. The few issues noted are minor and should not block merge, but should be addressed for polish. Key Achievements
Minor Gaps
Recommendation: ✅ Approve with minor follow-ups The security implementation alone makes this PR exemplary. The relationship and metadata parsing features add significant value while maintaining code quality. Great work! 🙏 AcknowledgmentThis is a substantial feature addition (7739 additions) that maintains the high quality standards of the OpenContracts codebase. The attention to security, testing, and documentation is commendable. Reviewed by: Claude Code |
…tests - Unify upload split button design in FolderToolbar with cleaner CSS - Fix FolderToolbar test to use exact match for Upload button - Update DocumentRelationship permission tests to reflect inherited permission model (permissions come from documents, not individual guardian permissions on relationship) - Fix document-corpus association in tests to use DocumentPath instead of corpus.documents.add()
Pull Request Review: Secure Zip Import with Folder StructureOverviewThis is a well-architected and thoroughly implemented feature that adds secure zip import functionality with folder structure preservation. The implementation demonstrates strong security awareness, comprehensive testing, and adherence to the project's architectural patterns. Strengths1. Excellent Security Design ✅
2. Strong Code Quality ✅
3. Robust Error Handling ✅
4. Performance Considerations ✅
Issues and RecommendationsCritical Issues 🚨1. Missing Transaction Rollback in Mutation (Security/Data Integrity)File: The mutation creates a with transaction.atomic():
temporary_file = TemporaryFileHandle.objects.create()
temporary_file.file = ContentFile(decoded_file_data, name=f"zip_import_{job_id}.zip")
temporary_file.save()
# Task launch happens OUTSIDE transaction
transaction.on_commit(lambda: chain(...).apply_async())Issue: If the task fails to queue, the temporary file orphans in the database. Recommendation: Add try/except around the entire operation and ensure cleanup on failure. 2. Base64 Decoding Without Size Validation (DoS Risk)File: decoded_file_data = base64.decodebytes(base64_zip_bytes)Issue: Decodes the entire base64 string into memory before any size validation. An attacker could send a multi-GB base64 string, consuming server memory before zip validation occurs. Recommendation: Add a check on # Add before decoding
max_encoded_size = int(ZIP_MAX_TOTAL_SIZE_BYTES * 1.4) # Base64 overhead + buffer
if len(base64_file_string) > max_encoded_size:
return ImportZipToCorpus(
ok=False,
message=f"Encoded file exceeds maximum size",
job_id=None,
)High Priority Issues
|
- Add base64 size validation before decoding to prevent memory exhaustion (validates encoded string doesn't exceed ZIP_MAX_TOTAL_SIZE_BYTES * 1.4) - Replace while loops with regex in normalize_path() for O(n) performance instead of O(n*m) with malicious input containing many consecutive slashes
Code Review for PR #765 - Secure Zip Import with Folder StructureI've reviewed this pull request and found it to be well-architected, thoroughly tested, and production-ready. This is an impressive implementation of a complex feature with excellent security considerations. Below is my detailed analysis. ✅ Strengths1. Security Implementation (Outstanding)The security model is exceptional and covers all major attack vectors:
The security constants are properly externalized ( 2. Code Quality & Architecture
3. Test Coverage (Exceptional)3,130 lines of test code across 4 test files:
This demonstrates thorough validation of edge cases, security boundaries, and integration scenarios. 4. Documentation
5. Frontend Implementation
🔍 Observations & Minor Suggestions1. Base64 Size Validation (
|
Summary
./secrets/test.pdfwill appear atsecrets/test.pdfin the corpusSecurity Features
..sequences, drive letters, absolute paths.DS_Store,__MACOSX,Thumbs.db, etc.API
GraphQL Mutation:
importZipToCorpusCelery Task:
import_zip_with_folder_structureFiles Changed
New Files
opencontractserver/constants/zip_import.py- Configurable security limitsopencontractserver/utils/zip_security.py- Path sanitization and validationopencontractserver/tests/test_zip_security.py- 49 security testsopencontractserver/tests/test_zip_import_integration.py- 17 integration testsdocs/features/zip_import_with_folders_design.md- Design documentationModified Files
opencontractserver/corpuses/folder_service.py- Addedcreate_folder_structure_from_paths()opencontractserver/tasks/import_tasks.py- Added Celery taskconfig/graphql/mutations.py- Added GraphQL mutationTest plan