Skip to content

Conversation

@mschwoer
Copy link
Contributor

@mschwoer mschwoer commented Dec 19, 2025


To-do list (outside contributers only)

@claude
Copy link

claude bot commented Dec 19, 2025

Code Review - PR #106: Add output path to model for processed results

Summary

This PR adds a new output_path field to the RawFile model to store the absolute path to the output directory for processed results. The change is straightforward and follows the existing pattern in the codebase.

Code Quality ✅

Positive aspects:

  • The field definition follows the existing pattern for path fields (e.g., backup_base_path)
  • Appropriate max_length=256 is used, which is double the length of backup_base_path (128), providing adequate space for longer output paths
  • Uses default=None consistently with other optional path fields
  • Clear inline comment explaining the field's purpose
  • Placement in the model is logical (near related path fields)

Potential Issues ⚠️

1. Missing integration with update_raw_file() function

The new field is not included in shared/db/interface.py:update_raw_file() (lines 147-181). This function is the primary interface for updating RawFile entities and currently supports these fields:

  • new_status, status_details, size, file_info, backup_base_path, backup_status, s3_upload_path, instrument_file_status

Recommendation: Add output_path parameter to update_raw_file() function to maintain consistency with the codebase pattern. Without this, users will need to use direct MongoEngine update methods, which is inconsistent with how other fields are updated.

2. No usage/setter code in this PR

I don't see where output_path will be set in the codebase. Looking at airflow_src/dags/impl/processor_impl.py:prepare_quanting(), the output path is calculated (lines 96-100) but never stored back to the database:

relative_output_path = get_output_folder_rel_path(raw_file, project_id_or_fallback)
output_path = get_path(YamlKeys.Locations.OUTPUT) / relative_output_path

Recommendation: Consider including the code that actually sets this field, or at minimum document in the PR description when/where this field will be populated in a follow-up PR.

3. No test coverage

The new field is not covered by tests in shared/tests/db/test_models.py. The test file currently only tests the parse_file_info_item() function.

Recommendation: Add a test to verify:

  • The field can be set and retrieved correctly
  • The default value is None
  • The max_length constraint is enforced
  • The field accepts valid absolute paths

4. No database migration script

Unlike backup_base_path which has a migration script in airflow_src/plugins/_db_migrations/_add_file_info.py, there's no migration script for this new field.

Question: Will existing RawFile documents need their output_path populated? If so, consider adding a migration script following the pattern in _add_file_info.py.

Performance Considerations ✅

  • No performance concerns. Adding an optional string field has minimal impact.
  • The field is not indexed, which is appropriate if it won't be used for queries.

Security Concerns ✅

  • No direct security concerns with the field definition itself.
  • However, ensure that wherever this field is set, the path values are properly validated to prevent path traversal attacks (similar to the validation in processor_impl.py:_check_content()).

Architecture Considerations 💡

Question about field naming:
The field is named output_path which stores an "absolute path to the output directory". Consider whether this should be:

  1. Absolute path (as currently documented) - but this might cause issues if the file system structure changes
  2. Relative path (like backup_base_path which is actually absolute, but there's also relative path handling)

Looking at the pattern in processor_impl.py, there's both relative_output_path and output_path. Consider if you need to store both or just the relative path (which can be reconstructed into absolute path when needed).

Recommendations Summary

Must address:

  1. ⚠️ Add output_path parameter to update_raw_file() function in shared/db/interface.py for consistency

Should address:

  1. Add test coverage for the new field
  2. Include or document when/where this field will be populated
  3. Consider if a migration script is needed for existing records

Consider:

  1. Document whether absolute or relative paths should be stored and why
  2. Clarify the relationship with backup_base_path and output path calculations

Overall Assessment

This is a clean, simple change that follows existing patterns. The main gap is the missing integration with the update_raw_file() interface function. Once that's addressed and tests are added, this will be ready to merge.

Would you like me to help prepare a follow-up commit with the update_raw_file() changes and tests?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants