Skip to content

Conversation

@tanhaow
Copy link
Contributor

@tanhaow tanhaow commented Oct 17, 2025

Associated Issue(s): #250 #253 #256

Changes in this PR

  • Added a preliminary FileInput class for ALTO XML that takes in a zipfile and iterates over its members
  • Added a function to validate ALTO zipfiles and cache the discovered member filenames (to avoid rescanning)
  • Added unit tests and a sample ALTO zipfile fixture

Notes

  • Used ALTO v4 namespace (aligned with what was used in geniza)
  • Checked the Marimo Corpus Builder notebook’s upload control: it draws allowed extensions straight from the backend input classes (filetypes=FileInput.supported_types()), and because our ALTOInput subclass now declares the .zip file type, the widget already accepts zip uploads. Also tested uploading a zip in the UI, which worked. So Make Marimo Corpus Builder notebook accept zipfile input #256 can be closed.

@codecov
Copy link

codecov bot commented Oct 17, 2025

Codecov Report

❌ Patch coverage is 98.41270% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.29%. Comparing base (0871d56) to head (cbb1ad1).
⚠️ Report is 1 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #258      +/-   ##
===========================================
- Coverage    99.40%   99.29%   -0.12%     
===========================================
  Files           24       26       +2     
  Lines         1012     1135     +123     
  Branches        28       37       +9     
===========================================
+ Hits          1006     1127     +121     
- Misses           2        4       +2     
  Partials         4        4              
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tanhaow tanhaow marked this pull request as ready for review October 17, 2025 12:28
@tanhaow tanhaow requested a review from rlskoeser October 17, 2025 13:21
Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect some of this logic to change a fair bit when we integrate the xml parsing, but I think it would be helpful to merge this in as an initial implementation to build on.

I'd like to keep the validation issue open because that's one of the pieces I expect to change most. We didn't discuss specifics, but I think we can be a bit more forgiving. I would propose the following:

  • non-xml files are ignored (but logged - maybe info)
  • non-alto xml files are ignored (but log a warning)
  • error if no alto xml files are found in the zipfile

I also think we won't need to iterate over the files twice, the parsing should also handle the validation piece you're doing now - i.e., if we attempt to open a non-alto file as our alto xml object, then the object should raise an exception that we can catch and report here in the input class.

Co-authored-by: Rebecca Sutton Koeser <rlskoeser@users.noreply.github.com>
@tanhaow
Copy link
Contributor Author

tanhaow commented Oct 20, 2025

I expect some of this logic to change a fair bit when we integrate the xml parsing, but I think it would be helpful to merge this in as an initial implementation to build on.

I'd like to keep the validation issue open because that's one of the pieces I expect to change most. We didn't discuss specifics, but I think we can be a bit more forgiving. I would propose the following:

  • non-xml files are ignored (but logged - maybe info)
  • non-alto xml files are ignored (but log a warning)
  • error if no alto xml files are found in the zipfile

I also think we won't need to iterate over the files twice, the parsing should also handle the validation piece you're doing now - i.e., if we attempt to open a non-alto file as our alto xml object, then the object should raise an exception that we can catch and report here in the input class.

Thank you for suggesting this. I’ve added them as a note in the validation issue and will address them.

@tanhaow tanhaow merged commit 3d2e79e into develop Oct 20, 2025
8 checks passed
@tanhaow tanhaow deleted the feature/fileinput-class-for-alto branch October 20, 2025 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants