Skip to content

Conversation

@rlskoeser
Copy link
Contributor

@rlskoeser rlskoeser commented Oct 22, 2025

Associated Issue(s): ~

Changes in this PR

  • Try a dict lookup for pages on TEI - was hoping this would improve performance (only marginal gain)
  • Use configure_logging in the sentence corpus creation script, and add verbose mode (debug logging)

Notes

  • Haven't added / adjusted tests for the changes. Let me know what you think about the changes and whether you think they're worth testing.

Reviewer Checklist @tanhaow

  • Review code changes
  • Test create script locally and try the verbose mode
  • Recommend how to proceed (add unit tests for chnages? drop page n dictionary lookup?)

@rlskoeser rlskoeser requested a review from tanhaow October 22, 2025 13:42
@rlskoeser rlskoeser added the 👇this sprint Work scheduled for the current sprint label Oct 22, 2025
@codecov
Copy link

codecov bot commented Oct 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.72%. Comparing base (7fb4da2) to head (0f03180).
⚠️ Report is 1 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #266   +/-   ##
========================================
  Coverage    99.71%   99.72%           
========================================
  Files           26       26           
  Lines         1402     1429   +27     
  Branches        51       51           
========================================
+ Hits          1398     1425   +27     
  Misses           1        1           
  Partials         3        3           
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@tanhaow tanhaow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the change, running in Marimo took 223.2s:

Processed Das_Kapital_MEGA_A2_B005-00_ETX.xml with 655 pages in 223.2 seconds

Before the change, running with command line took 220s:

Job started at 2025-10-22 13:37:52
finished at 2025-10-22 13:41:32

After adding dictionary lookup, running with command line took 207.4s:

Processed Das_Kapital_MEGA_A2_B005-00_ETX.xml with 655 pages in 207.4 seconds

So, there's indeed some marginal gain here (speed improved 5.7%). I suggest we keep the dictionary lookup and add unit test for it.
The new logging for corpus creation with command line is very useful. Thank you for working on it! @rlskoeser

@rlskoeser
Copy link
Contributor Author

@tanhaow thank you for reviewing and testing! I've updated the tests and change log. Can you merge into develop and pull into the release branch?

@tanhaow tanhaow merged commit 8d51b3d into develop Oct 22, 2025
8 checks passed
@tanhaow tanhaow deleted the feature/improve-input-logging branch October 22, 2025 18:40
@mnaydan mnaydan removed the 👇this sprint Work scheduled for the current sprint label Oct 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants