Enable Seda 2.3 support with extensive refactoring and metadata handling #83
base: master
- fix some test failures on Windows - clean cross-platform end-of-line normalization using AssertJ standards (some expected values therefore had to be made more precise or corrected)
- add a toggle button to have labels in English
- add the ability to choose whether to extract messages, contacts and appointments - add the ability to choose whether to extract the content of these different elements and the list of them
- use new mailextractlib parameters
- Problem: Shaded JAR construction caused malfunction of some Jakarta Mail classes. - Improvement: Replaced the old JavaMail library with Jakarta Mail (compatible with JDK 11+). - Fix: Updated the POM to include Jakarta Mail data handlers by preserving mailcap and related resources.
- Problem: JAR construction of Resip caused malfunction of some Jakarta Mail and Apache Tika classes. - Fix: Updated the POM to correctly include Jakarta Mail content handlers (via mailcap and mimetypes.default) in the shaded JAR. - Also: Forced Apache Commons IO to version ≥ 2.7 to avoid class conflicts (UnsynchronizedByteArrayOutputStream) that broke attachment extraction.
- Fix: long extracted texts were sometimes cut badly, causing a Java error; change the cutting algorithm and improve speed - Fix: use of pattern in sanitizing with quote protection - Improve XMLString cleaning patterns and use them only at metadata writing - Improve String generation efficiency with StringBuilders - Also: add unit test for metadata generation
- Add mime datahandlers for multipart/signed treatment - Improve resistance to unknown mimetypes - Add tests for pgp and pkcs7 extraction
- Log level is linked with debug mode
- Fix: crash when a TNEF attachment contains attachments (change iterator list) - Add winmail.dat as an explicit attachment when problems occur during analysis
- Fix: avoid exception when Content-Transfer-Encoding is unknown (e.g., wrongly set to a charset name) - Behavior: fallback to treating the stream as '8bit' when the encoding is unrecognized
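The fallback described in this commit can be sketched in plain Java as follows. This is an illustration only, not the code of the PR: the class `TransferEncodingFallback` and its `normalize` method are hypothetical names, and the choice of `7bit` for a missing header simply follows the RFC 2045 default.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of mailextractlib: maps a declared
// Content-Transfer-Encoding to a value a decoder can handle.
final class TransferEncodingFallback {
    // The five encodings defined by RFC 2045.
    private static final Set<String> KNOWN =
            Set.of("7bit", "8bit", "binary", "base64", "quoted-printable");

    static String normalize(String declared) {
        if (declared == null) {
            return "7bit"; // RFC 2045 default when the header is absent
        }
        String value = declared.trim().toLowerCase(Locale.ROOT);
        // Unknown value (for example a charset name set by mistake):
        // treat the stream as raw 8bit data instead of throwing.
        return KNOWN.contains(value) ? value : "8bit";
    }

    public static void main(String[] args) {
        System.out.println(normalize("Base64"));
        System.out.println(normalize("iso-8859-1"));
    }
}
```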
…harsets) - Add mechanism to declare new alias names for existing charsets and to add new charset implementations. - Introduce a UTF-7 charset collection to enable proper encoding and decoding operations. - Declare MACINSTOSH equivalent to x-MacRoman, and UNKNOWN to ISO-8859-1
- For most text headers, check for the presence of remaining encoded blocks after decoding and apply a lenient RFC2047 decoder when detected.
- Expect a single address: - If multiple identical addresses are encountered, retain one without warning. - If multiple distinct addresses are encountered, retain all (comma-separated) and issue a warning.
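A minimal sketch of the rule stated in this commit, assuming addresses are compared as trimmed strings. `SingleAddressResolver` and `resolve` are hypothetical names chosen for the example; the actual comparison and logging in the PR may differ.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical helper illustrating the "expect a single address" rule.
final class SingleAddressResolver {
    private static final Logger LOG = Logger.getLogger("SingleAddressResolver");

    static String resolve(String headerName, List<String> addresses) {
        // LinkedHashSet removes identical entries and keeps the first-seen order.
        Set<String> distinct = new LinkedHashSet<>();
        for (String address : addresses) {
            if (address != null && !address.isBlank()) {
                distinct.add(address.trim());
            }
        }
        // Identical duplicates are dropped silently; distinct values are
        // all kept, with a warning.
        if (distinct.size() > 1) {
            LOG.warning(headerName + " should hold one address but holds " + distinct.size());
        }
        return String.join(", ", distinct);
    }

    public static void main(String[] args) {
        System.out.println(resolve("Sender", List.of("a@example.org", "a@example.org")));
        System.out.println(resolve("Sender", List.of("a@example.org", "b@example.org")));
    }
}
```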
- Enable lenient Base64 decoding of MIME parts.
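To illustrate what lenient Base64 decoding means, here is a sketch using only the JDK: the decoder returned by `java.util.Base64.getMimeDecoder()` skips characters outside the Base64 alphabet and does not require padding. This is a stand-in for illustration; the PR may rely on a mail library setting instead. `LenientBase64Sketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration of lenient Base64 decoding with the JDK MIME decoder.
final class LenientBase64Sketch {
    static byte[] decode(String body) {
        // The MIME decoder ignores line separators and any other character
        // that is not in the Base64 alphabet; padding is optional.
        return Base64.getMimeDecoder().decode(body);
    }

    static String decodeToText(String body) {
        return new String(decode(body), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Line break inside the encoded data.
        System.out.println(decodeToText("SGVs\r\nbG8="));
        // Stray character and missing padding.
        System.out.println(decodeToText("SGVs!bG8"));
    }
}
```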
…rallel extraction Objective: Elements (Message, Contacts, Appointments) can be processed in parallel, Folders being sequentially treated. - Enhanced data consistency and parallelism by introducing synchronized methods, thread-safe classes (e.g., `AtomicInteger` and `ConcurrentHashMap`), and final fields. - Updated logging mechanisms for safe multi-threaded operation.
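The pattern described in this commit (folders handled one after another, elements of a folder handled in parallel, shared counters kept thread-safe) could look roughly like the sketch below. All names are hypothetical and elements are reduced to a string naming their kind; the real extraction classes are not shown here.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: sequential folders, parallel elements, thread-safe counters.
final class ParallelExtractionSketch {
    private final AtomicInteger total = new AtomicInteger();
    private final Map<String, Integer> countByKind = new ConcurrentHashMap<>();

    void extractFolders(List<List<String>> folders, int threads) {
        for (List<String> folder : folders) {            // folders: one at a time
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (String kind : folder) {                 // elements: in parallel
                pool.execute(() -> {
                    countByKind.merge(kind, 1, Integer::sum); // atomic per key
                    total.incrementAndGet();
                });
            }
            pool.shutdown();
            try {
                // Wait for the whole folder before moving to the next one.
                if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                    throw new IllegalStateException("extraction timed out");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("extraction interrupted", e);
            }
        }
    }

    int total() {
        return total.get();
    }

    int count(String kind) {
        return countByKind.getOrDefault(kind, 0);
    }

    public static void main(String[] args) {
        ParallelExtractionSketch sketch = new ParallelExtractionSketch();
        sketch.extractFolders(
                List.of(List.of("message", "message", "contact"), List.of("appointment")), 4);
        System.out.println(sketch.total() + " elements, " + sketch.count("message") + " messages");
    }
}
```

Creating one pool per folder keeps the example short; a single shared pool with a per-folder completion barrier would avoid repeated thread creation.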
…il stores (MIME). - Added parallel processing for folder message extraction and listing using a thread pool - Adjusted test expectations to align with updated folder message processing.
…Path) - Fix: More robust decoding of encoded email addresses. - Improvement: Filter out invalid email addresses in addition to removing duplicates.
- Extracted the logic for formatting email addresses into a new `getFormattedAddress` - Tolerate "fake" address with only a name and no smtp address (quite common in mails with sender and recipient on the same exchange server)
- Simplified parameter declarations, replacing redundant fields with more concise names. - Enhanced command-line option parsing by adding validation, defaults, and new options (e.g., outputname, extractchoices). - Improved compatibility between CLI and GUI by unifying parameter handling and refactoring logic for readability and maintainability. - Update the global READMEs with the new options and the right GUI image. - Introduce MailExtractCmd for console-based execution while keeping MailExtract as the GUI version (options are copied to the GUI, but no action, -x, -l or -z, is allowed). This has been done in this commit to simplify debug launch with all options enabled.
- Simplified parameter declarations, replacing redundant fields with more concise names. - Enhanced command-line option parsing by adding validation, defaults, and new options (e.g., outputname, extractchoices). - Improved compatibility between CLI and GUI by unifying parameter handling and refactoring logic for readability and maintainability. - Update the global README with the new options and the right GUI image. This has been done in this commit to simplify debug launch with all options enabled.
Using Microsoft document [MS-OXOCAL]: Appointment and Meeting Object Protocol https://docs.microsoft.com/en-us/openspecs/exchange_server_protocols/ms-oxocal/09861fde-c8e4-4028-9346-e7c214cfdba1 - Complete Override flags and add override getColor - Create directly String version of Subject and Location - Normalize "datetime" (StartDateTime and EndDateTime) and Calendar to UTC Time Date - Simplify appointment date computing - Fix (operate for all timezones) and improve exceptions identification - Put appointment date computing functions in PSTAppointment.java - Clean functions - Fix test accordingly
…a default one (08/2019)
…non committed on Feb 9, 2021) - Add messageClass of AbchPerson, which appears in Outlook 2019. Add check to avoid NPE in a file. - Add getting SMTP addresses from Messages - Add generic method to get other values in the future that are not present in the current message.
… calls case (proposed merge snumr on 03/2025)
… metadata names translations
…reuse with DataObjects - Introduced `ComplexListInterface` to encapsulate shared logic between `ComplexListType` and DataObjects (Binary and Physical). - Refactored `ComplexListType` to utilize common functionalities from `ComplexListInterface`, reducing redundancy and improving maintainability. - Merged redundant internal data structures: - Unified metadata list and ordered `HashMap` structures into a single consistent implementation. - Adjusted related logic in Resip editors to accommodate these structural changes.
…aObject consistency - TODO: save all metadata in FileObject
…ata handling with ComplexListInterface - Reworked BinaryDataObject to use metadata lists and maps for better flexibility and reduced redundancy. - Fixed values in tests and added one for Seda 2.3
…adata handling with ComplexListInterface - Reworked PhysicalDataObject to use metadata lists and maps for better flexibility and reduced redundancy. - Fixed values in tests and added one for Seda 2.3
…actorize code with a parent class - Introduced `AbstractUnitaryDataObject` as a new parent class for Binary and Physical data objects, centralizing shared functionality. - Refactored `BinaryDataObject` and `PhysicalDataObject` to extend `AbstractUnitaryDataObject`, reducing redundant code and improving maintainability. - General code cleaning
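The shape of this refactoring can be pictured with a reduced sketch: an abstract parent owns the metadata list and the serialization loop, and each subclass only supplies what differs. The classes and members below are hypothetical stand-ins, not sedalib's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a shared parent of binary and physical data objects.
abstract class UnitaryDataObjectSketch {
    // Ordered (name, value) pairs shared by both kinds of data object.
    private final List<String[]> metadata = new ArrayList<>();

    // The only part each subclass has to provide in this sketch.
    abstract String elementName();

    final void addMetadata(String name, String value) {
        metadata.add(new String[] {name, value});
    }

    // Simplified serialization: no XML escaping, for illustration only.
    final String toXml() {
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(elementName()).append('>');
        for (String[] entry : metadata) {
            sb.append('<').append(entry[0]).append('>')
              .append(entry[1])
              .append("</").append(entry[0]).append('>');
        }
        sb.append("</").append(elementName()).append('>');
        return sb.toString();
    }
}

final class BinarySketch extends UnitaryDataObjectSketch {
    @Override
    String elementName() {
        return "BinaryDataObject";
    }
}

final class PhysicalSketch extends UnitaryDataObjectSketch {
    @Override
    String elementName() {
        return "PhysicalDataObject";
    }
}
```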
… it with `EnumType` - Deleted the `DescriptionLevel` class and its associated methods. - Updated references to `DescriptionLevel` in `Content` to use `EnumType` instead. - Added `DescriptionLevel` values to `EnumTypeConstants`. - Cleaned up unused imports and redundant code related to `DescriptionLevel`.
…enhance maintainability - Refactored `BinaryDataObjectEditor` and `PhysicalDataObjectEditor` to extend `AbstractUnitaryDataObjectEditor`, removing redundant methods. - Improved XML type naming for better clarity.
…n DateTimeType - Added checks for null or empty `dateString` input in constructors and setters to prevent errors, especially during serialization.
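As an illustration of the null or empty guard mentioned in this commit, the sketch below accepts a missing `dateString` in both the constructor and the setter and serializes it as an empty value. `DateTimeValueSketch` is a hypothetical class; the real `DateTimeType` supports more formats and is not reproduced here.

```java
import java.time.LocalDateTime;

// Hypothetical value holder showing null/empty-safe parsing of a date string.
final class DateTimeValueSketch {
    private LocalDateTime value; // null means "no date set"

    DateTimeValueSketch(String dateString) {
        setDateString(dateString);
    }

    void setDateString(String dateString) {
        // Guard first: parsing null or "" would otherwise throw.
        if (dateString == null || dateString.trim().isEmpty()) {
            this.value = null;
            return;
        }
        this.value = LocalDateTime.parse(dateString.trim());
    }

    boolean isEmpty() {
        return value == null;
    }

    // Serialization never dereferences a missing value.
    String toXmlValue() {
        return value == null ? "" : value.toString();
    }
}
```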
…ion behavior - Corrected condition to properly detect and handle empty DataObjectGroups. - Ensured the application focuses on the ArchiveUnit in the tree view after removal of an empty DataObjectGroup. - Minor formatting adjustments for consistency.
… evolution - Replaced direct metadata assignments with `addMetadata` and `addNewMetadata` methods for better consistency and clarity.
- Introduced `Relationship` metadata type in SEDA metadata structure, replacing `AnyXMLType` where applicable. - Updated `METADATA_MAP` in `BinaryDataObject` and `PhysicalDataObject` to reflect `Relationship` class usage. - Added translations for `Relationship` and related metadata elements in `resip`. - Enhanced tests to verify `Relationship` metadata functionality.
New Issues (13): Checkmarx found the following issues in this Pull Request
Fixed Issues (40): Great job! The following issues were fixed in this Pull Request
- Refactored `DeCompactor` and `FileObject` to standardize and simplify metadata handling using a new `setMetadataList` method. - Added support for SEDA v3 metadata structure, enabling new fields like `DataObjectUse`, `DataObjectNumber`, and `PersistentIdentifier`. - Enhanced `CompactorTest` with a new test for SEDA v3 compacting behavior using updated content verification. - Updated `.gitignore` with additional paths for excluding generated documentation files.
It might be worth reformatting the code as in Vitam, using the spotless plugin. See https://github.com/ProgrammeVitam/vitam-ui/blob/develop/pom.xml#L270
sedalib/src/main/java/fr/gouv/vitam/tools/sedalib/core/SEDA2Version.java (resolved)
sedalib/src/main/java/fr/gouv/vitam/tools/sedalib/metadata/ManagementMetadata.java (resolved)
sedalib/src/main/java/fr/gouv/vitam/tools/sedalib/metadata/ManagementMetadata.java (resolved)
sedalib/src/main/java/fr/gouv/vitam/tools/sedalib/core/BinaryDataObject.java (outdated, resolved)
sedalib/src/main/java/fr/gouv/vitam/tools/sedalib/core/AbstractUnitaryDataObject.java (resolved)
resip/src/main/java/fr/gouv/vitam/tools/resip/sedaobjecteditor/SEDAObjectEditorConstants.java (resolved)
and fix the tests
- Fix timezone-dependent test inconsistencies - Implement ISO 8601 format (yyyy-MM-dd'T'HH:mm:ss'Z') for dates - Ensure consistent test results across all platforms - Set timezone to UTC for global standardization
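A small sketch of the formatting approach named in this commit, using `java.time`. The pattern string is the one quoted above; the class name `UtcDateFormatSketch` is hypothetical, and the PR itself may use a different date API.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: formats instants in UTC so the output does not
// depend on the default timezone of the machine running the tests.
final class UtcDateFormatSketch {
    private static final DateTimeFormatter ISO_UTC =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'").withZone(ZoneOffset.UTC);

    static String format(Instant instant) {
        return ISO_UTC.format(instant);
    }

    public static void main(String[] args) {
        System.out.println(format(Instant.ofEpochSecond(0)));
    }
}
```

Because the zone is fixed on the formatter, the literal `'Z'` in the pattern is always accurate; with a formatter bound to the default zone it would not be.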
…ages - Fix timezone-dependent test inconsistencies - Set timezone to UTC for global standardization (cherry picked from commit fbbea70)
- Ensures better test reliability by ignoring unique identifiers that may vary (cherry picked from commit 3dfade9)
The test fixes were the message-extraction ones already made in the first merge. I re-applied them as cherry-picks to be safe; I hope this will not cause problems at merge time... As for code formatting, I need more explanation.
Hello @JSLair, we plan to merge your PRs as soon as possible. After analysis, I notice that the commits of PR #82 are included in this PR. How would you prefer that we proceed? Either I merge PR #82 and you then rebase this PR if there are conflicts, or I merge this PR directly. Since these are cherry-picks of part of the other PR's content, it seems more coherent to me to merge the previous PR first so as to keep the topics separate, but there is a risk of conflicts on this PR afterwards.
Summary
This pull request introduces support for Seda 2.3 across multiple components and tools, alongside significant refactoring aimed at improving maintainability, flexibility, and consistency. The changes span several areas:
- … `BinaryDataObject` and `PhysicalDataObject` to utilize the new unified `AbstractUnitaryDataObject`, centralizing shared functionalities and reducing redundancy.
- … `ComplexListInterface` for better alignment with Seda 2.3 requirements.
- … (`BinaryDataObjectEditor`, `PhysicalDataObjectEditor`) to simplify and support new metadata structures.
Key Changes
- … `AbstractUnitaryDataObject` and `ComplexListInterface`.
- … `DescriptionLevel` implementation in favor of a unified `EnumType`.
Impact
Additional Notes
To be done