Enable Seda 2.3 support with extensive refactoring and metadata handling #83

Open
JSLair wants to merge 89 commits into base: master

Conversation

JSLair
Contributor

@JSLair JSLair commented May 5, 2025

Summary

This pull request introduces support for Seda 2.3 across multiple components and tools, alongside significant refactoring aimed at improving maintainability, flexibility, and consistency. The changes span several areas:

  • Refactored BinaryDataObject and PhysicalDataObject to utilize the new unified AbstractUnitaryDataObject, centralizing shared functionalities and reducing redundancy.
  • Enhanced metadata handling via ComplexListInterface for better alignment with Seda 2.3 requirements.
  • Updated editors (BinaryDataObjectEditor, PhysicalDataObjectEditor) to simplify and support new metadata structures.
  • Introduced support for new Seda 2.3-specific metadata types and validation.

Key Changes

  • Added Seda 2.3 compatibility to high-level objects and complex list types.
  • Refactored existing types to handle dynamic metadata, consolidating lists and maps for optimized implementation.
  • Improved modularity by introducing AbstractUnitaryDataObject and ComplexListInterface.
  • Enhanced GUI components and editors for Seda 2.3 preferences and metadata translations.
  • Enabled Seda 2.3 validation across tools.
  • Removed outdated DescriptionLevel implementation in favor of a unified EnumType.

Impact

  • Improves compatibility with Seda 2.3 standard.
  • Reduces code duplication and enhances maintainability by centralizing shared logic.
  • Streamlines user experience with enhanced GUI editor functionality and Seda 2.3 metadata support.

Additional Notes

  • Tests updated to verify Seda 2.3 implementation.
  • Minor formatting and code-cleaning adjustments applied for consistency.

To be done

  • Improve Compact/Decompact experimental functions to handle all SEDA versions
  • Use either ObjectDataVersion or ObjectDataUse/ObjectDataNumber for disk export and explicit filename generation (mailextractlib, sedalib)

- Fix some test failures on Windows
- Clean up cross-platform end-of-line normalization using AssertJ standards (some expected values had to be made more precise or corrected)
- Add a toggle button to display labels in English
- Add the ability to choose whether messages, contacts and appointments are extracted
- Add the ability to choose whether the content of these different elements, and the lists of them, are extracted
- Use new mailextractlib parameters
- Problem: Shaded JAR construction caused malfunction of some Jakarta Mail classes.

- Improvement: Replaced the old JavaMail library with Jakarta Mail (compatible with JDK 11+).

- Fix: Updated the POM to include Jakarta Mail data handlers by preserving mailcap and related resources.
- Problem: JAR construction of Resip caused malfunction of some Jakarta Mail and Apache Tika classes.

- Fix: Updated the POM to correctly include Jakarta Mail content handlers (via mailcap and mimetypes.default) in the shaded JAR.

- Also: Forced Apache Commons IO to version ≥ 2.7 to avoid class conflicts (UnsynchronizedByteArrayOutputStream) that broke attachment extraction.
- Fix: long extracted texts were sometimes cut badly, generating a Java error; change the cutting algorithm and improve speed
- Fix: use of pattern in sanitizing with quote protection
- Improve XMLString cleaning patterns and use only at metadata writing
- Improve String generation efficiency with StringBuilders

- Also: add unit test for metadata generation
- Add mime datahandlers for multipart/signed treatment
- Improve resistance to unknown mimetypes
- Add tests for pgp and pkcs7 extraction
- Log level is linked with debug mode
- Fix: crash when a TNEF attachment contains attachments (change iterator list)
- Add winmail.dat as an explicit attachment when problems occur during analysis
- Fix: avoid exception when Content-Transfer-Encoding is unknown (e.g., wrongly set to a charset name)
- Behavior: fallback to treating the stream as '8bit' when the encoding is unrecognized
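
For illustration only, the fallback described above could look roughly like the sketch below. The class name, method name, and the exact set of recognised encodings are assumptions made for this example; they are not taken from mailextractlib.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not the mailextractlib implementation.
class TransferEncodingFallback {

    // Encodings defined by RFC 2045, plus common uuencode labels (assumed list).
    private static final Set<String> KNOWN = Set.of(
            "7bit", "8bit", "binary", "quoted-printable", "base64",
            "uuencode", "x-uuencode", "x-uue");

    // Returns the declared encoding in lower case when it is recognised,
    // and "8bit" otherwise (null, empty, or for example a charset name).
    static String effectiveEncoding(String declared) {
        if (declared == null) {
            return "8bit";
        }
        String normalized = declared.trim().toLowerCase(Locale.ROOT);
        return KNOWN.contains(normalized) ? normalized : "8bit";
    }

    public static void main(String[] args) {
        System.out.println(effectiveEncoding("Base64"));     // base64
        System.out.println(effectiveEncoding("iso-8859-1")); // 8bit
    }
}
```

Treating the stream as '8bit' means the bytes are passed through undecoded, which avoids throwing on a malformed header at the cost of possibly keeping content that was in fact encoded.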
…harsets)

- Add mechanism to declare new alias names for existing charsets and to add new charset implementations.
- Introduce a UTF-7 charset collection to enable proper encoding and decoding operations.
- Declare MACINSTOSH equivalent to x-MacRoman, and UNKNOWN to ISO-8859-1
- For most text headers, check for encoded blocks remaining after decoding and apply a lenient RFC 2047 decoder when any are detected.
- Expect a single address:
   - If multiple identical addresses are encountered, retain one without warning.
   - If multiple distinct addresses are encountered, retain all (comma-separated) and issue a warning.
- Enable lenient Base64 decoding of MIME parts.
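
The single-address rule listed above (identical duplicates collapse silently, distinct addresses are all kept with a warning) can be sketched as follows. This is a minimal illustration under assumptions: the names are hypothetical, "identical" is taken to mean equal after trimming, and the warning is represented by a flag rather than by a real logger call.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the mailextractlib code.
class SingleAddressRule {

    // Retained header value plus whether a warning should be issued.
    static final class Result {
        final String value;
        final boolean warn;

        Result(String value, boolean warn) {
            this.value = value;
            this.warn = warn;
        }
    }

    static Result resolve(List<String> addresses) {
        // LinkedHashSet removes identical entries while keeping first-seen order.
        Set<String> distinct = new LinkedHashSet<>();
        for (String address : addresses) {
            if (address != null && !address.trim().isEmpty()) {
                distinct.add(address.trim());
            }
        }
        // One distinct address: keep it without warning.
        // Several distinct addresses: keep all, comma-separated, and warn.
        return new Result(String.join(", ", distinct), distinct.size() > 1);
    }
}
```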
…rallel extraction

Objective: Elements (Messages, Contacts, Appointments) can be processed in parallel, while folders are processed sequentially.
- Enhanced data consistency and parallelism by introducing synchronized methods, thread-safe classes (e.g., `AtomicInteger` and `ConcurrentHashMap`), and final fields.
- Updated logging mechanisms for safe multi-threaded operation.
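
As a rough sketch of the objective above, assuming a fixed thread pool and string-encoded elements (both are choices made for this example, not details from the commit): folders are walked one at a time, the elements of each folder are submitted in parallel, and the shared counters rely on `ConcurrentHashMap` and `AtomicInteger`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; elements are modelled as "kind:id" strings for brevity.
class ParallelExtractionSketch {

    static Map<String, AtomicInteger> extract(Map<String, List<String>> folders, int threads) {
        Map<String, AtomicInteger> countsByKind = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Folders are handled one after the other.
            for (List<String> elements : folders.values()) {
                List<Future<?>> pending = new ArrayList<>();
                // Elements of the current folder are processed in parallel.
                for (String element : elements) {
                    pending.add(pool.submit(() -> {
                        String kind = element.substring(0, element.indexOf(':'));
                        countsByKind.computeIfAbsent(kind, k -> new AtomicInteger())
                                .incrementAndGet();
                    }));
                }
                // Wait for the whole folder before moving to the next one.
                for (Future<?> task : pending) {
                    task.get();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return countsByKind;
    }
}
```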
…il stores (MIME).

- Added parallel processing for folder message extraction and listing using a thread pool
- Adjusted test expectations to align with updated folder message processing.
…Path)

- Fix: More robust decoding of encoded email addresses.
- Improvement: Filter out invalid email addresses in addition to removing duplicates.
- Extracted the logic for formatting email addresses into a new `getFormattedAddress` method
- Tolerate "fake" addresses with only a name and no SMTP address (quite common in mails whose sender and recipient are on the same Exchange server)
- Simplified parameter declarations, replacing redundant fields with more concise names.
- Enhanced command-line option parsing by adding validation, defaults, and new options (e.g., outputname, extractchoices).
- Improved compatibility between CLI and GUI by unifying parameter handling and refactoring logic for readability and maintainability.
- Update the global READMEs with the new options and the correct GUI image
- Introduce MailExtractCmd for console-based execution while keeping MailExtract as the GUI version (options are copied to the GUI, but no action option (x, l or z) is allowed)

This has been done in this commit to simplify debug launch with all options enabled
- Simplified parameter declarations, replacing redundant fields with more concise names.
- Enhanced command-line option parsing by adding validation, defaults, and new options (e.g., outputname, extractchoices).
- Improved compatibility between CLI and GUI by unifying parameter handling and refactoring logic for readability and maintainability.
- Update the global README with the new options and the correct GUI image
This has been done in this commit to simplify debug launch with all options enabled
Using Microsoft document [MS-OXOCAL]: Appointment and Meeting Object Protocol
https://docs.microsoft.com/en-us/openspecs/exchange_server_protocols/ms-oxocal/09861fde-c8e4-4028-9346-e7c214cfdba1
- Complete Override flags and add override getColor
- Create directly String version of Subject and Location
- Normalize "datetime" (StartDateTime and EndDateTime) and Calendar to UTC Time Date
- Simplify appointment date computing
- Fix (now works for all time zones) and improve identification of exceptions
- Put appointment date computing functions in PSTAppointment.java
- Clean functions
- Fix test accordingly
…non commited on Feb 9, 2021)

- Add messageClass of AbchPerson which appears in outlook 2019.
Add check to avoid an NPE in a file.
- Add getting smtp addresses from Messages
- Add generic method to get other values in the future that are not present in the current message.
… calls case (proposed merge snumr on 03/2025)
JSLair added 17 commits May 1, 2025 23:46
…reuse with DataObjects

- Introduced `ComplexListInterface` to encapsulate shared logic between `ComplexListType` and DataObjects (Binary and Physical).
- Refactored `ComplexListType` to utilize common functionalities from `ComplexListInterface`, reducing redundancy and improving maintainability.
- Merged redundant internal data structures:
  - Unified metadata list and ordered `HashMap` structures into a single consistent implementation.
  - Adjusted related logic in Resip editors to accommodate these structural changes.
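
To make the idea of shared logic over a single metadata structure more concrete, here is a minimal sketch. Every name in it (`ComplexListSketch`, `declaredOrder`, `metadataList`, `firstValue`, `BinaryObjectSketch`) is hypothetical, and the element order is only an example; the actual `ComplexListInterface` in sedalib may be shaped quite differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and signatures are illustrative, not sedalib's API.
interface ComplexListSketch {

    // Element names in the order the schema expects them.
    List<String> declaredOrder();

    // The single backing list of {name, value} pairs.
    List<String[]> metadataList();

    // Inserts the pair after all entries of equal or lower schema rank,
    // so the list stays in schema order whatever the call order.
    default void addMetadata(String name, String value) {
        int rank = declaredOrder().indexOf(name);
        if (rank < 0) {
            throw new IllegalArgumentException("Unknown metadata element: " + name);
        }
        List<String[]> list = metadataList();
        int pos = 0;
        while (pos < list.size() && declaredOrder().indexOf(list.get(pos)[0]) <= rank) {
            pos++;
        }
        list.add(pos, new String[] {name, value});
    }

    // Returns the first value stored under the given name, or null if absent.
    default String firstValue(String name) {
        for (String[] entry : metadataList()) {
            if (entry[0].equals(name)) {
                return entry[1];
            }
        }
        return null;
    }
}

// A data object reuses the shared logic by exposing its order and its list.
class BinaryObjectSketch implements ComplexListSketch {
    private final List<String[]> metadata = new ArrayList<>();

    public List<String> declaredOrder() {
        return List.of("DataObjectVersion", "Uri", "MessageDigest", "Size");
    }

    public List<String[]> metadataList() {
        return metadata;
    }
}
```

With default methods, a complex list type and a data object only have to supply their own element order and backing list, which is one plausible way to get the reduction in redundancy the commit describes.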
…aObject consistency

- TODO: save all metadata in FileObject
…ata handling with ComplexListInterface

- Reworked BinaryDataObject to use metadata lists and maps for better flexibility and reduced redundancy.

- Fixed values in tests and added one for Seda 2.3
…adata handling with ComplexListInterface

- Reworked PhysicalDataObject to use metadata lists and maps for better flexibility and reduced redundancy.

- Fixed values in tests and added one for Seda 2.3
…actorize code with a parent class

- Introduced `AbstractUnitaryDataObject` as a new parent class for Binary and Physical data objects, centralizing shared functionality.
- Refactored `BinaryDataObject` and `PhysicalDataObject` to extend `AbstractUnitaryDataObject`, reducing redundant code and improving maintainability.
- General code cleaning
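
The parent-class refactoring can be pictured with the simplified sketch below. It assumes, purely for illustration, that the two subclasses differ only in their XML element name; the class and method names are invented and the real `AbstractUnitaryDataObject` certainly carries more than this.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a shared parent; not the sedalib class.
abstract class UnitaryDataObjectSketch {
    // Insertion-ordered metadata shared by both kinds of data object.
    private final Map<String, String> metadata = new LinkedHashMap<>();

    // The only part each subclass has to provide in this sketch.
    abstract String elementName();

    void addMetadata(String name, String value) {
        metadata.put(name, value);
    }

    // Shared serialization (no XML escaping, to keep the sketch short).
    String toXml() {
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(elementName()).append('>');
        metadata.forEach((name, value) ->
                sb.append('<').append(name).append('>')
                  .append(value)
                  .append("</").append(name).append('>'));
        sb.append("</").append(elementName()).append('>');
        return sb.toString();
    }
}

class BinarySketch extends UnitaryDataObjectSketch {
    String elementName() {
        return "BinaryDataObject";
    }
}

class PhysicalSketch extends UnitaryDataObjectSketch {
    String elementName() {
        return "PhysicalDataObject";
    }
}
```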
… it with `EnumType`

- Deleted the `DescriptionLevel` class and its associated methods.
- Updated references to `DescriptionLevel` in `Content` to use `EnumType` instead.
- Added `DescriptionLevel` values to `EnumTypeConstants`.
- Cleaned up unused imports and redundant code related to `DescriptionLevel`.
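
A possible shape for this replacement, sketched under assumptions: a generic enum-like type validates its value against a table of constants keyed by element name. The class name, the `ENUM_CONSTANTS` table, and the listed `DescriptionLevel` values are illustrative and should be checked against `EnumTypeConstants` and the SEDA schema actually used.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the sedalib EnumType.
class EnumTypeSketch {

    // Assumed value list for DescriptionLevel, for illustration only.
    static final Map<String, List<String>> ENUM_CONSTANTS = Map.of(
            "DescriptionLevel", List.of(
                    "Fonds", "Subfonds", "Class", "Collection", "Series", "Subseries",
                    "RecordGrp", "SubGrp", "File", "Item", "OtherLevel"));

    final String elementName;
    final String value;

    // Rejects element names without a declared list and values outside that list.
    EnumTypeSketch(String elementName, String value) {
        List<String> allowed = ENUM_CONSTANTS.get(elementName);
        if (allowed == null) {
            throw new IllegalArgumentException("No enum declared for " + elementName);
        }
        if (!allowed.contains(value)) {
            throw new IllegalArgumentException(
                    "Value " + value + " not allowed for " + elementName);
        }
        this.elementName = elementName;
        this.value = value;
    }
}
```

The point of such a design is that adding another enumerated metadata element only requires a new entry in the constants table rather than a new dedicated class.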
…enhance maintainability

- Refactored `BinaryDataObjectEditor` and `PhysicalDataObjectEditor` to extend `AbstractUnitaryDataObjectEditor`, removing redundant methods.
- Improved XML type naming for better clarity.
…n DateTimeType

- Added checks for null or empty `dateString` input in constructors and setters to prevent errors, especially during serialization.
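
A minimal sketch of such a guard, assuming the date is held as a `LocalDateTime` and that null or blank input simply means "no date" (the class name and that convention are assumptions, not the actual `DateTimeType` code):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch; not the sedalib DateTimeType.
class DateTimeValueSketch {
    private LocalDateTime value; // null means "no date set"

    DateTimeValueSketch(String dateString) {
        setDateString(dateString);
    }

    // Null or blank input is stored as "no date" instead of being parsed.
    void setDateString(String dateString) {
        if (dateString == null || dateString.trim().isEmpty()) {
            value = null;
            return;
        }
        value = LocalDateTime.parse(dateString.trim(), DateTimeFormatter.ISO_LOCAL_DATE_TIME);
    }

    // Serialization stays safe even when no date was provided.
    String toXmlValue() {
        return value == null ? "" : value.format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
    }
}
```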
…ion behavior

- Corrected condition to properly detect and handle empty DataObjectGroups.
- Ensured the application focuses on the ArchiveUnit in the tree view after removal of an empty DataObjectGroup.
- Minor formatting adjustments for consistency.
… evolution

- Replaced direct metadata assignments with `addMetadata` and `addNewMetadata` methods for better consistency and clarity.
- Introduced `Relationship` metadata type in SEDA metadata structure, replacing `AnyXMLType` where applicable.
- Updated `METADATA_MAP` in `BinaryDataObject` and `PhysicalDataObject` to reflect `Relationship` class usage.
- Added translations for `Relationship` and related metadata elements in `resip`.
- Enhanced tests to verify `Relationship` metadata functionality.
@vitam-prg

vitam-prg commented May 5, 2025

Checkmarx One – Scan Summary & Details fd51b932-8688-4288-8de8-fb0da54874af

New Issues (13)

Checkmarx found the following issues in this Pull Request

Severity Issue Source File / Package Checkmarx Insight

MEDIUM CVE-2025-46392 Maven-commons-configuration:commons-configuration-1.10
Description: Uncontrolled Resource Consumption vulnerability in Apache Commons Configuration versions 1.x. There are a number of issues in Apache Commons Confi...
Attack Vector: NETWORK
Attack Complexity: LOW
ID: 8vzMJHrBPbkMDvK2ygtuBFJzNOSg1wPP3xqx%2Fxbc5YQ%3D

MEDIUM Privacy_Violation /mailextractlib/src/main/java/fr/gouv/vitam/tools/mailextractlib/store/microsoft/pst/PstStoreContact.java: 188
Details: Method getHomeAddress at line 188 of /mailextractlib/src/main/java/fr/gouv/vitam/tools/mailextractlib/store/microsoft/pst/PstStoreContact.java sen...
ID: 3DwZM47li3hKDPUgt6UEW01AAeE%3D

MEDIUM Privacy_Violation /mailextractlib/src/main/java/fr/gouv/vitam/tools/mailextractlib/store/microsoft/pst/PstStoreContact.java: 231
Details: Method analyzeAllContactInformations at line 231 of /mailextractlib/src/main/java/fr/gouv/vitam/tools/mailextractlib/store/microsoft/pst/PstStoreCo...
ID: c2NrB3ausnyy1g%2FfO5fi%2F6qWhBI%3D

MEDIUM Privacy_Violation /javalibpst/src/main/java/fr/gouv/vitam/tools/javalibpst/PSTContact.java: 1145
Details: Method toString at line 1145 of /javalibpst/src/main/java/fr/gouv/vitam/tools/javalibpst/PSTContact.java sends user information outside the applic...
ID: DafK5S%2B2odnbk9LYGcQbsaxkGzI%3D

MEDIUM Privacy_Violation /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java: 458
Details: Method main at line 458 of /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java sends user information outside the appli...
ID: M44hwet%2BNdx%2B3xpbOldMpfNmhKQ%3D

MEDIUM Privacy_Violation /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java: 266
Details: Method main at line 266 of /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java sends user information outside the appli...
ID: vfdK0f27YvNS2sG16v60etgTTog%3D

MEDIUM Privacy_Violation /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java: 266
Details: Method main at line 266 of /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java sends user information outside the appli...
ID: HimYNvLVdsQ%2B6n6E3w9kPVrBONg%3D

MEDIUM Privacy_Violation /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java: 266
Details: Method main at line 266 of /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java sends user information outside the appli...
ID: MTnWziMenL3n0p1EAuAsAUFgBsY%3D

MEDIUM Privacy_Violation /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java: 266
Details: Method main at line 266 of /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java sends user information outside the appli...
ID: rFTg4obv2pBTkQX3XBsr4YhfbqM%3D

MEDIUM Privacy_Violation /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java: 266
Details: Method main at line 266 of /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java sends user information outside the appli...
ID: eS3yEi4cEcCvnSmbnBrxlPPd0sc%3D

MEDIUM Privacy_Violation /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractGraphicApp.java: 343
Details: Method parseParams at line 343 of /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractGraphicApp.java sends user information out...
ID: XQITt0QYJsd6EtF5MJridgJ4zk4%3D

LOW Log_Forging /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractGraphicApp.java: 347
Details: Method parseParams at line 347 of /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractGraphicApp.java gets user input from eleme...
ID: gjO47bl%2BxYPZE4I9cnh0mOUTnKk%3D

LOW Log_Forging /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractGraphicApp.java: 346
Details: Method parseParams at line 346 of /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractGraphicApp.java gets user input from eleme...
ID: v3tecuUivn%2BlPU5%2B1SmkgH2HLcI%3D

Fixed Issues (40)

Great job! The following issues were fixed in this Pull Request

Severity Issue Source File / Package
CRITICAL CVE-2022-46364 Maven-org.apache.cxf:cxf-core-3.5.3
HIGH CVE-2021-35515 Maven-org.apache.commons:commons-compress-1.19
HIGH CVE-2021-35516 Maven-org.apache.commons:commons-compress-1.19
HIGH CVE-2021-35517 Maven-org.apache.commons:commons-compress-1.19
HIGH CVE-2021-36090 Maven-org.apache.commons:commons-compress-1.19
HIGH CVE-2022-23596 Maven-com.github.junrar:junrar-4.0.0
HIGH CVE-2022-3171 Maven-com.google.protobuf:protobuf-java-3.21.5
HIGH CVE-2022-3509 Maven-com.google.protobuf:protobuf-java-3.21.5
HIGH CVE-2022-3510 Maven-com.google.protobuf:protobuf-java-3.21.5
HIGH CVE-2022-40152 Maven-com.fasterxml.woodstox:woodstox-core-6.3.1
HIGH CVE-2022-42003 Maven-com.fasterxml.jackson.core:jackson-databind-2.13.4
HIGH CVE-2022-46363 Maven-org.apache.cxf:cxf-rt-transports-http-3.5.3
HIGH CVE-2023-2976 Maven-com.google.guava:guava-31.1-jre
HIGH CVE-2024-28752 Maven-org.apache.cxf:cxf-core-3.5.3
HIGH CVE-2024-47554 Maven-commons-io:commons-io-2.11.0
HIGH CVE-2024-47554 Maven-commons-io:commons-io-2.6
HIGH CVE-2024-7254 Maven-com.google.protobuf:protobuf-java-3.21.5
MEDIUM CVE-2012-5783 Maven-commons-httpclient:commons-httpclient-3.1
MEDIUM CVE-2012-6153 Maven-commons-httpclient:commons-httpclient-3.1
MEDIUM CVE-2020-14338 Maven-xerces:xercesImpl-2.12.0
MEDIUM CVE-2021-29425 Maven-commons-io:commons-io-2.6
MEDIUM CVE-2022-23437 Maven-xerces:xercesImpl-2.12.0
MEDIUM CVE-2024-21742 Maven-org.apache.james:apache-mime4j-core-0.8.4
MEDIUM CVE-2024-25710 Maven-org.apache.commons:commons-compress-1.21
MEDIUM CVE-2024-25710 Maven-org.apache.commons:commons-compress-1.19
MEDIUM CVE-2024-26308 Maven-org.apache.commons:commons-compress-1.21
MEDIUM Improper_Restriction_of_Stored_XXE_Ref /sedalib/src/test/java/fr/gouv/vitam/tools/sedalib/metadata/DataTest.java: 23
MEDIUM Privacy_Violation /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java: 487
MEDIUM Privacy_Violation /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractGraphicApp.java: 343
MEDIUM Privacy_Violation /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java: 433
MEDIUM Privacy_Violation /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java: 433
MEDIUM Privacy_Violation /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java: 433
MEDIUM Privacy_Violation /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java: 487
LOW CVE-2020-8908 Maven-com.google.guava:guava-31.1-jre
LOW Heap_Inspection /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractMainWindow.java: 151
LOW Heap_Inspection /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractGraphicApp.java: 56
LOW Heap_Inspection /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractApp.java: 310
LOW Heap_Inspection /mailextractlib/src/main/java/fr/gouv/vitam/tools/mailextractlib/core/StoreExtractor.java: 242
LOW Log_Forging /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractGraphicApp.java: 346
LOW Log_Forging /mailextract/src/main/java/fr/gouv/vitam/tools/mailextract/MailExtractGraphicApp.java: 347

JSLair added 3 commits May 12, 2025 21:52
- Refactored `DeCompactor` and `FileObject` to standardize and simplify metadata handling using a new `setMetadataList` method.
- Added support for SEDA v3 metadata structure, enabling new fields like `DataObjectUse`, `DataObjectNumber`, and `PersistentIdentifier`.
- Enhanced `CompactorTest` with a new test for SEDA v3 compacting behavior using updated content verification.
- Updated `.gitignore` with additional paths for excluding generated documentation files.
@ebernard ebernard self-requested a review May 15, 2025 14:37

@ebernard ebernard left a comment

It might be worth reformatting the code as in Vitam, using the spotless plugin. See https://github.com/ProgrammeVitam/vitam-ui/blob/develop/pom.xml#L270

@ebernard

ebernard commented May 19, 2025

and fix the tests

JSLair added 3 commits June 3, 2025 21:09
- Fix timezone-dependent test inconsistencies
- Implement ISO 8601 format (yyyy-MM-dd'T'HH:mm:ss'Z') for dates
- Ensure consistent test results across all platforms
- Set timezone to UTC for global standardization
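
As an illustration of the date format named above, the following sketch formats an `Instant` with the pattern `yyyy-MM-dd'T'HH:mm:ss'Z'` pinned to UTC, so the output does not depend on the machine's default time zone. The class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch; not the project's formatting code.
class UtcDateFormatSketch {

    // The literal 'Z' is only correct because the formatter is fixed to UTC.
    private static final DateTimeFormatter ISO_UTC =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'").withZone(ZoneOffset.UTC);

    static String format(Instant instant) {
        return ISO_UTC.format(instant);
    }

    public static void main(String[] args) {
        System.out.println(format(Instant.EPOCH)); // 1970-01-01T00:00:00Z
    }
}
```

Because the zone is attached to the formatter rather than taken from the JVM default, a test comparing the resulting string should give the same result on any platform.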
…ages

- Fix timezone-dependent test inconsistencies
- Set timezone to UTC for global standardization

(cherry picked from commit fbbea70)
- Ensures better test reliability by ignoring unique identifiers that may vary

(cherry picked from commit 3dfade9)
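
One common way to make tests ignore identifiers that vary between runs is to mask them before comparing. The sketch below assumes the identifiers are UUID-shaped, which the commit message does not state; the class name, method name and `<ID>` placeholder are all hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the real tests may ignore identifiers differently.
class IdMaskingSketch {

    // Matches the canonical 8-4-4-4-12 hexadecimal UUID layout (assumed shape).
    private static final Pattern UUID_PATTERN = Pattern.compile(
            "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");

    // Replaces every UUID-shaped token with a fixed placeholder.
    static String maskIds(String text) {
        return UUID_PATTERN.matcher(text).replaceAll("<ID>");
    }
}
```

Applying `maskIds` to both the expected and the actual text before the assertion keeps the comparison stable while still checking everything else.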
@JSLair
Contributor Author

JSLair commented Jun 3, 2025

The test fixes were those for message extraction, already made in the first merge. I have re-applied them as cherry-picks to be safe; I hope this will not cause problems at merge time...

As for code formatting, I need more explanation.

@GiooDev
Contributor

GiooDev commented Jun 12, 2025

Hello @JSLair, we plan to merge your PRs as soon as possible.

After analysis, I notice that the commits from PR #82 are included in this PR.

How would you prefer we proceed? Either I merge PR #82 and you then rebase this PR if there are conflicts, or I merge this PR directly?

Since these are cherry-picks of part of the other PR's content, it seems more consistent to me to merge the previous PR first in order to keep the topics better separated, but there is a risk of conflicts on this PR afterwards.

5 participants