@jsochava jsochava commented Oct 27, 2025

Closes #14085

This PR introduces a complete, deterministic pipeline for harvesting contextual summaries from a citing paper’s Related Work section, matching them to BibEntries, and optionally summarizing them using the new LangChain-based AI interface.

What and why

  1. RelatedWorkAnnotator.java
  • Appends contextual summaries from a citing paper’s “Related Work” section into a target BibEntry.
  • Uses JabRef’s comment-<username> field convention (resolved as UserSpecificCommentField).
  1. HeuristicRelatedWorkExtractor.java
  • Deterministic parser that locates author–year citations within Related Work text.
  • Extracts descriptive snippets surrounding each citation.
  • Matches each citation to an existing BibEntry by first author surname + year.
  • Implemented without AI dependencies; designed for reliability and transparent logic.
  1. RelatedWorkHarvester.java
  • High-level orchestrator that connects the extractor and annotator:
  • Accepts PDF-extracted or plain text input.
  • Calls the extractor to identify citation–context pairs.
  • Invokes RelatedWorkAnnotator.appendSummaryToEntry(...) for each match.
  1. RelatedWorkSectionLocator.java
  • Deterministically isolates the “Related Work” / “Literature Review” / “Prior Work” / “Background and Related Work” section from a paper’s full plain text.
  • Recognizes numeric and textual headers.
  • Captures content until the next top-level header, ignoring figure/table captions and unrelated content.
  1. RelatedWorkPipeline.java
  • Introduces a convenience façade that chains the full extraction process:
    - SectionLocator → HeuristicRelatedWorkExtractor → RelatedWorkAnnotator
  • Enables single-call usage for clients that have the full plain text and a list of candidate BibEntry objects.
  1. PdfTextProvider.java
  • A tiny SPI: Path → Optional plain-text extraction.
  • Keeps all PDF specifics behind a seam; facilitates unit testing with fakes/mocks and avoids hard dependencies in core logic.
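
To illustrate the seam described for PdfTextProvider, here is a minimal sketch of how such an SPI and a test fake could look. The method name `extractPlainText` and the class `FakePdfTextProvider` are assumptions made for illustration; they are not taken from the PR's code.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical shape of the SPI; the method name is an assumption. */
interface PdfTextProvider {
    Optional<String> extractPlainText(Path pdf) throws IOException;
}

/** Test double that returns canned text instead of parsing a real PDF. */
final class FakePdfTextProvider implements PdfTextProvider {
    private final String cannedText;

    FakePdfTextProvider(String cannedText) {
        this.cannedText = cannedText;
    }

    @Override
    public Optional<String> extractPlainText(Path pdf) {
        // Null or blank text is reported as "nothing extracted".
        return Optional.ofNullable(cannedText).filter(text -> !text.trim().isEmpty());
    }
}

final class PdfTextProviderDemo {
    public static void main(String[] args) throws IOException {
        PdfTextProvider provider = new FakePdfTextProvider("2 Related Work\nSmith (2020) proposed ...");
        // Core logic only sees the interface, never a PDF library.
        System.out.println(provider.extractPlainText(Path.of("paper.pdf")).isPresent());
    }
}
```

Because callers depend only on the interface, unit tests can pass a fake like this one and never touch a PDF library.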

  1. PdfRelatedWorkTextExtractor.java
  • Adapter: PDF → plain text (via PdfTextProvider) → “Related Work” block (via RelatedWorkSectionLocator).
  • Returns Optional with the body of the section (header stripped), or empty if not found/blank.
  • Validates inputs and surfaces IO errors; does not depend on any PDF library directly.
  1. RelatedWorkPdfPipeline.java
  • End-to-end façade for callers that have a PDF and candidate entries:
  • PdfRelatedWorkTextExtractor → HeuristicRelatedWorkExtractor → RelatedWorkAnnotator.
  • Returns the number of annotations appended across matched entries.
  1. RelatedWorkEvaluationRunner.java
  • Deterministic evaluator comparing extracted (citation → snippet) pairs against gold fixtures.
  • Computes precision, recall, F1, and coverage statistics.
  • Supports in-memory or JSON-based fixture definitions.
  1. RelatedWorkMetrics.java
  • Immutable results object summarizing global and per-entry metrics.
  • Includes pretty-printed summaries and detailed statistics for debugging.
  1. RelatedWorkFixture.java
  • Simple model for “gold” fixture data: includes related-work text and expected (author, year, snippet) expectations.
  • Supports direct in-memory creation or loading from a JSON file.
  1. HeuristicExtractorAdapter.java
  • Bridge layer converting the HeuristicRelatedWorkExtractor output (Map<String, String>) into the Map<BibEntry, List> format expected by the evaluation runner.
  • Keeps the original extractor untouched.
  1. RelatedWorkSummarizer.java / NoOpRelatedWorkSummarizer.java
  • SPI interface for optional snippet summarization.
  • Default no-op implementation returns an empty result, so the harvester keeps individual snippets unchanged.
  • Future AI integrations (e.g., LangChain4j or OpenAI) can implement this interface.
  1. CitationResolver.java / NoOpCitationResolver.java
  • SPI interface for resolving missing citations by key or author–year, optionally creating new entries.
  • Default implementation performs a simple local lookup and never creates entries.
  1. RelatedWorkPluginConfig.java + RelatedWorkPluginsFactory.java
  • Lightweight configuration object with feature flags:
    - enableSummarization
    - enableResolution
  • Provides builder methods to safely compose plugin pipelines.
  • Used by RelatedWorkHarvester to inject summarizer and resolver instances.
  • The default build yields a no-op config, which preserves the previous behavior.
  1. LangChainRelatedWorkSummarizer.java
  • Merges multiple snippets for a citation.
  • Produces a concise, readable contextual summary.
  • Exposes controls for temperature, max tokens, and model provider.
  1. RelatedWorkAiModule.java
  • Provides construction & wiring for:
    • LangChainRelatedWorkSummarizer
    • No-op fallback when AI is disabled
    • Integration with JabRef’s existing AI preferences & providers
  • Also updates plugin registry logic so the harvester can seamlessly receive AI components.
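
For reviewers who want a feel for the deterministic part, the author–year matching described for HeuristicRelatedWorkExtractor can be sketched roughly as follows. This is a simplified illustration, not the PR's implementation: the regular expressions, the `Candidate` stand-in for BibEntry, and the choice of "the citing sentence" as the snippet are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AuthorYearCitationSketch {

    /** Hypothetical stand-in for a BibEntry: citation key, first-author surname, year. */
    static final class Candidate {
        final String key;
        final String surname;
        final String year;

        Candidate(String key, String surname, String year) {
            this.key = key;
            this.surname = surname;
            this.year = year;
        }
    }

    // Matches "Smith (2020)", "Smith et al. (2020)", "Jones and Lee (2019)", "Jones & Lee (2019a)".
    private static final Pattern CITATION = Pattern.compile(
            "\\b([A-Z][A-Za-z'-]+)(?:\\s+et\\s+al\\.?|\\s+(?:and|&)\\s+[A-Z][A-Za-z'-]+)?\\s*\\((\\d{4})[a-z]?\\)");

    // Break after sentence punctuation only when a capitalized word follows,
    // so "et al. (2020)" stays inside one sentence.
    private static final Pattern SENTENCE_BREAK = Pattern.compile("(?<=[.!?])\\s+(?=[A-Z])");

    /** Returns citation key -> citing sentence(s), in order of first appearance. */
    static Map<String, String> extract(String relatedWorkText, List<Candidate> candidates) {
        Map<String, String> snippetsByKey = new LinkedHashMap<>();
        for (String sentence : SENTENCE_BREAK.split(relatedWorkText.trim())) {
            Matcher matcher = CITATION.matcher(sentence);
            while (matcher.find()) {
                String surname = matcher.group(1);
                String year = matcher.group(2);
                for (Candidate candidate : candidates) {
                    if (candidate.surname.equalsIgnoreCase(surname) && candidate.year.equals(year)) {
                        String existing = snippetsByKey.get(candidate.key);
                        if (existing == null) {
                            snippetsByKey.put(candidate.key, sentence);
                        } else if (!existing.endsWith(sentence)) {
                            // Further sentences about the same entry are appended once.
                            snippetsByKey.put(candidate.key, existing + " " + sentence);
                        }
                    }
                }
            }
        }
        return snippetsByKey;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
                new Candidate("Smith2020", "Smith", "2020"),
                new Candidate("Jones2019", "Jones", "2019"));
        String text = "Smith et al. (2020) proposed a graph-based ranker. "
                + "Jones and Lee (2019) studied citation context.";
        extract(text, candidates).forEach((key, snippet) -> System.out.println(key + " -> " + snippet));
    }
}
```

Matching on first-author surname plus year keeps the logic transparent, at the cost of the ambiguities listed under next steps (for example, two candidate entries sharing the same surname and year would both receive the snippet in this sketch).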

Next steps

  1. Production PDF Integration
  • Connect this pipeline to JabRef’s existing PDF full-text extraction tools.
  1. Improved citation resolution
  • Handle edge cases like ambiguous years or multiple first authors.
  1. Expanded AI summarization capabilities
  • Provide multiple style options (concise vs detailed)
  • Use the model selection from JabRef preferences

Steps to test

  1. Run the unit tests for the new features only:
    ./gradlew :jablib:test --tests "org.jabref.logic.importer.RelatedWorkAnnotatorTest"
    ./gradlew :jablib:test --tests "org.jabref.logic.importer.relatedwork.HeuristicRelatedWorkExtractorTest"
    ./gradlew :jablib:test --tests "org.jabref.logic.importer.relatedwork.RelatedWorkHarvesterTest"
    ./gradlew :jablib:test --tests "org.jabref.logic.importer.relatedwork.RelatedWorkSectionLocatorTest"
    ./gradlew :jablib:test --tests "org.jabref.logic.importer.relatedwork.PdfRelatedWorkTextExtractorTest"
    ./gradlew :jablib:test --tests "org.jabref.logic.importer.relatedwork.RelatedWorkMetricsTest"
    ./gradlew :jablib:test --tests "org.jabref.logic.importer.relatedwork.LangChainRelatedWorkSummarizerTest"

Mandatory checks

  • I own the copyright of the code submitted and I license it under the MIT license
  • I manually tested my changes in running JabRef (always required)
  • I added JUnit tests for changes (if applicable)
  • [/] I added screenshots in the PR description (if change is visible to the user)
  • [/] I described the change in CHANGELOG.md in a way that is understandable for the average user (if change is visible to the user)
  • [/] I checked the user documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request updating file(s) in https://github.com/JabRef/user-documentation/tree/main/en.

@github-actions
Contributor

Hey @jsochava!

Thank you for contributing to JabRef! Your help is truly appreciated ❤️.

We have automatic checks in place, based on which you will soon get automated feedback if any of them are failing. We also use TragBot with custom rules that scans your changes and provides some preliminary comments, before a maintainer takes a look. TragBot is still learning, and may not always be accurate. In the "Files changed" tab, you can go through its comments and just click on "Resolve conversation" if you are sure that it is incorrect, or comment on the conversation if you are doubtful.

Please re-check our contribution guide in case of any other doubts related to our contribution workflow.

@jsochava jsochava force-pushed the feature/related-work-annotator branch from 21d4bac to 711c3a9 Compare October 30, 2025 00:47
Member

@koppor koppor left a comment

Please check your IDE config. It seems you reformatted too much with the wrong style, which makes it hard to give feedback on the content.

@github-actions github-actions bot added the status: changes-required Pull requests that are not yet complete label Nov 15, 2025
@jsochava
Author

@koppor when I used the IDE config described in the project's setup instructions, I consistently failed the automatic format checks. Is that expected?

@github-actions github-actions bot removed the status: changes-required Pull requests that are not yet complete label Nov 16, 2025
Comment on lines 39 to 40
<option name="ENABLE_JAVADOC_FORMATTING" value="false" />
<option name="JD_ALIGN_PARAM_COMMENTS" value="false" />
<option name="JD_ALIGN_EXCEPTION_COMMENTS" value="false" />
Member

I think you changed this by accident... Please revert.

@koppor
Member

koppor commented Nov 17, 2025

> @koppor when I used the IDE config described in the project's setup instructions, I consistently failed the automatic format checks. Is that expected?

The configs changed; I don't know why. See https://github.com/JabRef/jabref/pull/14187/files#r2535333190

@jsochava jsochava requested a review from koppor November 19, 2025 04:02
@jsochava jsochava marked this pull request as ready for review November 19, 2025 04:04
@koppor
Member

koppor commented Nov 19, 2025

@jsochava Please either install the browser plugin "refined github" to click on "Discard changes" on the 250+ files - or use some git magic to revert the change - force push is ok...

  1. git fetch upstream
  2. git merge upstream/main
  3. git reset upstream/main
  4. now use a git tool (such as git gui) to commit your changes
  5. git commit
  6. git push -f

Don't forget to note down your commit id before the rewrite - in case something goes wrong, you can reset...


Reason: We cannot let in 300+ wrongly formatted files. We are still humans changing the code, not fully AI.

Example of bad reformats:

Uploading grafik.png…

@koppor koppor added the status: changes-required Pull requests that are not yet complete label Nov 19, 2025
Removing all accidentally changed files and only committing related work files
@jsochava jsochava force-pushed the feature/related-work-annotator branch from f2fc2d1 to d333a38 Compare November 22, 2025 00:45
@github-actions github-actions bot removed the status: changes-required Pull requests that are not yet complete label Nov 22, 2025
@koppor koppor added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Nov 24, 2025

Labels

first contrib status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers

Development

Successfully merging this pull request may close these issues.

Extract text about papers from "related work" sections

2 participants