Introduce Web-scraping inside JabRef #11093

Open
@koppor

Currently, our web search sends search strings to API endpoints and then interprets the results. In other words, we have two kinds of fetchers: those that use an API (with an API key) and those that rely on screen scraping. Most of the screen scrapers don't work. We should switch to browser-based screen scraping, mostly because of Cloudflare.

JabRef should display the HTML page inside JabRef and offer to scrape the citations directly from the page, similar to what BibDesk does.

[Two screenshots attached]

Maybe the Java Chromium Embedded Framework (JCEF) helps. The test class https://github.com/chromiumembedded/java-cef/blob/master/java/tests/detailed/handler/RequestHandler.java seems to show how it can be used.


PR #7075 attempted to display the Google Scholar captchas in JabRef, but it was not completed. This issue asks to rewrite the fetchers so that they use JCEF instead of URLDownload.

Note that this is different from #11093. There, a new UI is requested.

Here, the fetchers should be able to run stand-alone, without user interaction.
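
One way to keep the fetchers runnable without user interaction is to let them depend on a small page-loading abstraction, so that a URLDownload-backed and a JCEF-backed implementation can be swapped. The following is only a minimal sketch under that assumption: `PageLoader`, `PdfLinkScraper`, and `ScraperSketch` are hypothetical names that do not exist in JabRef, no JCEF API is called, and the regular expression stands in for real HTML parsing.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical abstraction (not part of JabRef): returns the HTML of a page.
 * One implementation could wrap the existing URLDownload, another could
 * drive an off-screen JCEF browser and return the rendered DOM.
 */
interface PageLoader {
    String loadHtml(String url);
}

/** Hypothetical scraper that depends only on PageLoader, not on a concrete HTTP client. */
class PdfLinkScraper {
    // Deliberately simplistic pattern for illustration; a real fetcher would use an HTML parser.
    private static final Pattern PDF_LINK = Pattern.compile("href=\"([^\"]+\\.pdf)\"");

    private final PageLoader loader;

    PdfLinkScraper(PageLoader loader) {
        this.loader = loader;
    }

    /** Returns the first link ending in ".pdf" found on the landing page, if any. */
    Optional<String> findFullText(String landingPageUrl) {
        String html = loader.loadHtml(landingPageUrl);
        Matcher matcher = PDF_LINK.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}

class ScraperSketch {
    public static void main(String[] args) {
        // Stub loader standing in for a browser-backed implementation.
        PageLoader stub = url -> "<a href=\"https://example.org/paper.pdf\">PDF</a>";
        PdfLinkScraper scraper = new PdfLinkScraper(stub);
        System.out.println(scraper.findFullText("https://example.org/article"));
    }
}
```

With such a split, the scraping logic can be tested against canned HTML, and only the loader needs to know whether a plain HTTP request or an embedded browser produced the page.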


Affected fetchers:

  • ACS: org.jabref.logic.importer.fetcher.ACS
  • Google Scholar: org.jabref.logic.importer.fetcher.GoogleScholar
  • IACR: org.jabref.logic.importer.fetcher.IacrEprintFetcher
  • JSTOR: org.jabref.logic.importer.fetcher.JstorFetcher
  • ResearchGate: org.jabref.logic.importer.fetcher.ResearchGate
  • ScienceDirect: org.jabref.logic.importer.fetcher.ScienceDirect
  • SpringerLink: org.jabref.logic.importer.fetcher.SpringerLink

Some of these fetchers use an API. In that case, findFullText is the only method that handles HTML.
