Test Scholia queries on other SPARQL endpoints #2063

Open
Daniel-Mietchen opened this issue Jul 21, 2022 · 6 comments
Labels
enhancement some suggestions to improve Scholia

Comments

@Daniel-Mietchen
Member

Daniel-Mietchen commented Jul 21, 2022

Is your feature request related to a problem? Please describe.

  • Scholia uses the Wikidata Query Service to run SPARQL queries over the Wikidata corpus.
  • The Wikidata Query Service uses Blazegraph as the backend for providing the SPARQL endpoint.
  • Blazegraph is not designed for graphs much larger than about 100 million items, which is roughly the current size of Wikidata.
  • An evaluation of Blazegraph alternatives for Wikidata is ongoing, with no clear timeline towards a solution.

Describe the solution you'd like

I'd like us to explore running Scholia on other SPARQL endpoints, Blazegraph or otherwise. We have done some of this in the past, but not in a way that would scale across all Scholia queries.

Describe alternatives you've considered

A relatively straightforward approach might be to build a workflow based on running Scholia via the SPARQL endpoint (default: Blazegraph again) of a dedicated Wikibase instance that holds a copy of a recent Wikidata dump. There could even be several such Wikibases, each serving a specific subset (e.g. per Scholia aspect).

Additional context

Other options would be to start exploring non-Blazegraph endpoints, e.g. https://wikidata.demo.openlinksw.com/sparql (running on Virtuoso) or https://qlever.cs.uni-freiburg.de/wikidata/ (running on QLever).
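
A minimal sketch of how such a comparison across endpoints could be scripted, assuming each endpoint speaks the standard SPARQL protocol over HTTP GET and returns SPARQL JSON results. The names `ENDPOINTS`, `http_fetch` and `compare_endpoints` are illustrative and not part of Scholia. The Virtuoso URL is the one mentioned above, the WDQS URL is the public Wikidata Query Service, and the QLever link above points to a web UI, so its API URL is left out here.

```python
import json
import time
import urllib.parse
import urllib.request

# Illustrative endpoint table; extend it with further endpoints as needed.
ENDPOINTS = {
    "wdqs": "https://query.wikidata.org/sparql",
    "virtuoso": "https://wikidata.demo.openlinksw.com/sparql",
}


def http_fetch(url, query, timeout=60):
    """Send a SPARQL query via HTTP GET and return the parsed JSON results."""
    request = urllib.request.Request(
        url + "?" + urllib.parse.urlencode({"query": query}),
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "scholia-endpoint-test-sketch/0.1",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def compare_endpoints(query, endpoints=ENDPOINTS, fetch=http_fetch):
    """Run one query on every endpoint; record row count, time and errors."""
    report = {}
    for name, url in endpoints.items():
        start = time.perf_counter()
        try:
            rows = len(fetch(url, query)["results"]["bindings"])
            error = None
        except Exception as exc:  # timeouts, HTTP errors, parse failures
            rows, error = None, f"{type(exc).__name__}: {exc}"
        report[name] = {
            "rows": rows,
            "seconds": round(time.perf_counter() - start, 2),
            "error": error,
        }
    return report
```

Passing `fetch` as a parameter keeps the comparison logic testable without network access, and would let a different client be swapped in for endpoints that need special handling.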

@Daniel-Mietchen Daniel-Mietchen added the enhancement some suggestions to improve Scholia label Jul 21, 2022
@Daniel-Mietchen
Member Author

I just created a simplified version of one of our queries, country_authors.sparql:

SELECT
  ?author
  (COUNT(DISTINCT ?citing_work) AS ?number_of_citing_works)
  (SAMPLE(?organization_) AS ?organization)
  (SAMPLE(?work) AS ?example_work)
WHERE {
  ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:Q35 .
  ?work wdt:P50 ?author .
  OPTIONAL { ?citing_work wdt:P2860 ?work . }
  OPTIONAL {
    ?author wdt:P1416 | wdt:P108 ?organization_ .
    ?organization_ wdt:P17 wd:Q35 .
  }
}
GROUP BY ?author

It times out on Wikidata, fails on QLever and executes on that Virtuoso instance.
[Screenshots, taken 2022-07-22: Wikidata Query Service; The QLever SPARQL engine; a third, untitled screenshot]
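
One portability detail worth checking when moving a query like the one above between endpoints: it uses the `wd:` and `wdt:` prefixes without declaring them. The Wikidata Query Service predefines these, but other engines may require explicit `PREFIX` lines. The helper below is a sketch under that assumption; `add_missing_prefixes` is a hypothetical name, and whether missing prefixes explain any of the failures reported here is not established.

```python
import re

# Standard Wikidata namespaces; WDQS predefines these, other engines may not.
WIKIDATA_PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}


def add_missing_prefixes(query, prefixes=WIKIDATA_PREFIXES):
    """Prepend PREFIX lines for known prefixes that are used but not declared."""
    declared = set(re.findall(r"(?im)^\s*PREFIX\s+(\w+):", query))
    lines = [
        f"PREFIX {name}: <{iri}>"
        for name, iri in prefixes.items()
        if name not in declared and re.search(rf"\b{name}:", query)
    ]
    return "\n".join(lines + [query])
```

Only prefixes that actually occur in the query are added, so an already self-contained query is returned unchanged.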

@WolfgangFahl
Collaborator

The query runs successfully on some of our endpoints:

date;sparqlquery -qn authorsCitingWork -en blazegraph -f github;date
  • blazegraph 2018 instance: 13 secs for ~786 results (Fr 22. Jul 13:41:13 CEST 2022 - Fr 22. Jul 13:41:26 CEST 2022)
  • jena 2020 instance: time still pending for ~10117 results (started Fr 22. Jul 13:39:32 CEST 2022; still running via command line, will report later)
  • stardog 2022 instance: 108 secs for ~14266 results (Fr 22. Jul 13:37:10 CEST 2022 - Fr 22. Jul 13:39:02 CEST 2022)
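
As an alternative to bracketing the command with `date`, the elapsed time could be measured directly. The sketch below assumes nothing about the `sparqlquery` tool beyond the invocation quoted above; `timed_run` is a hypothetical helper, and the command actually executed in the example is a harmless stand-in.

```python
import subprocess
import sys
import time


def timed_run(command):
    """Run a command and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in command; replace it with the real invocation, e.g.
    # ["sparqlquery", "-qn", "authorsCitingWork", "-en", "blazegraph", "-f", "github"]
    code, seconds = timed_run([sys.executable, "-c", "print('ok')"])
    print(f"exit code {code}, {seconds:.2f} s")
```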

@WolfgangFahl
Collaborator

see ad-freiburg/qlever#859

@egonw
Collaborator

egonw commented Mar 10, 2023

Virtuoso-on-AWS: https://wikidata.demo.openlinksw.com/sparql

(Does not support the Wikidata Blazegraph functions.)
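
To estimate which Scholia queries would be affected by this, one could scan the query files for WDQS/Blazegraph extensions before sending them to another engine. The sketch below is a rough heuristic: the pattern list is illustrative and not exhaustive, `blazegraph_features` is a hypothetical name, and a plain regex scan can report false positives, for example inside SPARQL comments.

```python
import re

# Patterns for WDQS/Blazegraph extensions that are not part of standard
# SPARQL 1.1. Illustrative only; extend as further features turn up.
BLAZEGRAPH_FEATURES = {
    "label service": r"SERVICE\s+wikibase:label",
    "MediaWiki API service": r"SERVICE\s+wikibase:mwapi",
    "service parameters": r"\bbd:serviceParam\b",
    "query hints": r"\bhint:\w+",
    "graph analytics service": r"SERVICE\s+gas:service",
}


def blazegraph_features(query):
    """Return the names of Blazegraph-specific features found in a query."""
    return [
        name
        for name, pattern in BLAZEGRAPH_FEATURES.items()
        if re.search(pattern, query, flags=re.IGNORECASE)
    ]
```

A query for which this returns an empty list uses none of the listed extensions, which makes it a reasonable first candidate for testing on a non-Blazegraph endpoint.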

@WolfgangFahl
Collaborator

WolfgangFahl commented Mar 10, 2023

https://ceur-ws.org/Vol-3262/paper9.pdf and https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData have a list of candidates. I also intend to talk to the Wikidata team at the next meeting and would love to have a proper Blazegraph mirror running on our RWTH Aachen i5 machine (http://wikidata.dbis.rwth-aachen.de/), which should be suitable for the task with 256 GB RAM and 10 TB SSD. In the 6 years that I have been attempting to get my own copy of Wikidata running, I have never managed to set up a proper Blazegraph mirror endpoint with all the necessary special services.

@egonw
Collaborator

egonw commented Mar 10, 2023

Oh, you're in Aachen?
