Skip to content

SQlite data source for RDataFrame #2322

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 33 commits into from
Aug 30, 2018

Conversation

jblomer
Copy link
Contributor

@jblomer jblomer commented Jul 16, 2018

This PR adds a new data source for RDataFrame that is able to provide data from SQlite SELECT queries. It will be useful for cvmfs, where we have file catalogs and monitoring information in sqlite files. For instance, one can do

auto rdf = ROOT::RDF::MakeSqliteDataFrame("catalog.sqlite", "select * from catalog");
auto h = rdf.Define("lname", "name.length()").Histo1D("lname");

to show the distribution of file name sizes. There are probably more use cases, for instance sqlite export of conditions data.

This is work in progress, I'm posting it for early comments and feedback.

My open points are

  • Unit and integration tests are yet to be done
  • The data source work single-threaded only at the moment. I initially thought it might be enough to return only a single row in GetEntryRanges() to make it thread-safe but that's apparently not enough. So I'm now thinking to wrap SetEntry() and GetEntryRanges() in a mutex.
  • The code is not reusing TSQLiteServer. It felt like it requires stretching the TSQLServer interface in perhaps unwanted ways, for saving only a handful of direct sqlite API calls. On the other hand, with (probably quite a bit) more work it might be possible to have a more general data source that works on any SQL result set.
  • Determining column types in SQlite is tricky as it is dynamically typed and in principle each row can have different column types. If a table column is queried as is, I can use the default/declared column type. For expressions, I'll use the type of the first row of the result set. Still it can result in a column to be of type NULL where subsequent rows actually have meaningful values. My guess is that the current heuristic is practical enough, and of course the user can formulate the SELECT query to avoid ambiguity.
  • It would not be impossible to add support for remote reading of sqlite files. To do so, one can add a custom implementation of an sqlite virtual file system to have data pouring in directly from HTTP or XRootD. This is perhaps something to keep in mind.

@jblomer jblomer changed the title WIP: SQlite data source for RDataFrame SQlite data source for RDataFrame Jul 20, 2018
@jblomer
Copy link
Contributor Author

jblomer commented Jul 20, 2018

@dpiparo @bluehood From my side this is ready for review. I think it makes sense to have remote reading of files in a possible follow-up PR. Please let me know if you think it makes sense to add an integration test to roottest. In this case, how would I deal with the fact that this datasource depends on the sqlite availability?

@eguiraud
Copy link
Contributor

hi, alright i'm back at cern on monday, i'll review asap (and let you know about the roottest, but in first approximation you can follow what rarrowds does).

@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@root-project root-project deleted a comment from phsft-bot Jul 24, 2018
@eguiraud
Copy link
Contributor

hi, cleaned up obsolete jenkins messages, will start a new build and start reviewing today

@eguiraud
Copy link
Contributor

@phsft-bot build please

@phsft-bot
Copy link

Starting build on slc6/gcc48, slc6-i686/gcc49, centos7/clang39, centos7/gcc62, centos7/gcc7, fedora28/native, ubuntu16/native, mac1013/native, windows10/vc15 with flags -Dimt=ON -Dccache=ON
How to customize builds

@jblomer
Copy link
Contributor Author

jblomer commented Aug 16, 2018

It should be rebased now. Is there a standard ROOT method for printing warnings?

@phsft-bot
Copy link

Starting build on slc6/gcc48, slc6-i686/gcc49, centos7/clang39, centos7/gcc62, centos7/gcc7, fedora28/native, ubuntu16/native, mac1013/native, windows10/vc15 with flags -Dccache=ON
How to customize builds

@phsft-bot
Copy link

Build failed on fedora28/native.
See console output.

Errors:

  • ninja: error: loading 'build.ninja': No such file or directory

@phsft-bot
Copy link

Build failed on windows10/vc15.
See console output.

Errors:

  • MSBUILD : error MSB1009: Project file does not exist.

@phsft-bot
Copy link

Build failed on centos7/gcc62.
See console output.

Failing tests:

@pcanal
Copy link
Member

pcanal commented Aug 16, 2018

Yes, TObject::Warning or ::Warning.

@phsft-bot
Copy link

Starting build on slc6/gcc48, slc6-i686/gcc49, centos7/clang39, centos7/gcc62, centos7/gcc7, fedora28/native, ubuntu16/native, mac1013/native, windows10/vc15 with flags -Dccache=ON
How to customize builds

@phsft-bot
Copy link

Build failed on windows10/vc15.
See console output.

Errors:

  • MSBUILD : error MSB1009: Project file does not exist.

@phsft-bot
Copy link

Build failed on fedora28/native.
See console output.

Errors:

  • ninja: error: loading 'build.ninja': No such file or directory

@eguiraud
Copy link
Contributor

@jblomer @dpiparo is this ready to go in?

@jblomer
Copy link
Contributor Author

jblomer commented Aug 27, 2018

@bluehood Yes from my side

@eguiraud
Copy link
Contributor

@phsft-bot build

@phsft-bot
Copy link

Starting build on slc6/gcc48, slc6-i686/gcc49, centos7/clang39, centos7/gcc62, centos7/gcc7, fedora28/native, ubuntu16/native, mac1013/native, windows10/vc15 with flags -Dccache=ON
How to customize builds

@phsft-bot
Copy link

Build failed on fedora28/native.
See console output.

Errors:

  • ninja: error: loading 'build.ninja': No such file or directory

@phsft-bot
Copy link

Build failed on windows10/vc15.
See console output.

Errors:

  • MSBUILD : error MSB1009: Project file does not exist.

@eguiraud
Copy link
Contributor

@jblomer @dpiparo both build failures are due to sqlite not being present on the machine.
@amadio suggests that the datasource be made optional in tree/dataframe/CMakeLists.txt, so it's not built when sqlite is not available.

@jblomer
Copy link
Contributor Author

jblomer commented Aug 29, 2018

@bluehood @amadio @dpiparo The CMakeLists.txt file should already turn off this data source when sqlite is not available. Am I missing something?

@@ -22,11 +22,16 @@ if(NOT ARROW_FOUND)
list(REMOVE_ITEM sources ${CMAKE_CURRENT_SOURCE_DIR}/src/RArrowDS.cxx)
endif()

if(NOT sqlite)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may need to use ${ROOT_sqlite_FOUND} or equivalent here and elsewhere.

@phsft-bot
Copy link

Starting build on slc6/gcc48, slc6-i686/gcc49, centos7/clang39, centos7/gcc62, centos7/gcc7, fedora28/native, ubuntu16/native, mac1013/native, windows10/vc15 with flags -Dccache=ON
How to customize builds

@phsft-bot
Copy link

Build failed on fedora28/native.
See console output.

Errors:

  • ninja: error: loading 'build.ninja': No such file or directory

@phsft-bot
Copy link

Starting build on slc6/gcc48, slc6-i686/gcc49, centos7/clang39, centos7/gcc62, centos7/gcc7, fedora28/native, ubuntu16/native, mac1013/native, windows10/vc15 with flags -Dccache=ON
How to customize builds

@phsft-bot
Copy link

Starting build on slc6/gcc48, slc6-i686/gcc49, centos7/clang39, centos7/gcc62, centos7/gcc7, fedora28/native, ubuntu16/native, mac1013/native, windows10/vc15 with flags -Dccache=ON
How to customize builds

@eguiraud eguiraud merged commit 4c45a6d into root-project:master Aug 30, 2018
@eguiraud
Copy link
Contributor

thank you @jblomer for being so responsive to comments!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants