Options for real-time storage of calculated features #322
-
Thanks a lot for initiating this discussion @toni-neurosc. I was running some tests to check which option might be best. For that I simulated a feature dictionary with 100 columns and checked how long a single save (as done after feature computation in the run() method) would take. Among the methods I tested, SQLite came in at about 0.0001 seconds. Actually, most of them have similar speed; only .csv and HDF5 were slower. Msgpack seems to be an interesting choice since it saves directly to binary output, but pickle or numpy would both be good options as well.

Code for reproduction:
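(The original snippet is not preserved in this thread view; the following is a minimal sketch of such a benchmark, assuming a simulated 100-column feature dictionary and the msgpack, pandas, and PyTables packages. File names and the exact set of formats are placeholders, not actual py_neuromodulation paths.)

```python
import json
import pickle
import time

import msgpack  # pip install msgpack
import numpy as np
import pandas as pd

# Simulated feature dictionary with 100 columns, as described above
features = {f"feature_{i}": float(np.random.random()) for i in range(100)}


def save_pickle():
    with open("features.p", "wb") as f:
        pickle.dump(features, f)


def save_numpy():
    np.save("features.npy", np.array(list(features.values())))


def save_json():
    with open("features.json", "w") as f:
        json.dump(features, f)


def save_msgpack():
    with open("features.msgpack", "wb") as f:
        f.write(msgpack.packb(features))


def save_csv():
    pd.DataFrame([features]).to_csv("features.csv")


def save_hdf5():
    # requires the optional 'tables' (PyTables) dependency
    pd.DataFrame([features]).to_hdf("features.h5", key="features", mode="w")


# Time a single save per format, analogous to one run() iteration
for label, save in [
    ("pickle", save_pickle),
    ("numpy", save_numpy),
    ("json", save_json),
    ("msgpack", save_msgpack),
    ("csv", save_csv),
    ("hdf5", save_hdf5),
]:
    t_start = time.perf_counter()
    save()
    print(f"{label}: {time.perf_counter() - t_start:.6f} seconds")
```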
And for the SQLite setup and time estimation:
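(Again a hedged sketch rather than the original snippet: a plain sqlite3 table with one REAL column per feature and a timed single-row insert; table and file names are assumptions.)

```python
import sqlite3
import time

import numpy as np

# Same simulated 100-column feature dictionary as above
features = {f"feature_{i}": float(np.random.random()) for i in range(100)}

con = sqlite3.connect("features.db")
cur = con.cursor()

# One REAL column per feature; quoting the names keeps arbitrary feature names valid
columns = ", ".join(f'"{name}" REAL' for name in features)
cur.execute(f"CREATE TABLE IF NOT EXISTS features ({columns})")

# Time a single-row insert, analogous to one save per run() iteration
placeholders = ", ".join("?" for _ in features)
t_start = time.perf_counter()
cur.execute(f"INSERT INTO features VALUES ({placeholders})", tuple(features.values()))
con.commit()
print(f"sqlite: {time.perf_counter() - t_start:.6f} seconds")

con.close()
```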
Tbh I have not used SQLite before, so I was not sure about the best way to set it up.
-
Hi Timon, sorry I haven't answered in this thread for so long; I actually haven't tested your code yet. But your results for SQLite look really good, and SQLite is typically very easy to use, so I fully agree with this direction. DuckDB might be more performant in some scenarios, but with such impressive speeds SQLite is probably very far from being a bottleneck for us.

One thing I want to add, though, is that I think we should keep the option to read and write CSVs. That is, SQLite could be the default, but storing results in CSV should still be supported as a text-based way to store data; that way we have SQLite for binary and CSV for text. I don't know that supporting other formats is really necessary, but supporting formats that allow trivial appending (i.e. incremental saving) of data should not be difficult if eventually needed. This rules out Pickle, Feather and Parquet. Msgpack and JSON are trivially appendable, and .npy can be appended too, but it's a bit more complicated: https://pypi.org/project/npy-append-array/
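To illustrate what "trivial appending" looks like with msgpack, here is a small sketch (assuming rows arrive one at a time as plain dicts; the file name and row contents are placeholders): each row is packed and appended to the same file, then streamed back with Unpacker.

```python
import msgpack  # pip install msgpack
import numpy as np


def append_row(path: str, row: dict) -> None:
    # append-binary mode: nothing previously written gets re-read or rewritten
    with open(path, "ab") as f:
        f.write(msgpack.packb(row))


for _ in range(3):
    row = {f"feature_{i}": float(np.random.random()) for i in range(100)}
    append_row("features_stream.msgpack", row)

# Read everything back by streaming over the concatenated msgpack objects
with open("features_stream.msgpack", "rb") as f:
    rows = list(msgpack.Unpacker(f, raw=False))
print(f"{len(rows)} rows recovered")
```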
-
For the purpose of supporting real-time online feature calculation during long experiments or offline processing of big datasets, we need a way to store data efficiently. Here are some options that already came up during the weekly meeting:
Text based: your good old CSV. Bad: awful space efficiency. Good: super-convenient for anyone to work with afterwards, since you can easily look at it or read it anywhere (even MS Excel for that matter).
Binary encoding: classic HDF5 format, widely used in science, supported in a variety of languages, so very cross-compatible and safe for long-term storage. Open-source too: https://github.com/HDFGroup/hdf5
Specialized time-series DBs: the advantage of using a database is that it's much easier to query if specific parts of the data need to be retrieved. They might also provide other benefits; for example, the InfluxDB website talks about percentile calculation, which is a big bottleneck in some of the feature calculations in PyNM. Also, with some DBs there is the option to hold the database entirely in memory, so we could potentially use one in place of Pandas.
Because of the need to run a local DB server, though, the database option seems impractical to me, except for SQLite. At any rate, I think ideally we should give the user the option to store the results in a variety of data formats, for example:
- CSV
- Feather
- Pickle
- HDF5
- Parquet
- SQLite
We're already using PyArrow, so barring HDF5 we can already support all of these. The only question would be which ones support appending without re-loading the dataset into memory. I think only CSV and SQLite (perhaps HDF5?) support that; for the rest, the data would have to be divided into chunks.
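For illustration, a hedged sketch of what per-batch appending could look like for the two formats that clearly support it (SQLite via pandas to_sql with if_exists="append", CSV via append mode); the table and file names are placeholders, not actual py_neuromodulation output paths:

```python
import os
import sqlite3

import numpy as np
import pandas as pd


def append_batch(batch: pd.DataFrame, con: sqlite3.Connection, csv_path: str) -> None:
    # SQLite: rows are inserted in place, previously written rows are never re-loaded
    batch.to_sql("features", con, if_exists="append", index=False)
    # CSV: append mode, writing the header only when the file is first created
    batch.to_csv(csv_path, mode="a", header=not os.path.exists(csv_path), index=False)


con = sqlite3.connect("features.db")
for _ in range(3):
    batch = pd.DataFrame([{f"feature_{i}": np.random.random() for i in range(100)}])
    append_batch(batch, con, "features.csv")
con.close()
```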
Thoughts?