Options for real-time storage of calculated features #322
-
Thanks a lot for initiating this discussion @toni-neurosc. I was running some tests to check which option might be best. For that I simulated a feature dictionary with 100 columns and checked how long a single save (as done after feature computation in the run() method) would take. Among the methods I tested, SQLite came in at about 0.0001 seconds. Actually, most of them have similar speed; only .csv and HDF5 were slower. Msgpack seems to be an interesting choice since it saves directly to binary output, but pickle or numpy would both be good options as well.

Code for reproduction:
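(The original snippet is not preserved in this thread view; the following is a minimal sketch of such a benchmark, assuming a simulated 100-column feature dictionary and the msgpack, pandas, and PyTables packages. File names and the exact set of formats are placeholders, not actual py_neuromodulation paths.)

```python
import json
import pickle
import time

import msgpack  # pip install msgpack
import numpy as np
import pandas as pd

# Simulated feature dictionary with 100 columns, as described above
features = {f"feature_{i}": float(np.random.random()) for i in range(100)}


def save_pickle():
    with open("features.p", "wb") as f:
        pickle.dump(features, f)


def save_numpy():
    np.save("features.npy", np.array(list(features.values())))


def save_json():
    with open("features.json", "w") as f:
        json.dump(features, f)


def save_msgpack():
    with open("features.msgpack", "wb") as f:
        f.write(msgpack.packb(features))


def save_csv():
    pd.DataFrame([features]).to_csv("features.csv")


def save_hdf5():
    # requires the optional 'tables' (PyTables) dependency
    pd.DataFrame([features]).to_hdf("features.h5", key="features", mode="w")


# Time a single save per format, analogous to one run() iteration
for label, save in [
    ("pickle", save_pickle),
    ("numpy", save_numpy),
    ("json", save_json),
    ("msgpack", save_msgpack),
    ("csv", save_csv),
    ("hdf5", save_hdf5),
]:
    t_start = time.perf_counter()
    save()
    print(f"{label}: {time.perf_counter() - t_start:.6f} seconds")
```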
And for the SQLite setup and time estimation:
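(Again a hedged sketch rather than the original snippet: a plain sqlite3 table with one REAL column per feature and a timed single-row insert; table and file names are assumptions.)

```python
import sqlite3
import time

import numpy as np

# Same simulated 100-column feature dictionary as above
features = {f"feature_{i}": float(np.random.random()) for i in range(100)}

con = sqlite3.connect("features.db")
cur = con.cursor()

# One REAL column per feature; quoting the names keeps arbitrary feature names valid
columns = ", ".join(f'"{name}" REAL' for name in features)
cur.execute(f"CREATE TABLE IF NOT EXISTS features ({columns})")

# Time a single-row insert, analogous to one save per run() iteration
placeholders = ", ".join("?" for _ in features)
t_start = time.perf_counter()
cur.execute(f"INSERT INTO features VALUES ({placeholders})", tuple(features.values()))
con.commit()
print(f"sqlite: {time.perf_counter() - t_start:.6f} seconds")

con.close()
```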
Tbh I have not used SQLite before, so I was not sure about the best way to set it up.
-
Hi Timon, sorry I haven't answered in this thread for so long; I actually haven't tested your code yet. But your results for SQLite look really good, and SQLite is typically very easy to use, so I fully agree with this direction. DuckDB might be more performant in some scenarios, but with such impressive speeds SQLite is probably very far from being a bottleneck for us.

One thing I want to add, though, is that I think we should keep the option to read and write CSVs. That is, SQLite could be the default, but storing results in CSV should still be supported as a text-based way to store data; that way we have SQLite for binary and CSV for text. I don't know that supporting other formats is really necessary, but supporting formats that allow trivial appending (i.e. incremental saving) of data should not be difficult if eventually needed. This rules out Pickle, Feather and Parquet. Msgpack and JSON are trivially appendable, and .npy can be appended too, but it's a bit more complicated: https://pypi.org/project/npy-append-array/
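To illustrate what "trivial appending" looks like with msgpack, here is a small sketch (assuming rows arrive one at a time as plain dicts; the file name and row contents are placeholders): each row is packed and appended to the same file, then streamed back with Unpacker.

```python
import msgpack  # pip install msgpack
import numpy as np


def append_row(path: str, row: dict) -> None:
    # append-binary mode: nothing previously written gets re-read or rewritten
    with open(path, "ab") as f:
        f.write(msgpack.packb(row))


for _ in range(3):
    row = {f"feature_{i}": float(np.random.random()) for i in range(100)}
    append_row("features_stream.msgpack", row)

# Read everything back by streaming over the concatenated msgpack objects
with open("features_stream.msgpack", "rb") as f:
    rows = list(msgpack.Unpacker(f, raw=False))
print(f"{len(rows)} rows recovered")
```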
-
For the purpose of supporting real-time online feature calculation during long experiments or offline processing of big datasets, we need a way to store data efficiently. Here are some options that already came up during the weekly meeting:
Text based: your good old CSV. Bad: awful space efficiency. Good: super-convenient for anyone to work with afterwards, since you can easily look at it or read it anywhere (even MS Excel for that matter).
Binary encoding: classic HDF5 format, widely used in science, supported in a variety of languages, so very cross-compatible and safe for long-term storage. Open-source too: https://github.com/HDFGroup/hdf5
Specialized time-series DBs: the advantage of using a database is that it's much easier to query if specific parts of the data need to be retrieved. They might also provide other benefits; for example, the InfluxDB website talks about percentile calculation, which is a big bottleneck in some of the feature calculations in PyNM. Also, with some DBs there is the option to hold the database entirely in memory, so we could potentially use one in place of Pandas.
Because of the need to run a local DB server, though, the database option seems impractical to me, except for SQLite. At any rate, I think ideally we should give the user the option to store the results in a variety of data formats, for example:
- CSV
- Feather
- Pickle
- HDF5
- Parquet
- SQLite
We're already using PyArrow, so barring HDF5 we can already support all of these. The only question would be which ones support appending without re-loading the dataset into memory. I think only CSV and SQLite (perhaps HDF5?) support that; for the rest, the data would have to be divided into chunks.
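For illustration, a hedged sketch of what per-batch appending could look like for the two formats that clearly support it (SQLite via pandas to_sql with if_exists="append", CSV via append mode); the table and file names are placeholders, not actual py_neuromodulation output paths:

```python
import os
import sqlite3

import numpy as np
import pandas as pd


def append_batch(batch: pd.DataFrame, con: sqlite3.Connection, csv_path: str) -> None:
    # SQLite: rows are inserted in place, previously written rows are never re-loaded
    batch.to_sql("features", con, if_exists="append", index=False)
    # CSV: append mode, writing the header only when the file is first created
    batch.to_csv(csv_path, mode="a", header=not os.path.exists(csv_path), index=False)


con = sqlite3.connect("features.db")
for _ in range(3):
    batch = pd.DataFrame([{f"feature_{i}": np.random.random() for i in range(100)}])
    append_batch(batch, con, "features.csv")
con.close()
```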
Thoughts?