Nm database #350
Conversation
SamedVossberg commented on Jun 26, 2024
- Integration of nm_database (new module)
- Changes in the run method: it can now receive a "save_csv" flag to save the full df as csv (otherwise only the head is saved)
- Adjusted the examples (they save the csv in their run methods, as I think it makes sense for the examples that people can look into the full csv files)
Thanks for those changes @SamedVossberg. I have a few more code requests related to type hints in the function declarations, and to making the number of samples / time interval between saves a function parameter.
@toni-neurosc Can you check if this PR is in line with your current GUI backend setup? It looks fine in my opinion
py_neuromodulation/nm_stream.py (Outdated)
```
if self.verbose:
    if last_time is not None:
        logger.debug("%.3f seconds of new data processed", time_[-1] - last_time)
        feature_dict["time"] = time_[-1] - last_time
    else:
        feature_dict["time"] = 0
    last_time = time_[-1]
```
I think the recording of the time is not quite right here. The logger should be outputting the seconds processed even for the first batch, and the recorded time should be either the start or the end of the batch, not its length. I have attempted a fix in my commit, check it out.
But here the intention was to say how much data was processed in the new batch, since this can actually vary with LSL.
Also, did you already push the commit?
Not yet, I'm doing a bunch of other changes. My point here is, there are 2 things you might want to calculate:
- How much data you have processed, in terms of time period: for this you of course take the difference between the last time of the previous batch and the first of the current batch. You display it in the logger for sure, but you can also save it, but where?
- The timestamp of each sample: since the timestamp for each single data point gets lost during feature calculation (due to resampling, averaging, and other transformations), you would default to using either the first timestamp of the batch, the last, or maybe the average.

Now the question would be: which of these belongs in the features_df["time"] field? Back in the office I proposed having 2 separate columns, or even 3 (an additional one for the batch number or processing timestamp).
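To make the distinction concrete, here is a rough sketch of the two-column idea; the function and column names are hypothetical, not the actual implementation:

```python
import numpy as np

def annotate_batch_times(feature_dict: dict, timestamps: np.ndarray,
                         last_time: float | None) -> float:
    """Hypothetical sketch: store both the amount of new data in this batch
    and a timestamp representing the batch itself (here: the last sample)."""
    feature_dict["batch_duration"] = 0.0 if last_time is None else float(timestamps[-1] - last_time)
    feature_dict["batch_timestamp"] = float(timestamps[-1])
    return float(timestamps[-1])  # becomes last_time for the next batch

# Usage with fake timestamps for two consecutive batches:
features: dict = {}
last = annotate_batch_times(features, np.linspace(0.0, 1.0, 250), None)
last = annotate_batch_times(features, np.linspace(1.0, 2.0, 250), last)
```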
py_neuromodulation/nm_stream_abc.py (Outdated)
```
@@ -155,6 +156,9 @@ def save_after_stream(
        self.save_sidecar(out_path_root, folder_name)

        if not save_csv:
            feature_arr = feature_arr.head()
```
I think it's better not to load the whole dataframe and subset it here; we can read just the first line from the SQLite database and pass it to this function.
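For illustration, reading just the first row from SQLite could look roughly like this; the table name, connection handling, and the helper's signature are assumptions, not the actual nm_database API:

```python
import sqlite3
import pandas as pd

def read_head(db_path: str, table: str = "features") -> pd.DataFrame:
    # Fetch a single row to recover the column structure without
    # loading the whole feature table into memory.
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql_query(f"SELECT * FROM {table} LIMIT 1", conn)
```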
So I have made a bunch of changes:
- Got rid of the out_path_root and folder_name arguments in all the internal functions, and handled everything at the Stream.run level.
- Removed the cast_values function from the database class, as I don't think it belongs there; I just inlined it in the stream.run method.
- Made some optimizations to the database after reading some articles on SQLite optimization, mainly to make sure we're going fast while not losing data in case of crashes or system failures (see the sketch after this list).
- Moved the create_table function inside insert_data, and the corresponding flag into the database class, so that internal database state is not mixed up with the stream class.
- Added a head() method to the database class to read only one line from the database and avoid reading all the data just to save the header.
- Changed the code relating to the timestamps; more details in the review comments.
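The SQLite optimizations mentioned above usually come down to a couple of pragmas that trade durability against speed; this is a generic sketch of the common combination, not necessarily the exact settings used in this PR:

```python
import sqlite3

conn = sqlite3.connect("features.db")  # file name is just an example
# WAL journaling keeps inserts fast while still surviving crashes;
# synchronous=NORMAL is the usual compromise between throughput and durability.
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")
```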
I'm fine when it comes to the GUI side; I will merge this into the gui branch and, if required, make changes there.
Btw, my changes made some tests fail somehow; I will commit the fixes later.
I also have a big refactoring in the works that I almost finished when I was in Berlin. I'll try to finish that ASAP and make a PR, because I don't want the GUI branch diverging too much from main either.
```
        self.save_after_stream(out_path_root, folder_name, feature_df, save_csv=save_csv)

        db.close()

        return feature_df
```
I'm not sure if returning the dataframe makes a lot of sense anymore, since it only holds one row when save_csv=False. I know it's needed for some examples, but maybe for offline analysis or custom analysis we should provide a different interface.
Yes, but I think it's still a nice interface to get a sense of the structure: which features were calculated for which channels.
The problem is, for long recordings, the moment the recording finishes it will try to load the entire database into memory, and if it's too big it will hang or crash. Luckily, thanks to the db itself, this wouldn't cause any data loss, but it's kinda awkward. For now I have added a return_df parameter, True by default.
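For such long recordings, an offline or custom analysis could read the stored features in chunks instead of all at once; a hedged sketch, with the table and file names assumed:

```python
import sqlite3
import pandas as pd

with sqlite3.connect("features.db") as conn:  # file name is an assumption
    for chunk in pd.read_sql_query("SELECT * FROM features", conn, chunksize=10_000):
        # Process each chunk separately instead of loading the full recording.
        print(chunk.shape)
```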
py_neuromodulation/nm_database.py (Outdated)
```
        self.csv_path = Path(csv_path)

        if os.path.exists(self.db_path):
            os.remove(self.db_path)
```
I think it's better to throw an error here than to straight-up delete a previous file if it exists; this could lead to unwanted data loss.
Very elegant solution with the incremental file names!
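For reference, one common way to implement such incremental file names looks roughly like this (a generic sketch, not the code from this PR):

```python
from pathlib import Path

def unique_path(path: Path) -> Path:
    # If "features.db" already exists, fall back to "features_1.db",
    # "features_2.db", ... instead of overwriting the previous file.
    candidate, counter = path, 1
    while candidate.exists():
        candidate = path.with_stem(f"{path.stem}_{counter}")
        counter += 1
    return candidate
```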
py_neuromodulation/nm_stream.py
```
            if not data_acquired:
                db.create_table(feature_dict)
                data_acquired = True
            db.insert_data(feature_dict)
```
I moved the table creation logic into the database class
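Roughly, moving the table-creation flag into the database class could look like this; a sketch under assumed names, not the actual nm_database implementation:

```python
import sqlite3

class FeatureDB:
    """Hypothetical sketch: create the table lazily on the first insert,
    so the stream loop no longer needs its own data_acquired flag."""

    def __init__(self, db_path: str, table: str = "features"):
        self.conn = sqlite3.connect(db_path)
        self.table = table
        self._table_created = False

    def insert_data(self, feature_dict: dict) -> None:
        if not self._table_created:
            cols = ", ".join(f'"{k}" REAL' for k in feature_dict)
            self.conn.execute(f"CREATE TABLE IF NOT EXISTS {self.table} ({cols})")
            self._table_created = True
        col_names = ", ".join(f'"{k}"' for k in feature_dict)
        placeholders = ", ".join("?" for _ in feature_dict)
        self.conn.execute(
            f"INSERT INTO {self.table} ({col_names}) VALUES ({placeholders})",
            list(feature_dict.values()),
        )
```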
py_neuromodulation/nm_stream.py (Outdated)
```
            l_features.append(feature_dict)
            feature_dict = db.cast_values(feature_dict)
```
I think that casting is not necessary; we should ensure that the feature outputs are always np.float64. And even if casting is necessary, I don't think the logic for casting should be inside of the database class.
Agree
py_neuromodulation/nm_stream.py (Outdated)
```
@@ -134,23 +140,38 @@ def _run(
                data_batch.astype(np.float64)
```
I realized we don't need this: mne_lsl returns the data in np.float64 format already, and astype actually makes a whole copy of the data, which is not good for performance.
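A quick illustration of that copy behavior; if a defensive cast is ever needed again, copy=False avoids the extra copy when the dtype already matches:

```python
import numpy as np

data_batch = np.zeros((8, 1000), dtype=np.float64)

copied = data_batch.astype(np.float64)            # copies even though the dtype matches
assert copied is not data_batch

same = data_batch.astype(np.float64, copy=False)  # returns the original array unchanged
assert same is data_batch
```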
py_neuromodulation/nm_stream.py (Outdated)
```
@@ -134,23 +140,38 @@ def _run(
                data_batch.astype(np.float64)
            )
            if is_stream_lsl:
                feature_dict["time"] = time_[-1]
                if self.verbose:
                    if last_time is not None:
                        logger.debug("%.3f seconds of new data processed", time_[-1] - last_time)
```
I don't think DEBUG level messages should depend on verbose, since DEBUG is usually disabled by default unless you set the logging level to DEBUG explicitly.
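A small illustration of why the extra guard is redundant: DEBUG records are already filtered by the logging level, so users opt in explicitly (the logger name here is an assumption):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("py_neuromodulation")  # name is an assumption

logger.debug("hidden: the effective level is INFO")

logger.setLevel(logging.DEBUG)
logger.debug("visible once the level is explicitly set to DEBUG")
```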
Good catch
```
while True:
    next_item = next(generator, None)

    if next_item is not None:
-       time_, data_batch = next_item
+       timestamps, data_batch = next_item
```
I changed this from time_ to timestamps, although I'm not sure it is consistent with RawGenerator at this point.
feature_dict["time"] = ( | ||
batch_length if is_stream_lsl else np.ceil(this_batch_end * 1000 + 1) | ||
) |
I'm not entirely sure this is right for 2 reasons:
- the "time" field should be consistent between file streams and lsl_streams, not having a duration for one and a time point for the other
- Not sure if the units are consistent between both types of streams right now
Later commit: Replace hyphens with underscores for re-referenced channel names. The branch was subsequently force-pushed from e38486a to a132fea, and from 0403734 to c0d9ff1.