Nm database #350

Merged: 33 commits from the nm_database branch into main on Aug 1, 2024
Conversation

SamedVossberg (Contributor):

  • Integration of nm_database (new module)
  • Changes to the run method: it can now receive a "save_csv" flag to save the full DataFrame as a CSV (otherwise only the head is saved)
  • Adjusted the examples (they save the CSV in their run calls, since it makes sense for the examples that people can look into the full CSV files)

@timonmerk (Contributor) left a comment:

Thanks for those changes @SamedVossberg. I have a few more code requests related to type hints in the function declarations, and to making the number of samples / time interval between saves a function parameter.
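
A hedged sketch of roughly what that could look like, with type hints and the save interval exposed as a parameter; the function name, the features table, and the save_interval parameter are illustrative assumptions, not the actual py_neuromodulation API:

```python
import sqlite3


def insert_data(
    conn: sqlite3.Connection,
    feature_dict: dict[str, float],
    save_interval: int = 10,
) -> None:
    """Insert one feature row; commit to disk only every save_interval rows."""
    columns = ", ".join(feature_dict)
    placeholders = ", ".join("?" for _ in feature_dict)
    conn.execute(
        f"INSERT INTO features ({columns}) VALUES ({placeholders})",
        tuple(feature_dict.values()),
    )
    # commit in batches to limit disk writes
    if conn.total_changes % save_interval == 0:
        conn.commit()
```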

Review threads (resolved):
  • py_neuromodulation/nm_database.py (outdated)
  • py_neuromodulation/nm_database.py (outdated)
  • py_neuromodulation/nm_stream.py (outdated)
  • tests/test_all_features.py
@timonmerk (Contributor) left a comment:

@toni-neurosc Can you check if this PR is in line with your current GUI backend setup? It looks fine in my opinion

if self.verbose:
    if last_time is not None:
        logger.debug("%.3f seconds of new data processed", time_[-1] - last_time)
        feature_dict["time"] = time_[-1] - last_time
    else:
        feature_dict["time"] = 0
    last_time = time_[-1]
Collaborator:

I think the recording of the time is not quite right here. The logger should also output the seconds processed for the first batch, and the recorded time should be either the start or the end of the batch, not its length. I have attempted a fix in my commit, check it out.

Contributor:

But here the intention was to say how much data was processed in the new batch, since this can actually vary with LSL

Contributor:

Also, did you already push the commit?

Collaborator:

Not yet, I'm doing a bunch of other changes. My point here is that there are two things you might want to calculate:

  • How much data you have processed, in terms of time period: for this you of course take the difference between the last timestamp of the previous batch and the first of the current batch. You certainly display it in the logger, and you could also save it, but where?
  • The timestamp of each sample: since the timestamp of each single data point gets lost during feature calculation (due to resampling, averaging, and other transformations), you would default to using either the first timestamp of the batch, the last, or maybe the average.

Now the question would be: which of these belongs in the features_df["time"] field? Back in the office I proposed having two separate columns, or even three (an additional one for the batch number or a processing timestamp).
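
A minimal sketch of the two quantities, assuming timestamps holds the LSL timestamps of the current batch and last_time the last timestamp of the previous batch; the column names are only a suggestion, not code from the PR:

```python
import numpy as np


def batch_time_fields(timestamps: np.ndarray, last_time: float | None) -> dict[str, float]:
    """Compute both time quantities as separate columns."""
    return {
        # how much new data this batch covers, relative to the previous batch
        "batch_duration": float(timestamps[-1] - last_time) if last_time is not None else 0.0,
        # one representative timestamp for the whole batch (first, last, or mean would also work)
        "batch_timestamp": float(timestamps[-1]),
    }


feature_dict = batch_time_fields(np.array([0.0, 0.1, 0.2]), last_time=None)
```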

@@ -155,6 +156,9 @@ def save_after_stream(

        self.save_sidecar(out_path_root, folder_name)

        if not save_csv:
            feature_arr = feature_arr.head()
Collaborator:

I think it's better not to load the whole dataframe and subset it here; we can read just the first line from the SQLite database and pass it to this function.
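
A hedged sketch of reading only the first row, assuming the features end up in a table called features (the table name and the helper's signature are assumptions):

```python
import sqlite3

import pandas as pd


def head(db_path: str, n_rows: int = 1) -> pd.DataFrame:
    """Read only the first n_rows from the database instead of the whole table."""
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql_query(f"SELECT * FROM features LIMIT {n_rows}", conn)
```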

@toni-neurosc (Collaborator) left a comment:

So I have made a bunch of changes:

  • Got rid of out_path_root and folder_name in all the internal functions, and handled everything at the Stream.run level.
  • Removed the cast_values function from the database class, as I don't think it belongs there; I inlined it in the stream.run method instead.
  • Made some optimizations to the database after reading some articles on SQLite optimization, mainly to make sure we go fast while not losing data in case of crashes or system failures (a sketch of the idea follows this comment).
  • Moved the create_table call inside insert_data, and the corresponding flag into the database class, so that internal database state is not mixed up with the stream class.
  • Added a head() method to the database class to read only 1 line from the database and avoid loading all the data just to save the header.
  • Changed the code relating to the timestamps; more details in the review comments.

I'm fine when it comes to the GUI side; I will merge this into the gui branch and make changes there if required.

Btw, I somehow made some tests fail; I will commit the fixes later.

I also have a big refactoring in the works that I almost finished while I was in Berlin. I'll try to finish it ASAP and make a PR, because I don't want the GUI branch diverging too much from main either.
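
The exact optimizations aren't shown in this diff, but a common combination of SQLite settings for fast yet crash-safe streaming writes looks roughly like this (an assumption for illustration, not necessarily what this PR uses):

```python
import sqlite3

conn = sqlite3.connect("features.db")  # hypothetical file name

# Write-ahead logging: the writer does not block readers, and committed
# batches survive a process crash.
conn.execute("PRAGMA journal_mode=WAL")

# Sync less aggressively than the default FULL: considerably faster inserts,
# while the WAL still protects committed data against application crashes.
conn.execute("PRAGMA synchronous=NORMAL")
```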


        self.save_after_stream(out_path_root, folder_name, feature_df, save_csv=save_csv)

        db.close()

        return feature_df

Collaborator:

I'm not sure if returning the dataframe makes a lot of sense anymore, since it only holds 1 row when save_csv=False. I know it's needed for some examples, but maybe for offline or custom analysis we should provide a different interface.

Contributor:

Yes, but I think it's still a nice interface to see the structure of which features were calculated for which channels.

Collaborator:

The problem is that, for long recordings, the moment the recording finishes it will try to load the entire database into memory, and if it's too big it will hang or crash. Luckily, thanks to the db itself, this wouldn't cause any data loss, but it's kinda awkward. For now I have added a return_df parameter, True by default.
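
For offline or custom analysis of long recordings, the results could later be read back from the SQLite file in chunks instead of as one huge DataFrame; a sketch, assuming a features table and the hypothetical file name features.db:

```python
import sqlite3

import pandas as pd

with sqlite3.connect("features.db") as conn:
    # Iterate over the stored features in manageable pieces.
    for chunk in pd.read_sql_query("SELECT * FROM features", conn, chunksize=10_000):
        print(chunk.shape)  # stand-in for the user's own analysis
```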

        self.csv_path = Path(csv_path)

        if os.path.exists(self.db_path):
            os.remove(self.db_path)
Collaborator:

I think it's better to throw an error here than to straight-up delete a previous file if it exists; this could lead to unwanted data loss.

Contributor:

Very elegant solution with the incremental file names!
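
The incremental naming idea could look roughly like this (a sketch of the concept, not the code from the PR):

```python
from pathlib import Path


def unique_db_path(base: Path) -> Path:
    """Return base, or base with an increasing numeric suffix, so no existing file is overwritten."""
    if not base.exists():
        return base
    counter = 1
    while (candidate := base.with_name(f"{base.stem}_{counter}{base.suffix}")).exists():
        counter += 1
    return candidate


db_path = unique_db_path(Path("features.db"))
```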


        if not data_acquired:
            db.create_table(feature_dict)
            data_acquired = True
        db.insert_data(feature_dict)
Collaborator:

I moved the table creation logic into the database class


        l_features.append(feature_dict)
        feature_dict = db.cast_values(feature_dict)
Collaborator:

I think that casting is not necessary; we should ensure that the feature outputs are always np.float64. And even if casting were necessary, I don't think the casting logic should live inside the database class.

Contributor:

Agree

Review thread on py_neuromodulation/nm_database.py (outdated, resolved)
@@ -134,23 +140,38 @@ def _run(
            data_batch.astype(np.float64)
Collaborator:

I realized we don't need this: mne_lsl already returns the data in np.float64 format, and astype actually makes a whole copy of the data, which is not good for performance.
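
If a dtype guarantee is ever needed again, np.asarray avoids the copy whenever the data is already float64; a small illustration, not code from the PR:

```python
import numpy as np

data_batch = np.ones((4, 1000), dtype=np.float64)

# astype allocates a new array by default, even when the dtype already matches
copied = data_batch.astype(np.float64)
assert copied is not data_batch

# asarray with an explicit dtype is a no-op when the dtype already matches
same = np.asarray(data_batch, dtype=np.float64)
assert same is data_batch
```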

@@ -134,23 +140,38 @@ def _run(
                data_batch.astype(np.float64)
            )
            if is_stream_lsl:
                feature_dict["time"] = time_[-1]
            if self.verbose:
                if last_time is not None:
                    logger.debug("%.3f seconds of new data processed", time_[-1] - last_time)
Collaborator:

I don't think DEBUG level messages should depend on verbose, since DEBUG is usually disabled by default unless you explicitly set the logging level to DEBUG.
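
The standard-library behaviour that makes the extra gate redundant, shown in isolation (independent of py_neuromodulation):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("example")

# Dropped silently: DEBUG is below the configured INFO level, so wrapping
# logger.debug in an additional verbose check is redundant.
logger.debug("%.3f seconds of new data processed", 0.1)

# Only after the level is explicitly lowered does the message get emitted.
logger.setLevel(logging.DEBUG)
logger.debug("now this message is emitted")
```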

Contributor:

Good catch


        while True:
            next_item = next(generator, None)

            if next_item is not None:
-               time_, data_batch = next_item
+               timestamps, data_batch = next_item
Collaborator:

I changed this from time_ to timestamps, although I'm not sure it's consistent with RawGenerator at this point.

Comment on lines +153 to +155:

        feature_dict["time"] = (
            batch_length if is_stream_lsl else np.ceil(this_batch_end * 1000 + 1)
        )
Collaborator:

I'm not entirely sure this is right, for two reasons:

  • The "time" field should be consistent between file streams and LSL streams, not a duration for one and a time point for the other.
  • I'm not sure the units are consistent between the two types of streams right now.

@toni-neurosc merged commit 88d15aa into main on Aug 1, 2024 (6 checks passed).
@toni-neurosc deleted the nm_database branch on August 29, 2024 at 18:56.