Move nested CV into a separate file #4


Merged 2 commits into main on Jul 13, 2023

Conversation


@uabua uabua commented Jul 13, 2023

I've exported the nested CV to a separate file, but I'm not entirely satisfied with the current structure. The process involves first manipulating the data in the notebook, saving it as a pickle, then running a separate script, and finally bringing the results back to this notebook. Do you think it would be better to have the data processing as a separate script?
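The round trip described above (preprocess in the notebook, pickle, run a separate script, load results back) can be sketched roughly like this; the file names, the data, and the placeholder results are illustrative assumptions, not the repo's actual layout:

```python
import pickle

# In the notebook: save the preprocessed data for the nested-CV script.
data = {"X": [[1.0, 2.0], [3.0, 4.0]], "y": [0, 1]}  # toy stand-in data
with open("preprocessed.pkl", "wb") as f:
    pickle.dump(data, f)

# In the separate script: load the data, run nested CV, save the results.
with open("preprocessed.pkl", "rb") as f:
    data = pickle.load(f)
results = {"rf_phylum": {"score": 0.9}}  # placeholder for real CV output
with open("results.pkl", "wb") as f:
    pickle.dump(results, f)

# Back in the notebook: load the results for analysis and plotting.
with open("results.pkl", "rb") as f:
    results = pickle.load(f)
```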

Additionally, I've been struggling to calculate memory usage, so I could not do it. About the timing, right now, I'm tracking the time for each model within the outer loop. So, if we have, let's say, 100 iterations, we end up with separate entries in the results dictionary for each model and dataset (including taxonomic levels). However, I'm wondering if it would be more appropriate to have a cumulative time for each model type instead. For example, all the Random Forests together. What do you think?
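The per-iteration timing scheme described above might look roughly like this minimal sketch; the model names, fold count, and stand-in workloads are assumptions for illustration:

```python
import time

# Stand-ins for model fitting (inner CV + training would go here).
models = {"random_forest": lambda: sum(range(10_000)),
          "logistic_regression": lambda: sum(range(5_000))}

results = {}  # one timing entry per (model, outer fold), as described

for fold in range(3):  # outer CV loop
    for name, fit in models.items():
        start = time.perf_counter()
        fit()  # placeholder for the actual inner-loop work
        results[(name, fold)] = time.perf_counter() - start

print(len(results))  # 6 entries: 2 models x 3 outer folds
```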

@uabua uabua requested a review from natasha-dudek July 13, 2023 12:59
@uabua uabua self-assigned this Jul 13, 2023

@natasha-dudek natasha-dudek left a comment


Question 1: I've exported the nested CV to a separate file, but I'm not entirely satisfied with the current structure. The process involves first manipulating the data in the notebook, saving it as a pickle, then running a separate script, and finally bringing the results back to this notebook. Do you think it would be better to have the data processing as a separate script?

Since this is a tutorial, I think it is important to walk through the pre-processing steps in the notebook itself. It is kind of inconvenient for us as the makers of the tutorial, but I don't think there is a good workaround in this situation.

Question 2: Additionally, I've been struggling to calculate memory usage, so I could not do it.

No worries, thanks for looking into it.

Question 3: About the timing, right now, I'm tracking the time for each model within the outer loop. So, if we have, let's say, 100 iterations, we end up with separate entries in the results dictionary for each model and dataset (including taxonomic levels). However, I'm wondering if it would be more appropriate to have a cumulative time for each model type instead. For example, all the Random Forests together. What do you think?

Let's keep them as is. We can always add them up to find the cumulative time for each model type, should we want to calculate that.
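Should cumulative times ever be wanted, the per-iteration entries can simply be summed by model type; the keys and timings below are hypothetical:

```python
from collections import defaultdict

# Per-(model, outer fold) timings in seconds, as stored in the results dict.
per_iteration = {("random_forest", 0): 1.2, ("random_forest", 1): 1.4,
                 ("svm", 0): 0.8, ("svm", 1): 0.7}

# Sum the fold-level timings for each model type.
cumulative = defaultdict(float)
for (model, _fold), seconds in per_iteration.items():
    cumulative[model] += seconds

print({m: round(s, 1) for m, s in cumulative.items()})
```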

Minor change request: I just noticed an oversight of mine from earlier on - please remove BMI from the feature set we use to train the model (since BMI is calculated from weight and height, which are both in there).
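Dropping the redundant column could look like this minimal sketch; the column names and rows are toy stand-ins, not the tutorial's actual dataset:

```python
# BMI is derived from weight and height, so it is removed before training.
header = ["weight", "height", "BMI", "age"]
rows = [[70, 1.70, 24.2, 30],
        [80, 1.80, 24.7, 40]]

# Keep every column except BMI, in both the header and the data rows.
keep = [i for i, name in enumerate(header) if name != "BMI"]
header = [header[i] for i in keep]
rows = [[row[i] for i in keep] for row in rows]

print(header)  # ['weight', 'height', 'age']
```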

Aside from that, looks good.

@uabua uabua requested a review from natasha-dudek July 13, 2023 19:47

uabua commented Jul 13, 2023

I removed "BMI" from the features, as suggested. 🚀


@natasha-dudek natasha-dudek left a comment


Excellent, thanks!

@uabua uabua merged commit 33ed135 into main Jul 13, 2023
@uabua uabua deleted the f/separate_nested_cv branch July 13, 2023 19:53