-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add usecols and flatten to to_dataframe feature. #330
Conversation
Codecov Report
@@ Coverage Diff @@
## master #330 +/- ##
==========================================
+ Coverage 76.20% 76.24% +0.04%
==========================================
Files 43 43
Lines 7090 7103 +13
==========================================
+ Hits 5403 5416 +13
Misses 1687 1687
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this is basically what I had in mind. The solution is pretty simple, but it wasn't clear to me from #327 that we had reached consensus on whether or not this was in scope or not. @csadorf what are your thoughts?
If we do choose to move forward with this, I don't want to support both whitelists and blacklists. Constructing one from another should be easy (something like desired_keys = set.union([list(job.doc.keys()) for job in project].difference(undesired_keys)
), and I find APIs that support both are invariably confusing to users no matter how well they are documented because of questions on how the two interact. My personal preference would be for a whitelist, because I think that's the much more likely use case (especially if our focus is on the "large job documents" case), but I could be convinced otherwise.
@vyasr I think adding this feature is in scope. I disagree about supporting only whitelists or only blacklists. I don't think that "constructing one from another should be easy." The one-liner that you wrote is pretty complicated and I think this is a reasonable convenience feature for users. I tried to address the potential for ambiguity on how the two interact by explicitly stating that "excludes override includes" in the docs. |
Since we're dealing with pandas here, we should use pandas as a guide. All of its I'm a bit confused since you were originally of the viewpoint that this was not in scope for signac. Was it was my argument about large files and performance that changed your mind? If so, since Simon was also originally against this I'd be curious what his thoughts are now. |
@vyasr I like your idea to mirror the You're correct, I changed my mind about scope after considering the poor performance that would result from adding unwanted job document fields. Job documents are frequently large and there's no need to copy all of that into the dataframe if it's not requested. |
Also in favor if mirroring the pandas API as much as reasonable/possible. 👍 |
@bdice I'm taking @csadorf's last comment as tacit approval of moving forward. Since the two of you were the ones who were originally against I think we're good to move forward. Once you make the On the issue, you also raised the question of |
Also in favor of having the option to automatically flatten. Working with data that has more than two dimensions in data frames is an absolute pain. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One suggestion for the doc-string, otherwise LGTM! 👍
@vyasr You can merge this when approved. 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I edited docstrings a tiny bit, otherwise ready to merge.
Description
Add feature for including/excluding state point and document keys from an exported pandas DataFrame.
Motivation and Context
Resolves #327.
Types of Changes
1The change breaks (or has the potential to break) existing functionality.
Checklist:
If necessary:
Example for a changelog entry:
Fix issue with launching rockets to the moon (#101, #212).