Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add usecols and flatten to to_dataframe feature. #330

Merged
merged 12 commits into from
May 28, 2020

Conversation

bdice
Copy link
Member

@bdice bdice commented Apr 28, 2020

Description

Add feature for including/excluding state point and document keys from an exported pandas DataFrame.

Motivation and Context

Resolves #327.

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change1

1The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

Example for a changelog entry: Fix issue with launching rockets to the moon (#101, #212).

@bdice bdice requested review from a team as code owners April 28, 2020 16:25
@bdice bdice requested review from mikemhenry, Tobias-Dwyer and vyasr and removed request for a team, mikemhenry and Tobias-Dwyer April 28, 2020 16:25
@bdice
Copy link
Member Author

bdice commented Apr 28, 2020

@vyasr This is what I had in mind for solving #327. Let me know what you think.

@codecov
Copy link

codecov bot commented Apr 28, 2020

Codecov Report

Merging #330 into master will increase coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #330      +/-   ##
==========================================
+ Coverage   76.20%   76.24%   +0.04%     
==========================================
  Files          43       43              
  Lines        7090     7103      +13     
==========================================
+ Hits         5403     5416      +13     
  Misses       1687     1687              
Impacted Files Coverage Δ
signac/contrib/project.py 91.55% <100.00%> (+0.15%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 36f638d...ffc31c5. Read the comment docs.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this is basically what I had in mind. The solution is pretty simple, but it wasn't clear to me from #327 that we had reached consensus on whether or not this was in scope or not. @csadorf what are your thoughts?

If we do choose to move forward with this, I don't want to support both whitelists and blacklists. Constructing one from another should be easy (something like desired_keys = set.union([list(job.doc.keys()) for job in project].difference(undesired_keys)), and I find APIs that support both are invariably confusing to users no matter how well they are documented because of questions on how the two interact. My personal preference would be for a whitelist, because I think that's the much more likely use case (especially if our focus is on the "large job documents" case), but I could be convinced otherwise.

@bdice
Copy link
Member Author

bdice commented Apr 30, 2020

@vyasr I think adding this feature is in scope. I disagree about supporting only whitelists or only blacklists. I don't think that "constructing one from another should be easy." The one-liner that you wrote is pretty complicated and I think this is a reasonable convenience feature for users. I tried to address the potential for ambiguity on how the two interact by explicitly stating that "excludes override includes" in the docs.

@vyasr
Copy link
Contributor

vyasr commented Apr 30, 2020

Since we're dealing with pandas here, we should use pandas as a guide. All of its read_* functions provide a usecols argument, which is a whitelist. However, this argument is ducktyped to support specifying integer columns, or in the case of wanting a blacklist, a callable that could basically be lambda key: key not in UNDESIRED_KEYS. IMO that would be a cleaner solution for producing an exclusion list. I saw your documentation and I think it is good, I just think a less confusing API choice that doesn't require reading the documentation is greatly preferable if possible.

I'm a bit confused since you were originally of the viewpoint that this was not in scope for signac. Was it was my argument about large files and performance that changed your mind? If so, since Simon was also originally against this I'd be curious what his thoughts are now.

@bdice
Copy link
Member Author

bdice commented May 2, 2020

@vyasr I like your idea to mirror the usecols syntax of read_csv. It would be a whitelist of columns or a callable that returns True for desired keys and False otherwise.

You're correct, I changed my mind about scope after considering the poor performance that would result from adding unwanted job document fields. Job documents are frequently large and there's no need to copy all of that into the dataframe if it's not requested.

@csadorf
Copy link
Contributor

csadorf commented May 2, 2020

Also in favor if mirroring the pandas API as much as reasonable/possible. 👍

@bdice bdice added this to the v1.5.0 milestone May 7, 2020
@vyasr
Copy link
Contributor

vyasr commented May 12, 2020

@bdice I'm taking @csadorf's last comment as tacit approval of moving forward. Since the two of you were the ones who were originally against I think we're good to move forward. Once you make the usecols change I can review again.

On the issue, you also raised the question of exclude_const and nested statepoints. Now that I'm thinking about it, is there a good reason not to unpack nested statepoints? Obviously with heterogeneous schema many things could end up being null values, but I think that's OK since the user is asking for a tabular representation. What if we added an option unpack_nested, and then have exclude_const always imply unpack_nested=True?

@csadorf
Copy link
Contributor

csadorf commented May 13, 2020

Also in favor of having the option to automatically flatten. Working with data that has more than two dimensions in data frames is an absolute pain.

@bdice bdice requested review from csadorf and vyasr May 24, 2020 19:32
@bdice
Copy link
Member Author

bdice commented May 24, 2020

@vyasr @csadorf This is ready for another round of review. I changed to the usecols syntax and added a flatten option for nested dicts.

@bdice bdice changed the title Add includes/excludes to to_dataframe feature. Add usecols and flatten to to_dataframe feature. May 25, 2020
@bdice bdice self-assigned this May 25, 2020
@bdice bdice added the enhancement New feature or request label May 25, 2020
Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestion for the doc-string, otherwise LGTM! 👍

@bdice
Copy link
Member Author

bdice commented May 28, 2020

@vyasr You can merge this when approved. 👍

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I edited docstrings a tiny bit, otherwise ready to merge.

@vyasr vyasr merged commit 842a07f into master May 28, 2020
@vyasr vyasr deleted the feature/df-includes-excludes branch May 28, 2020 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Request subset of columns in JobsCursor.to_dataframe
3 participants