Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add columns argument to read_gbq #15

Merged
merged 4 commits into from
Oct 13, 2021
Merged

Add columns argument to read_gbq #15

merged 4 commits into from
Oct 13, 2021

Conversation

bnaul
Copy link
Contributor

@bnaul bnaul commented Oct 11, 2021

Seems like a pretty important piece of functionality for a columnar database 🙂 I think maybe it just got dropped in the transition between #1 and #4? Hopefully there's not something I'm missing that makes it more complicated than it was before...

One small note, dd.read_parquet which seems most analogous allows a single string column name in which case you get back a dd.Series instead. That seemed a little too magical to me so I did not include it, if we think aligning with dd.read_parquet is important then it'd be easy to add.

@ncclementi
Copy link
Contributor

ncclementi commented Oct 12, 2021

@bnaul I think it might have ended up stripped out when we removed partitions and partitions field by mistake but this a fair point. I think James pinged you here #4 (comment) and it probably slipped through. We weren't sure if this was a common feature or something specific to your initial application.

Currently, the PRs don't have the full CI running due to a limitation on the Github action that we are using to authenticate (it only runs on push). I'm not sure how to handle this, pinging @jrbourbeau for ideas.

@bnaul
Copy link
Contributor Author

bnaul commented Oct 12, 2021

oops yep totally missed that. the partition_field stuff is definitely a bit weird and I don’t actually know if we even use it anymore (although I can definitely see a lot of use cases for it). columns I think is essential though, for the same reasons as in dd.read_parquet or other columnar formats.

@ncclementi
Copy link
Contributor

@bnaul I just gave you push access to the project, so we can run the full CI. Would you mind pushing all the PRs to separate branches so the CI runs, and then we can merge from there?
I apologize about this, but we haven't found a way of running the CI with the secrets on PRs.

@bnaul
Copy link
Contributor Author

bnaul commented Oct 12, 2021

@ncclementi just pushed, I can't change the fork for this PR but it looks like pushing the same commit to coiled/{branch} is enough for the Action to run and resolve to this PR

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bnaul! Totally agree column selection is something we should support

@@ -82,3 +82,16 @@ def test_read_kwargs(dataset, client):

with pytest.raises(Exception, match="504 Deadline Exceeded"):
ddf.compute()


def test_read_columns(df, dataset, client):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this test looks great. One small suggestions: could we add

assert df.shape[1] > 1

to ensure that the original DataFrame has more than one column? Otherwise, in the future the example DataFrame could be updated to only have a single "name" column and this test would still pass (this is unlikely, but possible)

@@ -2,6 +2,7 @@

from contextlib import contextmanager
from functools import partial
from typing import List
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: I'm by no means a type annotation expert, but from reviewing dask/distributed#5328, I think the recommend approach for a list would be to add from __future__ import annotations and then just use the built-in list in the type annotation (i.e. list[str]). This isn't meant to be a blocking comment as what you have here is valid. Just meant as an FYI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah, good call..I'm used to list[] for 3.9 but forgot about the __future__ option

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bnaul! This is in

@jrbourbeau jrbourbeau merged commit 0bcc8a7 into coiled:main Oct 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants