Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add info method to dataset #1176

Merged
merged 6 commits into from
Dec 23, 2016
Merged

add info method to dataset #1176

merged 6 commits into from
Dec 23, 2016

Conversation

jhamman
Copy link
Member

@jhamman jhamman commented Dec 21, 2016

I don't know if this is exactly what we want but here's an idea that emulates ncdump -h. I'm sure people will have thoughts on the implementation and output so I'll just throw this first cut up and let people discuss.

closes #1150
fixes #244

xref: #1044, #820

@max-sixty
Copy link
Collaborator

Nice!

Why pass a buffer into that function though? Why not return a string and the user do what they want with it?

@jhamman
Copy link
Member Author

jhamman commented Dec 21, 2016

Good question. I tried to merge the output format from ncdump -h with the api of Pandas.DataFrame.info. I'm certainly not tied to the return format I choose.

@shoyer
Copy link
Member

shoyer commented Dec 21, 2016

I never noticed that DataFrame.info() doesn't actually have a return value. It does seems strange to write to a buffer taken as an argument but I can see one reason why it sort of makes sense -- otherwise you get quotation marks printed around the returned string.

@max-sixty
Copy link
Collaborator

Yes interesting, I didn't know that either.
IIRC there are also methods that are ipython & jupyter specific, that can return repl & notebook outputs

@fmaussion
Copy link
Member

Thanks @jhamman !

About the name: what about just .info(), or .var_info() ? the additional info (which isn't in the repr already) is more about variables, right?

@jhamman
Copy link
Member Author

jhamman commented Dec 21, 2016

I'm also thinking .info() seems like a better name. It would be fairly trivial to add additional argument options to fill in the pandas .info() functionality.

@@ -147,6 +147,10 @@ Enhancements
plots (:issue:`897`). See :ref:`plotting.figsize` for more details.
By `Stephan Hoyer <https://github.com/shoyer>`_ and
`Fabien Maussion <https://github.com/fmaussion>`_.
- New :py:meth:`~Dataset.attr_info` method to summarize ``Dataset`` variables
and attributes. The method produces a stirng output similar to what the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'stirng' typo

@jhamman jhamman changed the title add attr_info method to dataset add info method to dataset Dec 22, 2016
@jhamman
Copy link
Member Author

jhamman commented Dec 22, 2016

Tests are passing on Python 2 now.

@shoyer - how do you feel about info() for the method name?

@shoyer
Copy link
Member

shoyer commented Dec 22, 2016

I'm pretty happy with .info(), as long as that indicates that we aren't wedded to a display matching ncdump -h. In particular displaying the size of each variable in memory in appropriate units could be a nice addition (for later, doesn't need to be in this PR).

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good to me, ju

lines.append('\t:{k} = {v} ;'.format(k=k, v=v))
lines.append('}')

formatting._put_lines(buf, lines)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably just use buf.write(u'\n'.join(lines)) here, which will work as long as you ensure all elements in lines are the same (unicode/str) type.

lines.append('xarray.Dataset {')
lines.append('dimensions:')
for name, size in self.dims.items():
lines.append('\t{name} = {size} ;'.format(name=name, size=size))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these probably should all be unicode (using u literals), otherwise this will break for non-ASCII characters on Python 2. Take a look at what things look like in formatting.py.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be good to add a test for attributes with non-ASCII values.

@@ -12,6 +12,10 @@
import dask.array as da
except ImportError:
pass
try:
from io import StringIO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might put this in in pycompat instead

lines.append(u'\t:{k} = {v} ;'.format(k=k, v=v))
lines.append(u'}')

lines = [ensure_valid_repr(line) for line in lines]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this. I think you can write unicode to string buffers, even on Python 2 (at least it works for sys.stdout and StringIO).

Copy link
Member Author

@jhamman jhamman Dec 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without it, we get:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 363: ordinal not in range(128)

I've always had a tough time getting the string/bytes/unicode stuff straight so I'm open to other ideas here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the issue is that you're using StringIO from cStringIO, which only handles bytes, not unicode. Instead, use StringIO from io, on both Python 2 and 3 (no need for the separate compatibility module even).

@jhamman
Copy link
Member Author

jhamman commented Dec 23, 2016

@shoyer --- green.

@jhamman jhamman merged commit 8192190 into pydata:master Dec 23, 2016
@jhamman jhamman deleted the feature/attr_info branch December 23, 2016 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"ncdump -h" like repr? we need some way to see the attributes on all arrays in a dataset
5 participants