Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PLT: Cleaner plotting backend API, and unify Series and DataFrame accessors #27009

Merged
merged 33 commits into from
Jul 3, 2019
Merged

PLT: Cleaner plotting backend API, and unify Series and DataFrame accessors #27009

merged 33 commits into from
Jul 3, 2019

Conversation

datapythonista
Copy link
Member

@datapythonista datapythonista commented Jun 23, 2019

The main changes are:

The API that backends need to implement is simply a function plot(data, kind, **kwargs). After testing different approaches (abstract class previously mentioned and others), I think this is what makes things simpler in both ends pandas and the backends. For hvplot hvPlot.__call__ is already what needs to be implemented, and this change will mainly require to delete a bit of code. For pdvega a bit more work will be needed, but this will help remove a significant amount of repetitive code.

Second thing, more in the pandas side, I unify the accessors of Series and DataFrame. There will be minor changes, like unified documentation (which I think it's great, but docstrings will have to be generic), and also Series.plot.scatter will exist (which now doesn't) but will generate an exception if called. Not a big deal I'd say, because now it's already possible to do Series.plot(kind='scatter'), so it'll just make both cases consistent. Same things apply to hexbin.

CC: @pandas-dev/pandas-core @philippjfr @jakevdp

@datapythonista datapythonista added Visualization plotting Clean Needs Discussion Requires discussion from core team before further action labels Jun 23, 2019
@TomAugspurger
Copy link
Contributor

I'm still a bit unclear on how the dispatching to backends will work. IIUC, on master we structure things as DataFrame.plot.__call__(kind, ...), which selects the right method based on the kind. So DataFrame.plot(kind='line') calls DataFrame.plot.line which does its own thing.

Now we're going to change things a bit so that DataFrame.plot.line calls DataFrame.plot.__call__ with kind=line.

Does that have a direct impact on how backends will provide their own implementations?


Plotting methods can also be accessed by calling the accessor as a method
with the ``kind`` argument:
``s.plot(kind='line')`` is equivalent to ``s.plot.line()``
"""
def __init__(self, data):
assert isinstance(data, (pandas.Series, pandas.DataFrame))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this assert?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a comment saying that data had to be Series or DataFrame, I thought that an assert would say the same and also check. No other reason, and happy to get rid of it.

'bootstrap_plot', 'lag_plot', 'parallel_coordinates', 'radviz',
'scatter_matrix', 'register', 'deregister']
def plot(data, kind, **kwargs):
if isinstance(data, pandas.Series):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ABCSeries?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense; not sure if copied this, but will change it.

@datapythonista
Copy link
Member Author

Now we're going to change things a bit so that DataFrame.plot.line calls DataFrame.plot.__call__ with kind=line.

Not exactly. I didn't modify that part (besides merging the accessors for Series and DataFrame). In master DataFrame.plot.line calls the accessor line method, and the line method calls __call__. And this didn't change.

The __call__ method was calling plot_frame and plot_series, and those were calling plot. plot was instantiating/calling LinePlot() (LinePlot` was the public part of the backend). This is the part that changed.

Now __call__ will call plot in the backend (plot is the public part of the backend). And plot will instantiate and call LinePlot and the other classes. I think this is much better, since all backends will run some common code first no matter the plot kind.

@codecov
Copy link

codecov bot commented Jun 25, 2019

Codecov Report

Merging #27009 into master will decrease coverage by 50.16%.
The diff coverage is 41.79%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #27009       +/-   ##
===========================================
- Coverage   91.99%   41.83%   -50.17%     
===========================================
  Files         180      180               
  Lines       50774    50673      -101     
===========================================
- Hits        46712    21199    -25513     
- Misses       4062    29474    +25412
Flag Coverage Δ
#multiple ?
#single 41.83% <41.79%> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 34.99% <100%> (-62.02%) ⬇️
pandas/core/series.py 45.45% <100%> (-48.22%) ⬇️
pandas/plotting/_matplotlib/__init__.py 56% <26.66%> (-44%) ⬇️
pandas/plotting/_core.py 46.87% <44%> (-42.52%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/gcs.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
... and 136 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f0919f2...f00ec30. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 25, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@ad2e98c). Click here to learn what that means.
The diff coverage is 36.84%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #27009   +/-   ##
=========================================
  Coverage          ?   41.82%           
=========================================
  Files             ?      180           
  Lines             ?    50683           
  Branches          ?        0           
=========================================
  Hits              ?    21197           
  Misses            ?    29486           
  Partials          ?        0
Flag Coverage Δ
#single 41.82% <36.84%> (?)
Impacted Files Coverage Δ
pandas/plotting/_core.py 26.66% <ø> (ø)
pandas/plotting/_matplotlib/timeseries.py 12.75% <0%> (ø)
pandas/core/series.py 45.45% <100%> (ø)
pandas/core/frame.py 34.99% <100%> (ø)
pandas/plotting/_matplotlib/__init__.py 60% <33.33%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ad2e98c...2597bc9. Read the comment docs.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As said on the issue, +1 on such a simpler functional API!

figsize=figsize, bins=bins, **kwds)


hist_series.__doc__ = PlotAccessor.hist.__doc__
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave the original hist_series and hist_frame intact, as the accessor's docstring version is not correct for them (different keywords, different order of the keywords, ...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the same for boxplot below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't realize, good point, thanks!

@datapythonista datapythonista changed the title WIP/PLT: Cleaner plotting backend API, and unify Series and DataFrame accessors PLT: Cleaner plotting backend API, and unify Series and DataFrame accessors Jun 26, 2019
@datapythonista datapythonista removed the Needs Discussion Requires discussion from core team before further action label Jun 26, 2019
Plot a whole dataframe to a bar plot. Each column is assigned a
distinct color, and each row is nested in a group along the
horizontal axis.
def hist_frame(data, column=None, by=None, grid=True, xlabelsize=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to move those non-accessor functions down in the file, but for the (already large) diff it might make sense to keep them at the top where they were (those ones are just a copy paste I think), and only do the move in a follow-up PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense, but surely better in a follow up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@datapythonista I actually meant moving them back here, so we can do the actual move in a follow-up
(the diff is right now basically not informative)

@datapythonista
Copy link
Member Author

Still working on the failing tests here. Of the two that are failing locally, the problem can be reproduced with this simple example:

import pandas

df = pandas.DataFrame({'a': [.19, .78, .27],
                       'b': [.62, .77, .80]})
df.plot()
df.plot(y='a')

In master this generates two plots, but in this branch is generating a single plot for both calls. Couldn't figure out why yet.

@jreback jreback added the Blocker Blocking issue or pull request for an upcoming release label Jun 28, 2019
@jorisvandenbossche jorisvandenbossche added this to the 0.25.0 milestone Jun 28, 2019
@pep8speaks
Copy link

pep8speaks commented Jul 2, 2019

Hello @datapythonista! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-03 14:07:47 UTC

future (e.g. `title` which should be useful for any backend).

Currently, all the Matplotlib functions in pandas are accessed through
the selected backend. For example, `pandas.plotting.boxplot` (equivalent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should mention those right now, as there is still some discussion on whether to see them as part of the plotting backend interface or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two lines below I say that This is expected to change. Personally I think it adds more value to have this comment now, and remove it when/if that changes, than just not having this documented, or document the future behavior.

@jorisvandenbossche
Copy link
Member

@datapythonista how are the changes in this PR affecting the misc plots? (since the failing tests seem to be related to those)

@datapythonista
Copy link
Member Author

No idea, I think the test started to fail after some changes to the documentation and comments, or a merge, doesn't make any sense. The changes here shouldn't affect andrews_curves at all. Can't reproduce locally, so I need to debug in the CI.

@datapythonista
Copy link
Member Author

Failing example:

from pandas import DataFrame
from pandas.plotting import parallel_coordinates

df = DataFrame({"feat": [i for i in range(30)],
                "class": [2 for _ in range(10)] +
                         [3 for _ in range(10)] +
                         [1 for _ in range(10)]})
ax = parallel_coordinates(df, 'class', sort_labels=True)
polylines, labels = ax.get_legend_handles_labels()
color_label_tuples = \
    zip([polyline.get_color() for polyline in polylines], labels)
ordered_color_label_tuples = sorted(color_label_tuples,
                                    key=lambda x: x[1])
ordered_color_label_tuples

Locally returns:

[([0.417022004702574, 0.7203244934421581, 0.00011437481734488664], '1'),
 ([0.43599490214200376, 0.025926231827891333, 0.5496624778787091], '2'),
 ([0.5488135039273248, 0.7151893663724195, 0.6027633760716439], '3')]

Which is correct, but in the Mac build in the CI seems to return:

[([0.417022004702574, 0.7203244934421581, 0.00011437481734488664], '1'),
 ([0.43599490214200376, 0.025926231827891333, 0.5496624778787091], '2'),
 ([0.5488135039273248, 0.7151893663724195, 0.6027633760716439], '3'),
 ('#1f77b4', 'None')]

No idea where this mysterious last element comes from. If anyone with a Mac can reproduce it locally that can makes things easy.

The CI was green, and this started to fail after updating comments and the docs, and adding a warning for Series.plot with positional arguments. I'm temporary removing the warning to see if it is causing the error.

@jorisvandenbossche
Copy link
Member

I'm temporary removing the warning to see if it is causing the error.

It didn't help .. (the mac build still has the additional parallel_coordinates error).

I am fine with skipping that test on Mac to get this merged, and try to fix in a follow-up.

@datapythonista
Copy link
Member Author

Very weird, the other commit when this started to fail was just on comments and docs. Will continue tomorrow.

@datapythonista
Copy link
Member Author

Looks like this is the commit that is making the parallel_coordinates test fail: 16d1f88

I'll remove it for now if that's all right.

@jorisvandenbossche
Copy link
Member

Looks like this is the commit that is making the parallel_coordinates test fail: 16d1f88

It might be that putting it in the class fixes it. The setup method of that class imports matplotlib (which might influence if that is imported before pyplot).

@datapythonista
Copy link
Member Author

The warning is tested in another test anyway, it's redundant, so I think it's not a problem to remove that test.

@jreback you requested changes, can you have a look?

@jreback jreback merged commit 903a09c into pandas-dev:master Jul 3, 2019
@jreback
Copy link
Contributor

jreback commented Jul 3, 2019

thanks @datapythonista very nice!

prob want to have a look at built docs / doc-strings to make sure everything renderning correctly.

@TomAugspurger
Copy link
Contributor

Nice work @datapythonista! Anything else you want to get to for the release candidate? I'm hoping to tag it in ~7 hours.

@jorisvandenbossche
Copy link
Member

I just commented on the issue that I think we should decide about the scope of the plotting backend (include misc plotting functions or not) for 0.25.0, but that doesn't need to be done for the RC.

@datapythonista
Copy link
Member Author

will check the docs once they're built.

Thanks Tom, we started to work on hvplot, and I fixed here couple of minor things that were creating problems for the backends. I think everything should be all right on this now, but if there is anything should be very small and can wait after the RC.

The main thing pending on this affecting the release is what Joris just mentioned on what we delegate to the backends. We can discuss that in #26747.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Clean Visualization plotting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants