API: Define API for pandas plotting backends #26747
Comments
I think we should keep things like autocorrelation out of the swappable backend API. I think we’ve left things like df.boxplot and hist around because they have slightly different behavior than the .plot API. I wouldn’t recommend making them part of the backend API. |
Here’s my start on a proposed backend API from a few months ago: TomAugspurger@b07aba2 |
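I think it's worth mentioning that at least hvplot (I didn't check the rest) already provides functions like andrews_curves, scatter_matrix, lag_plot, ... Maybe, if we don't want to force all backends to implement those, we can check whether the selected backend implements them, and default to the matplotlib plots otherwise? I assumed boxplot and hist behaved exactly the same, and were just shortcuts, e.g. Series.hist() for Series.plot.hist(). The "shortcut" shows the plot grid, but other than that I haven't seen any difference. |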
IMO, the main value of this option is the `.plot` namespace.
If users want hvplot's Andrew's curve plot, they should import the function
from hvplot and pass the dataframe there.
…On Sun, Jun 9, 2019 at 7:17 AM Marc Garcia ***@***.***> wrote:
I think it's worth mentioning that at least hvplot (didn't check the
rest) does already provide the functions like andrews_curves,
scatter_matrix, lag_plot,...
May be if we don't want to force all backends to implement those, we can
check if the selected backend implements them, and default to the
matplotlib plots?
I assumed boxplot and hist behaved exactly the same, but just had
shortcuts Series.hist() for Series.plot.hist(). The "shortcut" shows the
plot grid, but other than that I haven't seen any difference.
|
I think that makes sense, but if we do that, I think we should move them to @TomAugspurger I need to check in more detail, but I think the API you implemented in TomAugspurger@b07aba2 is the one that makes more sense. I'll work on it once I finish #26753. I'll also experiment on whether it's feasible to move |
What's the intention here regarding extra kwargs passed to plotting functions? Should additional backends attempt to duplicate the functionality of all matplotlib-style plot customizations, or should they allow keywords to be passed that correspond to those used by the particular backend? The first option would be nice in theory, but would require every non-matplotlib plotting backend to essentially implement its own matplotlib conversion layer with a long tail of incompatibilities that would essentially never be complete (speaking from experience as someone who tried to create mpld3 some years back). The second option is not as nice from the perspective of interchangeability, but would allow other backends to be added with a more reasonable set of expectations. |
I think that's up to the backend on what they do with them. Achieving 100%
compatibility across backends isn't really feasible,
since the return type isn't going to be a matplotlib Axes anymore. And if
we aren't compatible on the return type, I don't think backends
should bend over backwards to try to handle every possible keyword argument.
So I think pandas should document that `**kwargs` will be passed through to
the underlying plotting engine, and they can do whatever they please with
them.
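A minimal sketch of the pass-through idea, assuming a backend exposes a single `plot(data, kind, **kwargs)` function. That signature and every name below are hypothetical stand-ins rather than an agreed pandas API; the point is only that extra keywords reach the plotting engine untouched.

```python
# Hypothetical backend module: all names here are illustrative assumptions.

def _render_line(data, **options):
    # A real backend would build a chart object here; this stub only records
    # what it received so the pass-through behaviour is visible.
    return {"kind": "line", "points": list(data), "options": options}


def _render_bar(data, **options):
    return {"kind": "bar", "points": list(data), "options": options}


_RENDERERS = {"line": _render_line, "bar": _render_bar}


def plot(data, kind="line", **kwargs):
    """Dispatch on ``kind`` and hand every extra keyword to the renderer."""
    try:
        renderer = _RENDERERS[kind]
    except KeyError:
        raise NotImplementedError(
            f"this backend does not support kind={kind!r}"
        ) from None
    # No translation of matplotlib-style keywords is attempted.
    return renderer(data, **kwargs)


chart = plot([1, 2, 3], kind="bar", color="red")
```

Under this design, a keyword such as `color` means whatever the selected backend decides it means, so the same call may be styled differently (or rejected) by different backends.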
|
I'm sorry if this is a stupid question, but if you define a plotting "API" which is basically a group of canned plots, wouldn't every backend produce more or less the same output? What new capability is this meant to enable? Something like a pandas-to-Vega exporter, perhaps? |
I don't think it's correct to say that every backend produces more or less the same output. For example, matplotlib is really good at static charts, but not great at producing portable interactive charts. On the other hand, bokeh, altair, et al. are great for interactive charts, but aren't quite as mature as matplotlib for static charts. Being able to produce both with the same API would be a big win. |
and also pins Matplotlib down even more than we already are API wise. I think it makes sense for pandas to declare what style knobs it wants to expose and expect the backend implementations to sort out what that means. This may mean not blindly passing |
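Thanks @jakevdp, yes, supporting interactive charts is a good goal. Before things go too far down this particular avenue, here's an alternative solution. Instead of proclaiming the pandas plotting API to now be a specification, and asking viz packages to implement it specifically, why not generate an intermediate representation (like a Vega JSON file) of the plot, and encourage backends to target that as their input? Advantages include:
Not being tied to the expressive power of a reified pandas API, which wasn't designed as a specification.
The work done by plotting packages to support pandas becomes available to other pydata packages which generate IR.
Promoting a common language for interchanging visualizations in the pydata space,
which makes new tools more powerful because they are more widely applicable,
which makes the effort of writing them more reasonable. Basically, improved incentives.
Vega/Vega-Lite, as a modern, established, open, JSON-based viz specification language, with several man-years put into its design and implementation and existing tools built around it, seems like it was created expressly for this purpose (just please don't). You know, frontend->IR->backend, like compilers are designed. |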
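To make the intermediate-representation idea concrete, here is a rough sketch that turns a list of rows into a Vega-Lite spec serialized as JSON. The helper name `to_vega_lite` and the sample data are made up for illustration, and both axes are assumed to be quantitative; nothing like this was proposed as pandas code in this thread.

```python
import json


def to_vega_lite(records, x, y, mark="line"):
    """Build a minimal Vega-Lite spec (as a dict) from a list of row dicts."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v3.json",
        "data": {"values": list(records)},
        "mark": mark,
        "encoding": {
            # Both fields are assumed to be numeric in this sketch.
            "x": {"field": x, "type": "quantitative"},
            "y": {"field": y, "type": "quantitative"},
        },
    }


# Illustrative data only.
rows = [{"t": 0, "value": 1.0}, {"t": 1, "value": 2.5}, {"t": 2, "value": 2.0}]
spec = to_vega_lite(rows, x="t", y="value")
spec_json = json.dumps(spec, indent=2)  # the IR a renderer would consume
```

Any renderer that understands Vega-Lite could consume `spec_json`, which is the frontend->IR->backend split described above.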
At least three packages already implement the API. All pandas needs to do is offer an option for changing the backend and document its use, which seems like a good bang for our buck.
… On Jun 15, 2019, at 16:28, pilkibun ***@***.***> wrote:
For example, matplotlib is really good at static charts, but not great at producing portable interactive charts.
Thanks @jakevdp, yes, supporting interactive charts is a good goal.
Before things go too far down this particular avenue, here's an alternative solution.
Instead of proclaiming the pandas plotting API to now be a specification, and asking viz packages to implement it specifically, why not generate an intermediate representation (like a vega JSON file) of the plot, and encourage backends to target that as their input.
Advantages include:
Not being tied to the expressive power of a reified pandas API, which wasn't designed as a specification.
The work done by plotting packages to support pandas, becomes available to other pydata packages which generate IR.
Promoting a common language for interchange visualization in the pydata space
Which makes new tool more powerful because more widely applicable
Which makes the effort of writing them more reasonable. Basically, improved incentives.
Vega/Vega-lite, as a modern, established, open, and JSON-based viz specification language, several man-years put it into its design and implementation, and existing tools built around it, seems like it was created expressly for this purpose. (just please don't).
You know, frontend->IR->backend, like compilers are designed.
|
We now merged #26753, and the plotting backend can be changed from pandas. When we split the matplotlib code we left the But I see that what backends did was to reimplement those classes. So, currently we expect the backends to have one class per plot (e.g. What I think makes sense, at least as a first version, is that we implement the API as If that makes sense for everyone, I'll start by creating the abstract class and adapting the matplotlib backend we have in pandas, and once this is done, we adapt Thoughts? |
I think that on balance this approach will be cleaner. I can't speak to other plotting backends but at least in hvPlot different plot methods share quite a bit of code, e.g. |
Just to make sure I understand, when you say I think I'm +1 on letting backends define plot types not in pandas. But I won't probably implement it right now. We're planning to release pandas in around one week. And I think this will require a bit more thinking than blindly calling the methods of backends if user provides |
Yes, that's right. More concretely I'd prefer not to have to do this kind of thing:
Sorry if that was not clear. |
Very much in favor of a simpler API that is publicly exposed as the single function instead of the classes as now proposed in #27009. General question/remark on how the backend option now works. Assume I am the EDIT: I see that actually something similar was discussed in the PR #26753 |
If we make the decision that pandas doesn't know/limit which backends can be used (which I'm strongly in favor of making), we need to decide on how/what to call in the backends. What has been implemented and proposed in the PR I'm working on is that the option We can consider other options, like if the option is Another discussion is if we want to force the users to import the backends manually:
    import pandas
    import hvplot
    pandas.Series([1, 2, 3]).plot()
If we do that, the modules can register themselves, they can also register aliases (so And while it could be nice to do
|
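As an illustration of the "modules register themselves on import" option mentioned above, a registry could look roughly like the sketch below. The functions `register_backend` and `get_backend`, and the alias used in the usage lines, are hypothetical; pandas does not provide them.

```python
# Hypothetical registry; none of these names exist in pandas.
_BACKENDS = {}


def register_backend(name, module, aliases=()):
    """Record ``module`` under ``name`` and under each alias."""
    for key in (name, *aliases):
        _BACKENDS[key] = module


def get_backend(name):
    """Return a registered backend, or fail with a hint about importing it."""
    try:
        return _BACKENDS[name]
    except KeyError:
        raise ValueError(
            f"Unknown plotting backend {name!r}; has the package been imported?"
        ) from None


# A backend package would run something like this at import time.
fake_module = object()  # stand-in for the real backend module
register_backend("hvplot.pandas", fake_module, aliases=("hvplot",))
```

The cost of this design is the one raised in the comment: nothing is registered until the user imports the backend package.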
@datapythonista thanks for the detailed answer. I am fine with keeping it now as is for the initial release (possibility for alias can always be added later).
+1, I would also not expose all the additional plotting functions through the backend. But about moving them to |
If we use entrypoints to register extensions, then this does not have to be the case: having the package installed on the system will register the entrypoint and make it visible to pandas. For example, this is what Altair uses to detect various renderers that the user might have installed. |
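A rough sketch of entry-point discovery using the standard library's `importlib.metadata`. The group name `pandas_plotting_backends` is an assumption here, and the helper `discover_backends` is hypothetical. The branch on `select` is there because the return type of `entry_points()` differs between Python versions.

```python
from importlib.metadata import entry_points


def discover_backends(group="pandas_plotting_backends"):
    """Map entry-point names in ``group`` to their (not yet loaded) entry points."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Pythons return an object with a ``select`` method.
        selected = eps.select(group=group)
    else:
        # Older Pythons return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


available = discover_backends()
# Calling ``available[name].load()`` would import the backend lazily.
```

Because discovery reads installed package metadata, a backend becomes visible as soon as its package is installed, without the user importing it first.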
Also, for what it's worth, once this goes in I think I'd probably deprecate pdvega and move the relevant code over to a new package named |
Just to explain a bit why things are the way they are now. It's relevant because I'm not quite sure how to implement the changes you propose, or not exposing things in general. Not saying here that it can't be done in a different way, it's just to enrich the discussion. The first decision was to move all the code using matplotlib to a separate module ( Everything that was public in The public API for the user has no change at all, users still have the same methods and functions available. We're not exposing anything new. As I understand things, if we decide that an existing plot like If we want to be nice with backend developers and not force them to implement plots that may not be so mainstream (I guess that's one of the reasons?), maybe we can default to the matplotlib backend for anything that is missing in the selected backend? About delegating any unknown kind of plot to the backend, I'm -1 on doing it right now. Surely it can make sense eventually. But I think having several plot kinds documented in pandas, and having extra ones that we don't document, feels a bit hacky. I think it can wait for the next version, after we have feedback on how having different backends works for users, and we have more time to discuss and analyze in detail. |
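The "default to matplotlib for anything missing" idea can be sketched as an attribute lookup with a fallback. The helper `resolve_plot_function` and the two stand-in backend objects are hypothetical and only illustrate the lookup order.

```python
from types import SimpleNamespace


def resolve_plot_function(name, selected, default):
    """Return ``selected.<name>`` if it exists, else ``default.<name>``."""
    func = getattr(selected, name, None)
    if func is None:
        # Raises AttributeError if the default backend lacks it too.
        func = getattr(default, name)
    return func


# Stand-ins for real backend modules (assumptions, not actual backends).
matplotlib_like = SimpleNamespace(
    plot=lambda data: "matplotlib plot",
    andrews_curves=lambda data: "matplotlib andrews_curves",
)
other_backend = SimpleNamespace(plot=lambda data: "other plot")

chosen = resolve_plot_function("andrews_curves", other_backend, matplotlib_like)
```

One consequence, raised later in the discussion, is that a user who selected another backend could still silently get a matplotlib figure for the functions that backend does not provide.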
I don't think we would be making the user's life harder. Instead of importing it from pandas.plotting, if they want a hvplot's version, they can simply import it from there. Which is something not possible for the DataFrame.plot method, as that is defined on the object. For me that is the main reason for the plotting backend.
For me it is not about being nice or that implementing everything would be required (it is totally fine if a backend does not support all plotting types, IMO), but rather an unnecessary expansion of the plotting backend API, which also ties ourselves to it. Any other opinions about this? |
Agreed with @jorisvandenbossche. Just to make sure this isn't lost, I think @jakevdp's suggestion to use setuptools' entry points is worth considering to solve the import order registration issue: #26747 (comment) |
@jorisvandenbossche how would you change that in the code? Instead of getting the backend defined in the settings for those methods, get the matplotlib backend? I think this is wrong conceptually, but I'm ok with it if there is agreement. Anything that reverts the decoupling of the matplotlib code from the rest I'm -1. Since you mention that in a pandas from scratch we wouldn't include those plots, should we deprecate them? I'm +1 on moving all the plots that are not methods of |
I would deprecate the non-standard plots in pandas |
Joris is offline for a bit. I think when we’ve discussed this in the past, his and my position on these is to just leave them untouched until they become a maintenance burden. |
Just so we are on the same page, this is a summary of what we have, and my understanding of the state of the discussion: Used as methods of
Other plots (under discussion whether they should be deprecated, delegated to the matplotlib backend, or delegated to the selected backend):
Other public stuff in
For the For the converters and the other stuff, what we have now is surely not correct, since
|
How do you find the "other plots" to be a maintenance burden? Looking at the history for the "misc" plots: https://github.com/pandas-dev/pandas/commits/0.24.x/pandas/plotting/_misc.py, we have ~10-15 commits since 2017. The majority are global cleanups applied to the entire codebase (so a small marginal burden). I only see 1-2 commits changing docs, and no commits changing functionality.
I don't think this would make sense. There are matplotlib-specific converters that we've written for matplotlib. Other backends won't have them. It probably shouldn't be part of the backend API. |
I didn't mean those plots are a burden because of the amount of maintenance we've had in the last months or years, but because of the problem they now pose for having a consistent and intuitive API for users, and a good modular code design for us. Regarding the converters, I don't know if backend authors may want to implement the equivalent of those for matplotlib in some cases. But it doesn't seem a problem if they don't, and those functions do nothing for some or all of the other backends. I'm also ok with option 2, but I don't find it as neat. |
They're already somewhat inconsistent with DataFrame.plot, though. The name "misc" implies that :) Does having a swappable backend make that any worse? To the extent that it's worth the churn on user code? I don't think so.
I don't think so. The point of those converters is to teach matplotlib about pandas objects. Libraries implementing the backend won't have that problem, since they already depend on pandas. |
Personally I think about it mainly in terms of managing complexity. Having a standard plotting API that is delegated to the backend via a single API is easy to understand, and to maintain. Users and maintainers just need to learn that there is a Having in the backend a set of heterogeneous plots, that besides not following the same API, use a backend, but not the one selected for the other plots, but the Matplotlib one, adds too much complexity for everyone IMHO. And the cost of moving them seems small to me, my guess is that not a big proportion of our users even know about those plots. And for the ones who do, they'll just need to install an extra conda package and use To me seems a lot to win, at a small cost, but of course it's just an opinion. |
Can we document that the swappable backend is just for Series/DataFrame.plot? That seems like a pretty simple rule. |
Feels like a hack that adds unnecessary complexity to me; I don't think explaining it in the documentation makes it less counter-intuitive. But anyway, not a big deal. If that's the preferred option, this is how I'd implement it, at least the increase in code complexity is minimal: #27432 |
Looking more closely at this now: if I understand correctly, the way that the plotting backend will be set is using:
pd.set_option('plotting.backend', 'name_of_module')
My understanding, then, is that if I want to make the following work:
pd.set_option('plotting.backend', 'altair')
then I will need the top-level altair package to define all the functions in https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py. I would prefer not to pollute Altair's top-level namespace with all these additional APIs that are not meant to actually be used by Altair users. In fact, I would prefer for altair's pandas extension to live in a separate package, so it's not tied to the release cadence of Altair itself. If I understand correctly, this means that there's no way for me to make pd.set_option('plotting.backend', 'altair') work correctly without hard-coding the altair package in pandas the way matplotlib is currently hard-coded (pandas/plotting/_core.py, lines 1550 to 1551 in f1b9fc1), is that correct?
If so, I would strongly advise rethinking the means by which this API is exposed in third-party packages. My suggested solution would be to adopt an entrypoint-based framework that would let me, for example, create a package like altair_pandas that registers the altair entrypoint. Otherwise users will forever be confused that pd.set_option('plotting.backend', 'altair') doesn't do what they expect. |
Agreed. I think entry points are the way to go. I'll prototype something.
…On Fri, Jul 19, 2019 at 1:16 PM Jake Vanderplas ***@***.***> wrote:
Looking more closely at this now: if I understand correctly, the way that
the plotting backend will be set is using:
pd.set_option('plotting.backend', 'name_of_module')
My understanding, then, is that if I want to make the following work:
pd.set_option('plotting.backend', 'altair')
then I will need the top-level altair package to define all the functions
in
https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py.
I would prefer not to pollute Altair's top-level namespace with all these
additional APIs. In fact, I would prefer for altair's pandas extension to
live in a separate package, so it's not tied to the release cadence of
Altair itself.
If I understand correctly, this means that there's no way for me to make pd.set_option('plotting.backend',
'altair') work correctly without hard-coding the altair package in pandas
the way matplotlib is currently hard-coded, is that correct?
If so, I would strongly advise rethinking how this is enabled by
third-party packages. In particular, adopting an entrypoint-based framework
would let me create a package like altair_pandas that registers the altair
entrypoint. Otherwise users will forever be confused that pd.set_option('plotting.backend',
'altair') doesn't do what they expect.
|
There was a point in time where what you say was mostly correct, but that's not the case anymore. If you want If creating the You can surely change the option yourself once users do an One last thing is to consider that we could possibly have more than one pandas backend implemented for altair (or any other visualization library). So, for me, that the name of the backend is not |
Here's an entry-points based implementation

diff --git a/pandas/plotting/_core.py b/pandas/plotting/_core.py
index 0610780ed..c8ac12901 100644
--- a/pandas/plotting/_core.py
+++ b/pandas/plotting/_core.py
@@ -1532,8 +1532,10 @@ class PlotAccessor(PandasObject):
         return self(kind="hexbin", x=x, y=y, C=C, **kwargs)
+_backends = {}
-def _get_plot_backend(backend=None):
+
+def _get_plot_backend(backend="matplotlib"):
     """
     Return the plotting backend to use (e.g. `pandas.plotting._matplotlib`).
@@ -1546,7 +1548,14 @@ def _get_plot_backend(backend=None):
     The backend is imported lazily, as matplotlib is a soft dependency, and
     pandas can be used without it being installed.
     """
-    backend_str = backend or pandas.get_option("plotting.backend")
-    if backend_str == "matplotlib":
-        backend_str = "pandas.plotting._matplotlib"
-    return importlib.import_module(backend_str)
+    import pkg_resources  # slow import. Delay
+    if backend in _backends:
+        return _backends[backend]
+
+    for entry_point in pkg_resources.iter_entry_points("pandas_plotting_backends"):
+        _backends[entry_point.name] = entry_point.load()
+
+    try:
+        return _backends[backend]
+    except KeyError:
+        raise ValueError("No backend {}".format(backend))
diff --git a/setup.py b/setup.py
index 53e12da53..d2c6b18b8 100755
--- a/setup.py
+++ b/setup.py
@@ -830,5 +830,10 @@ setup(
             "hypothesis>=3.58",
         ]
     },
+    entry_points={
+        "pandas_plotting_backends": [
+            "matplotlib = pandas:plotting._matplotlib",
+        ],
+    },
     **setuptools_kwargs
 )

I think it's quite nice. 3rd party packages will modify their setup.py (or pyproject.toml) to include something like
I like that it breaks the tight coupling between naming and implementation. |
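As a hedged illustration of what a third-party registration might contain, the sketch below shows an `entry_points` mapping for a hypothetical `altair_pandas` package, plus a small helper that splits a `name = module:attr` spec into its parts. The group name is taken from the diff above; the helper `parse_entry_point` is made up for this example and is not part of setuptools or pandas.

```python
# What a hypothetical third-party package might pass to setuptools, e.g.
# setup(name="altair_pandas", entry_points=ENTRY_POINTS, ...) in its setup.py.
ENTRY_POINTS = {
    "pandas_plotting_backends": [
        # The user-facing backend name ("altair") is decoupled from the
        # module that implements it ("altair_pandas").
        "altair = altair_pandas",
    ],
}


def parse_entry_point(spec):
    """Split 'name = module[:attr]' into (name, module, attr or None)."""
    name, _, target = (part.strip() for part in spec.partition("="))
    module, _, attr = target.partition(":")
    return name, module, attr or None


parsed = [parse_entry_point(s) for s in ENTRY_POINTS["pandas_plotting_backends"]]
```

With a registration like this, the backend name a user types and the package that implements it no longer have to match.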
I haven't worked with entry points; are they like a global registry of the Python environment? Being new to them I don't love the idea, but I guess that would be a reasonable way to do it then. I'd still like to have both options, so if the user does |
I haven't used them either, but I think that's the idea. From what I understand, they're from setuptools (but packages like flit hook into them?). So they aren't part of the standard library, but setuptools is what everyone uses anyway.
Falling back to |
Libraries, including pandas, register backends via entrypoints. xref pandas-dev#26747
In #26414 we split the pandas plotting module into a general plotting framework able to call different backends, and the current matplotlib backend. The idea is that other backends can be implemented in a simpler way, and be used with a common API by pandas users.
The API defined by the current matplotlib backend includes the objects listed next, but this API can probably be simplified. Here is the list with questions/proposals:
Non-controversial methods to keep in the API (they provide the `Series.plot(kind='line')`... functionality):
Plotting functions provided in pandas (e.g. `pandas.plotting.andrews_curves(df)`):
Should those be part of the API, and should other backends also implement them? Would it make sense to convert them to the `.plot` format (e.g. `DataFrame.plot(kind='autocorrelation')`...)? Does it make sense to keep them out of the API, or move them to a third-party module?
Redundant methods that can possibly be removed:
In the case of `boxplot`, we currently have several ways of generating a plot (calling mainly the same code):
1. `DataFrame.plot.box()`
2. `DataFrame.plot(kind='box')`
3. `DataFrame.boxplot()`
4. `pandas.plotting.boxplot(df)`
Personally, I'd deprecate number 4, and for number 3, deprecate or at least not require a separate `boxplot_frame` method in the backend, but try to reuse `BoxPlot` (the comments on number 3 also apply to `hist`).
For `boxplot_frame_groupby`, I didn't check in detail, but I'm not sure whether `BoxPlot` could be reused for this.
could be reused for this?Functions to register converters:
Do those make sense for other backends?
Deprecated in pandas 0.23, to be removed:
To see what each of these functions does in practice, this notebook by @liirusuk may be useful: https://github.com/python-sprints/pandas_plotting_library/blob/master/AllPlottingExamples.ipynb
CC: @pandas-dev/pandas-core @tacaswell, @jakevdp, @philippjfr, @PatrikHlobil