Aliases for column names #11723

Open
bbirand opened this issue Nov 30, 2015 · 12 comments
Labels: Enhancement; Indexing (related to indexing on series/frames, not to indexes themselves); Needs Discussion (requires discussion from core team before further action)

Comments

@bbirand

bbirand commented Nov 30, 2015

When I work with Pandas DataFrames, I prefer to keep the full column names for clarity. So when I print out the head, or use describe, I get a meaningful table. However, this also means I have column names like "Time of Sale" that become annoying to type out.

A nice compromise would be to have short "aliases" for column names. For instance, I could define the alias tos for the column above, perhaps like so:

df = pd.read_csv(...)
df.set_alias({'Time of Sale' : 'tos'})

Then, the __getattr__ method can look up aliases in addition to column names, so I can refer to that column simply as df.tos. But for all other purposes, the column's name is still the descriptive full name.

Would this make sense?

@jreback
Contributor

jreback commented Nov 30, 2015

related to #10349

I suppose this is possible. This would be fairly easy to implement, but would require a good number of test cases to ensure it is propagating correctly (e.g. this is analogous to the name attribute for Indexes in that it propagates when appropriate).

Further, it would require an audit of the indexing code for the alias to be a truly synonymous application (i.e. you can use the alias wherever you could use the actual label).

So while this is interesting, it would require a pull-request from the community to jump start it.

@jreback added the Indexing, API Design, and Difficulty Advanced labels Nov 30, 2015
@jreback added this to the Someday milestone Nov 30, 2015
@bbirand
Author

bbirand commented Dec 3, 2015

I'll have a go at this when I get a chance. It also occurred to me that these aliases may be useful when dealing with the DataFrame.query() method. Based on my trials, this function does not work when there are spaces in the column names (please correct me if I'm wrong, I couldn't get them to work).

@jreback
Contributor

jreback commented Dec 3, 2015

No; .query parses its expression string, so column names containing spaces cannot be used in it. This is noted in the documentation.
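
A minimal sketch of the workaround available in later pandas releases (0.25 and newer, as far as the documentation indicates): query() accepts column names wrapped in backticks. The data below is made up for illustration.

```python
import pandas as pd

# Illustrative data; the column label contains spaces on purpose.
df = pd.DataFrame({"Time of Sale": [9, 13, 17], "Price": [1.5, 2.0, 2.5]})

# Backticks let the expression refer to a label that is not a valid identifier.
late = df.query("`Time of Sale` > 12")
late_prices = late["Price"].tolist()
```

This only helps inside query() and eval(); attribute access such as df.tos still needs a valid identifier, which is what the alias proposal is about.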

@shoyer
Member

shoyer commented Dec 3, 2015

I'm not a big fan of including this feature in pandas itself, because it would make the pandas data model significantly more complex. Maybe this could be implemented in some sort of add-on package that wraps pandas DataFrames? Another option would be a DataFrame subclass.
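
As a rough illustration of the wrapper idea, here is a minimal sketch. AliasView and its attributes are hypothetical names invented for this example; nothing like it ships with pandas.

```python
import pandas as pd


class AliasView:
    """Hypothetical wrapper (not part of pandas) that resolves short aliases."""

    def __init__(self, frame, aliases):
        self._frame = frame
        self._aliases = dict(aliases)  # short alias -> full column label

    def __getattr__(self, name):
        # Python calls __getattr__ only after normal lookup on the wrapper fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._aliases:
            return self._frame[self._aliases[name]]
        # Everything else (head, describe, columns, ...) is delegated.
        return getattr(self._frame, name)


sales = AliasView(
    pd.DataFrame({"Time of Sale": [9, 17], "Price": [1.5, 2.0]}),
    {"tos": "Time of Sale"},
)
tos_total = sales.tos.sum()    # column reached through its alias
labels = list(sales.columns)   # the full names are left untouched
```

The pandas data model is unchanged, but delegated methods return plain DataFrames, so the aliases are lost after the first operation unless the result is wrapped again.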

@ijstokes

ijstokes commented Sep 7, 2016

There are certainly risks that could be introduced by adding aliasing, but wouldn't a straightforward strategy be to augment the logic in __getattr__(), which presumably already does some form of this? If an alias dictionary existed on the DataFrame, the lookup would be tried again whenever the requested attribute (not found using "the usual mechanism") had a key entry in the alias dictionary. E.g.

# 1. works today:
df['Time of Sale']

# 2. fails today:
df.time_of_sale

# 3. could work in the future:
df.alias = dict(time_of_sale='Time of Sale')
df.time_of_sale

Or maybe I misunderstand and 2. is already possible today. If so, could someone point me in the right direction toward documentation? I too would find this quite useful.
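
The fallback described above can be sketched with a DataFrame subclass. This is an assumption-laden illustration, not pandas behaviour: AliasFrame and the alias attribute are hypothetical, and only _metadata and _constructor are pandas' documented subclassing hooks.

```python
import pandas as pd


class AliasFrame(pd.DataFrame):
    """Hypothetical subclass: `alias` maps attribute names to column labels."""

    # Registering the attribute lets it be set without pandas treating it
    # as an attempt to create a column.
    _metadata = ["alias"]

    @property
    def _constructor(self):
        return AliasFrame

    def __getattr__(self, name):
        try:
            # The usual mechanism: real attributes and identifier-like columns.
            return super().__getattr__(name)
        except AttributeError:
            # Try again through the alias dictionary, if one was assigned.
            alias = self.__dict__.get("alias") or {}
            if name in alias:
                return self[alias[name]]
            raise


df = AliasFrame({"Time of Sale": [9, 17], "Price": [1.5, 2.0]})
df.alias = dict(time_of_sale="Time of Sale")
same = df.time_of_sale.equals(df["Time of Sale"])
```

Whether the dictionary survives every operation is exactly the propagation question raised earlier in the thread; this sketch does not audit that.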

@bbirand
Author

bbirand commented Sep 7, 2016

> Or maybe I misunderstand and 2. is already possible today. If so, could someone point me in the right direction toward documentation? I too would find this quite useful.

In order to do 2., you would have to rename the column, possibly using http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html

And then, when you'd like to print or plot it, you'd rename it back to the original version.

I too think this would still be a good addition for interactive work. To make things even more interesting, I would alias "Time of Sale" to "tos", so I can work with the data as df.tos, but then see the full name when plotted.
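
The rename round trip described above might look like the following sketch. The data and the aliases mapping are made up for illustration; DataFrame.rename(columns=...) is the documented call.

```python
import pandas as pd

# Full label -> short working name (illustrative mapping).
aliases = {"Time of Sale": "tos"}

df = pd.DataFrame({"Time of Sale": [9, 17], "Price": [1.5, 2.0]})

# Work with the short names...
short = df.rename(columns=aliases)
mean_tos = short.tos.mean()

# ...then invert the mapping to restore the descriptive labels for output.
restored = short.rename(columns={v: k for k, v in aliases.items()})
```

The mapping has to be carried around by hand and applied twice, which is the inconvenience the alias proposal tries to remove.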

@KeithWM

KeithWM commented Sep 14, 2018

I'd also like to see such a feature. For me the favourite use case would be to have nice, legible axis labels (with units) in seaborn plots. I know one can manually set the axis labels, but I find this error-prone and too verbose, and it leads to code duplication.

If you ask me, the easier way would be to keep the current name in the role of an alias as @bbirand proposes, and to add some other field for a longer name, which can default to the "normal" name if none is explicitly given.

@ajeet2808

Any update on this feature?

@obarak

obarak commented Oct 1, 2020

We need an equivalent for "SELECT max(column1)*0.25+ 0.44*sum(column2) as 'calculated_column' from TABLE group by column3,column5"

@luisfelipe18 - Actually, for aggregation you already have aliasing in Pandas, see here (I'd recommend reading through the entire post).

The current issue refers to aliasing existing columns, regardless of aggregation.
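
For reference, a sketch of how the SQL above could be expressed with named aggregation (the agg(new_name=(column, func)) form). The sample rows and the intermediate names max1 and sum2 are assumptions made for this example.

```python
import pandas as pd

# Made-up rows using the column names from the SQL statement.
df = pd.DataFrame({
    "column1": [1.0, 4.0, 2.0],
    "column2": [10.0, 20.0, 30.0],
    "column3": ["a", "a", "b"],
    "column5": ["x", "x", "y"],
})

# Named aggregation gives each aggregate its own output label.
agg = df.groupby(["column3", "column5"]).agg(
    max1=("column1", "max"),
    sum2=("column2", "sum"),
)

# Combine the aggregates under the label used in the SQL "as" clause.
agg["calculated_column"] = agg["max1"] * 0.25 + 0.44 * agg["sum2"]
result = agg[["calculated_column"]].reset_index()
```

This names the output of an aggregation; it does not give an existing column a second name, which is the subject of this issue.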

@TomAugspurger
Contributor

IMO, we shouldn't use this in pandas itself. Indexing is complicated enough without aliases.

We'd be better served by adopting / defining a convention (similar to how xarray uses CF conventions) for mapping column names to descriptive names. These could be stored in the DataFrame.attrs dict which (should) propagate through operations. Then downstream libraries (e.g. plotting libraries, libraries for generating tables for presentation) can use the descriptive names.
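
A minimal sketch of what such a convention could look like. The "long_names" key and the long_name helper are invented for this example; only DataFrame.attrs itself is an existing pandas feature.

```python
import pandas as pd

df = pd.DataFrame({"tos": [9, 17], "throughput": [1.5, 2.0]})

# "long_names" is a made-up key for this sketch, not an established convention.
df.attrs["long_names"] = {
    "tos": "Time of Sale",
    "throughput": "Data throughput [Gb/s]",
}


def long_name(frame, column):
    """Return the descriptive name for `column`, or the label itself if unset."""
    return frame.attrs.get("long_names", {}).get(column, column)


# A presentation layer could relabel only at display time.
pretty = df.rename(columns=lambda c: long_name(df, c))
```

A plotting library that agreed on the same key could call something like long_name() when building axis labels, while code keeps using the short column names.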

@mroeschke added the Needs Discussion label and removed the API Design label Apr 21, 2021
@adavidzh

I'd like to echo @KeithWM's point about there being a "long form" for a column's contents, i.e. something that seaborn can use in axis labels. This would not necessarily be an alias, but rather a human-readable form with full description (e.g. involving LaTeX) and units. This sort of thing comes up over and over again in making scientific plots; I found this thread because I want a column named engine_data_total_throughput and would like the axis label to be $\sum$ data throughput [Gb/s] without having to specify it over and over again when plotting.

I understand that this "long form" is not a general scheme for creating aliases (that is a many-to-one correspondence), so it could make sense to work out what the main use case is and, perhaps, open a new thread.

@Krzmbrzl

Krzmbrzl commented Jun 2, 2022

If I understood this issue correctly, it is about the intention to retain the (potentially long and verbose) original column names for everything but for accessing the columns in code.

As I see it, we can already get exactly that without any modification to pandas at all: Just define some constants and then use those to access your columns in your code:

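import pandas as pd

class Columns:
    # Short attribute names mapped to the full, descriptive column labels.
    colA = "My tediously long name for column A"
    colB = "Yet another long column name"
    # Raw string, so the backslash in \emph is not read as an escape sequence.
    colC = r"Some column with $\emph{special}$ symbols in it"

df = pd.read_csv(...)

print(df[Columns.colA])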

Using a separate class to create a namespace for the column constants is of course optional and you can omit it if you prefer.

If I did not miss anything, this seems to fit all scenarios in which one would want to use aliases, unless you are trying to alias some columns to allow something like column-duck-typing. But I guess that would probably only get messy really quickly anyway 🤔

@mroeschke removed this from the Someday milestone Oct 13, 2022