Stop printing row numbers in show(io, df)? #864

Closed
ghost opened this issue Aug 25, 2015 · 26 comments

@ghost

ghost commented Aug 25, 2015

I was confused about the column called "Row" that is printed in all DataFrames since it doesn't keep track of indexes after slicing. For example:

julia> df = DataFrame(x=1:100)

julia> df[20:50,:]
31x1 DataFrame
| Row | x  |
|-----|----|
| 1   | 20 |
| 2   | 21 |
| 3   | 22 |
| 4   | 23 |
| 5   | 24 |
| 6   | 25 |
| 7   | 26 |
| 8   | 27 |
⋮
| 23  | 42 |
| 24  | 43 |
| 25  | 44 |
| 26  | 45 |
| 27  | 46 |
| 28  | 47 |
| 29  | 48 |
| 30  | 49 |
| 31  | 50 |

Because this "Row" column is printed, the index starts again at 1 instead of 20. There was an issue (#187) a couple of years ago, but I think the idea was to not rely on indexes and use them only for speed. Since it has been a while, I'd like to know what the current consensus is regarding this issue.

@johnmyleswhite
Contributor

I prefer printing the row numbers since you're allowed to index using them at the moment, but I could be convinced to change that if enough of the current committers agree that the row numbers are a problem.

If we ever remove the row numbers (which I'd like to do), this issue would just go away.

I've retitled this issue since the term "index" is misleading.

@johnmyleswhite changed the title from "Status of adding an index" to "Stop printing row numbers in show(io, df)?" Aug 25, 2015
@ghost
Author

ghost commented Aug 25, 2015

Well, the row numbers are useful for simple observations, but I think they become redundant when the user needs a real index to keep track of some parts of the data:

julia> DataFrame(index=1:100, y=2*(1:100))[50:70,:]
21x2 DataFrame
| Row | index | y   |
|-----|-------|-----|
| 1   | 50    | 100 |
| 2   | 51    | 102 |
| 3   | 52    | 104 |
| 4   | 53    | 106 |
| 5   | 54    | 108 |
| 6   | 55    | 110 |
| 7   | 56    | 112 |
⋮
| 14  | 63    | 126 |
| 15  | 64    | 128 |
| 16  | 65    | 130 |
| 17  | 66    | 132 |
| 18  | 67    | 134 |
| 19  | 68    | 136 |
| 20  | 69    | 138 |
| 21  | 70    | 140 |

For the front-ends, the printed row numbers are very cumbersome. When the user defines a DataFrame with a single column, its HTML representation is actually a two column DataFrame. Usually, this is not a big deal because most front-ends are only displaying the data. However, I noticed the difference when I wanted to receive user-defined DataFrames using the datatables library with virtual scrolling and there was a mismatch in the number of columns.

Furthermore, I think people who use Python and R frequently want real row numbers. In any case, if we remove the row numbers, what would be offered as a replacement? A real column working as an index?

@johnmyleswhite
Contributor

We'll deal with those kinds of user interface issues when the appropriate time comes. Right now, progress on DataFrames is blocked on finalizing the NullableArrays package in time for Julia 0.4.

@ghost
Author

ghost commented Aug 25, 2015

That's okay. I just wanted to know whether this decision had already been made or whether some of these issues were likely to get fixed. Thanks.

@mkborregaard
Contributor

I think that, for printing, a lot could be gained by just not printing the vertical bars to the left of the row numbers, nor the identifier 'Row'. That would look like:

julia> DataFrame(index=1:100, y=2*(1:100))[50:70,:]
21x2 DataFrame
       index | y   |
     |-------|-----|
 1   | 50    | 100 |
 2   | 51    | 102 |
 3   | 52    | 104 |
 4   | 53    | 106 |
 5   | 54    | 108 |
 6   | 55    | 110 |
⋮

 20  | 69    | 138 |
 21  | 70    | 140 |

@tshort
Contributor

tshort commented Sep 8, 2015

Good idea, @mkborregaard.

@alyst
Contributor

alyst commented Sep 9, 2015

+1 for @mkborregaard's suggestion.
In text mode, for wide data frames that do not fit on one screen, row numbers are essential for tracking the same record across multiple pages, whereas an index column would be printed only once.

@milktrader

+1 for @mkborregaard too

@ghost
Author

ghost commented Sep 9, 2015

I also like @mkborregaard's idea but I'm not sure that addresses the use of a real row of numbers (instead of using a printed representation).

@alyst I'm not sure if this is what you mean, but pandas repeats the index column even if there are too many columns to fit on the screen:

In [6]: X = np.random.random((5, 10))

In [7]: pd.DataFrame(X)
Out[7]: 
          0         1         2         3         4         5         6  \
0  0.788095  0.200569  0.503817  0.951415  0.394964  0.574591  0.095610   
1  0.252333  0.233394  0.400834  0.763205  0.651176  0.308817  0.830079   
2  0.168796  0.637577  0.362691  0.751329  0.260100  0.336644  0.135710   
3  0.028374  0.417096  0.049947  0.969493  0.644621  0.992500  0.796625   
4  0.217272  0.996964  0.822133  0.961850  0.002511  0.327640  0.621592   

          7         8         9  
0  0.531817  0.250808  0.897373  
1  0.034938  0.312996  0.788211  
2  0.293733  0.383446  0.462809  
3  0.115683  0.577399  0.811903  
4  0.446433  0.519582  0.848727  

@alyst
Contributor

alyst commented Sep 9, 2015

@rsmith31415 Yes, thank you, that's what I meant.
One potential solution could be to add a frozen_column= parameter to show(io, df). If specified, it would print the contents of the given column(s?) (e.g. a real record ID) on each page instead of row numbers.
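
To make the idea concrete, here is a minimal sketch in plain Julia. Note that show_paged and its cols_per_page keyword are hypothetical names made up for illustration; only frozen_column comes from the proposal above, and nothing here is an existing DataFrames.jl API. The table is modelled as a NamedTuple of vectors to keep the example self-contained.

```julia
# Hypothetical helper, NOT part of DataFrames.jl: prints a table in column
# chunks ("pages") and repeats one frozen column at the left of every page.
function show_paged(io::IO, tbl::NamedTuple; frozen_column::Symbol, cols_per_page::Int = 2)
    frozen = getproperty(tbl, frozen_column)
    # All columns except the frozen one, in their original order.
    others = [name for name in keys(tbl) if name != frozen_column]
    for page in Iterators.partition(others, cols_per_page)
        shown = Symbol[frozen_column, page...]
        println(io, join(shown, " | "))            # header line for this page
        for i in eachindex(frozen)
            println(io, join((getproperty(tbl, n)[i] for n in shown), " | "))
        end
        println(io)                                # blank line between pages
    end
end

# Toy table: `id` plays the role of the real record ID.
tbl = (id = ["r1", "r2"], a = [1, 2], b = [3, 4], c = [5, 6])
show_paged(stdout, tbl; frozen_column = :id)
```

With the toy table, the id values appear at the start of every row on both pages, so a record can be followed across pages without relying on positional row numbers. Column alignment is deliberately left out to keep the sketch short.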

@mkborregaard
Contributor

I think there are potentially two issues here: whether to print row numbers, and whether to automatically associate with each DataFrame a row index that has special properties when printing. Note that R's row-name behaviour is not always intuitive. For example,

> iris[14:18, ]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa
16          5.7         4.4          1.5         0.4  setosa
17          5.4         3.9          1.3         0.4  setosa
18          5.1         3.5          1.4         0.3  setosa

is intuitive, but it is not necessarily intuitive when slicing:

> new_data.frame <- iris[iris$Sepal.Width > 3.5, ]
> head(new_data.frame)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
11          5.4         3.7          1.5         0.2  setosa
15          5.8         4.0          1.2         0.2  setosa
16          5.7         4.4          1.5         0.4  setosa
17          5.4         3.9          1.3         0.4  setosa
> new_data.frame[14:18, ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
45           5.1         3.8          1.9         0.4    setosa
47           5.1         3.8          1.6         0.2    setosa
49           5.3         3.7          1.5         0.2    setosa
110          7.2         3.6          6.1         2.5 virginica
118          7.7         3.8          6.7         2.2 virginica

In the last case, the row names only make sense if they actually mean something.

The way I read @alyst 's comment, the suggestion is to allow the DataFrames to have a custom key associated that has special properties when printing. I like that, to me that seems user-friendly and intuitive. Is some of that functionality already in the NamedArray package?

@alyst
Contributor

alyst commented Sep 9, 2015

@mkborregaard I meant only specifying the column(s) at print time. Column annotations within a data frame are a different story. It would take quite some effort to implement keys, and even for simple annotations we would have to figure out how they should behave under data transformations (e.g. joining or grouping), which might not turn out to be universally intuitive in the end.

@mkborregaard
Contributor

OK, I get it now. I think that idea is nice!
Still, I actually also like the row numbers as they are. In R, I guess they are automatically generated row names and act as such, whereas in Julia they are indexes into the DataFrame being printed (even if it has been sliced before printing). OK, that was my 2 cents; I think I understand it a lot better now.

@nalimilan
Member

@mkborregaard's proposal looks a lot like what Hadley Wickham's tibble does: https://github.com/tidyverse/tibble/blob/master/README.md

@quinnj
Member

quinnj commented Sep 7, 2017

duplicate of #592

@quinnj quinnj closed this as completed Sep 7, 2017
@ghost
Author

ghost commented Sep 13, 2017

@quinnj I think this is not a duplicate. The purpose was to propose a real index, not to hide the row numbers.

@ararslan ararslan reopened this Sep 13, 2017
@nalimilan
Member

If this issue is about adding a concept similar to row names in R or Pandas, I think it can be closed as this probably won't happen. What we could envisage is marking a specific column as being an index like in SQL databases.

@ghost
Author

ghost commented Sep 13, 2017

@nalimilan I think your suggestion is very reasonable, but let me point out that it is quite similar (or even equivalent) to using "row names". The main purpose is to have an index, so even if this index is not printed by default, it will still be useful.

@nalimilan
Member

It's quite similar, but the advantage is that it wouldn't force you to have a useless column of row names when you don't use it. The problem with row names is that they are often redundant with an ID column which already exists in the data, but since row names don't behave like a standard column they are annoying to work with.

@ghost
Author

ghost commented Sep 14, 2017

Sure. I understand your point. It looks like this is a very subjective issue because I often work with datasets that don't have an ID column, so the additional column is very useful. In any case, I think we can agree that an optional index would be a nice feature.

@ararslan
Member

For whatever it's worth, I actually disagree that an optional index would be beneficial. In my opinion, if an index or some other set of names is significant in your data, it should be stored as a column of the dataset.

@ghost
Author

ghost commented Sep 15, 2017

I think that's because you don't use indexes. Regardless of their different behavior, indexes are also useful to increase speed.

@quinnj
Member

quinnj commented Sep 15, 2017

DataFrames is now pretty agnostic to the columns under the hood, so it would be totally possible to create an IndexedColumn type that stored a btree index or whatever. It would take some work to make sure things like join or getindex took advantage, but totally doable and probably composes pretty well w/ the rest of the system now.
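
As a rough illustration of what such a column type could look like, here is a minimal sketch in plain Julia. The field layout and the rows_matching helper are assumptions made up for this example, and a Dict (hash index) stands in for the btree mentioned above. It is not an existing DataFrames.jl type.

```julia
# Hypothetical sketch: a column type that carries a lookup index with its data.
struct IndexedColumn{T} <: AbstractVector{T}
    data::Vector{T}
    index::Dict{T, Vector{Int}}   # value => row positions holding that value
end

# Build the index once, when the column is created.
function IndexedColumn(data::Vector{T}) where {T}
    index = Dict{T, Vector{Int}}()
    for (i, v) in pairs(data)
        push!(get!(() -> Int[], index, v), i)
    end
    return IndexedColumn{T}(data, index)
end

# Minimal AbstractVector interface so the type behaves like an ordinary vector.
Base.size(col::IndexedColumn) = size(col.data)
Base.getindex(col::IndexedColumn, i::Int) = col.data[i]
Base.IndexStyle(::Type{<:IndexedColumn}) = IndexLinear()

# Row positions whose value equals `key`, found without scanning the column.
rows_matching(col::IndexedColumn, key) = get(col.index, key, Int[])

ids = IndexedColumn(["a", "b", "a", "c"])
rows_matching(ids, "a")   # returns [1, 3]
```

The sketch only covers construction and lookup. As noted above, operations like join or getindex would still need work to take advantage of the stored index, and keeping the index in sync under mutation is not handled here.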

@bkamins bkamins mentioned this issue Jan 15, 2019
@Moelf
Contributor

Moelf commented Sep 23, 2019

Could we have something like to_html() (in pandas)? I think show(io, "text/html", df) is pretty close at the moment, but it doesn't seem to dump everything into an HTML table per se.

@nalimilan
Member

> Could we have something like to_html() (in pandas)? I think show(io, "text/html", df) is pretty close at the moment, but it doesn't seem to dump everything into an HTML table per se.

What do you need that it doesn't do?

@bkamins bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020
@bkamins bkamins added breaking The proposed change is breaking. and removed non-breaking The proposed change is not breaking labels Feb 12, 2020
@bkamins bkamins added display and removed breaking The proposed change is breaking. labels Aug 7, 2020
@bkamins
Member

bkamins commented Nov 8, 2020

This is largely resolved with PrettyTables.jl backend now.
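
For anyone landing here later, a short sketch of how the row-number column can be hidden. It assumes a DataFrames.jl 1.x release in which show accepts the show_row_number keyword; the keyword name may differ in other versions, so check the show docstring of your installed version.

```julia
using DataFrames

df = DataFrame(x = 1:100)

# Default display: includes the "Row" column discussed in this issue.
show(df[20:50, :])

# Assumption: the `show_row_number` keyword is accepted by `show` in the
# installed DataFrames.jl version.
show(df[20:50, :]; show_row_number = false)
```

This only changes what is printed. It does not add a real index, which is the separate question raised earlier in the thread.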

@bkamins bkamins closed this as completed Nov 8, 2020