Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading data with Excel drops columns with no values. #822

Closed
vortexkd opened this issue Jul 14, 2020 · 7 comments · Fixed by #909
Closed

Reading data with Excel drops columns with no values. #822

vortexkd opened this issue Jul 14, 2020 · 7 comments · Fixed by #909
Labels

Comments

@vortexkd
Copy link

When reading an excel with data like this:

some_col, my_empty_col, other_col
1,,"hi"

"my_empty_col" does not exist on the resulting Table.

For reference the Csv reader seems to work fine.

I think this has something to do with the way columns are added in XlsxReader, the createColumn is inside the cell != null check, so it never gets added?

if (cell != null) {
          if (column == null) {
            column = createColumn(headerNames.get(colNum), cell);  // looks like it runs only if cell != null
            columns.set(colNum, column);
            while (column.size() < rowNum - tableArea.startRow) {
              column.appendMissing();
            }
          }

Steps to reproduce:

Table csvTable = Table.read().usingOptions(
                XlsxReadOptions.builder(
                        "my_file.csv")
        );
csvTable.column("my_empty_column"); // works fine

Table excelTable = Table.read().usingOptions(
                XlsxReadOptions.builder(
                        "my_file.xlsx")
        );
excelTable.column("my_empty_column"); // throws error.
@lwhite1
Copy link
Collaborator

lwhite1 commented Jul 14, 2020

Hi vortexkd,

Is a column with no values useful to you?

If we read it, it would have to be as a StringColumn, which would likely lead to a bug report from another user who intended their empty column to be a column of numbers or dates.

See also #815, #812

@vortexkd
Copy link
Author

vortexkd commented Jul 15, 2020

My particular use case:
I'm reading the file, processing the data (adding a bunch of things) and outputting the file again.
And it seems really weird for a column to just magically disappear, especially if it isn't always empty and as a result only disappears sometimes.

I understand that typing the column would then become an issue - but it seems that converting the excel file to Csv and then reading it solves the missing column issue - so if dropping empty columns is the stance you want to take on it as a project, then the CSV reader is working strangely (because it keeps empty columns).

-> Either way, it seems like one could reasonably expect the behaviour to be the same if the file is a CSV or an Excel file.

Maybe this sort of thing should have a setting that can be passed into the XlsxReadOptions / CsvReadOptions?

I guess this is similar to #812 but for excel

Edit:
I understand the bug that this would lead to with a simplistic solution (eg, the first row of a column is empty and as a result a numerical column becomes a string column - and panic ensues)
Maybe in the event that columns are created of the wrong type, or are missing etc using the default strategy, the user could send in a collection of "column_name": type_enum to specify expected behaviour?

That would be an enhancement

@lujop
Copy link
Contributor

lujop commented Jan 25, 2021

I think that should be useful to have that configurable in readOptions.
To be able to solve this and #815 and #812 I think that something like this should work:

preserveColumnWithMissingData(String columnName, ColumnType, defaultValue)

@radiocold
Copy link

I have they same problem, this will be fixed??? I need don't remove column with data empty.

@lujop
Copy link
Contributor

lujop commented Mar 4, 2021

I will be happy to provide a PR if something like preserveColumnWithMissingData can be accepted.

@ArslanaWu
Copy link
Contributor

Hi! We are a group of students ( Zhengxin, Jiashu, Zunyao, Wendi and me) who would like to contribute to open source projects. Can we work on this issue? We want to let user specify the type of empty column in advance, otherwise it will not be created as before.

@lwhite1
Copy link
Collaborator

lwhite1 commented Apr 18, 2021 via email

lujop added a commit to lujop/tablesaw that referenced this issue Apr 26, 2021
lujop added a commit to lujop/tablesaw that referenced this issue Apr 27, 2021
lujop added a commit to lujop/tablesaw that referenced this issue Apr 30, 2021
@lwhite1 lwhite1 added the excel label May 4, 2021
lwhite1 pushed a commit that referenced this issue May 9, 2021
…eaders (#909)

* Fix #822 and #815

* Apply PR requestes changes

* Changes asked in PR

* Rename variable for better code readibility
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants