Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't Parse Dates from Excel files into DataFrame Properly #463

Closed
mbrv1434 opened this issue Jun 22, 2022 · 2 comments · Fixed by #505
Closed

Can't Parse Dates from Excel files into DataFrame Properly #463

mbrv1434 opened this issue Jun 22, 2022 · 2 comments · Fixed by #505
Assignees
Labels
enhancement New feature or request

Comments

@mbrv1434
Copy link

mbrv1434 commented Jun 22, 2022

Hello, I'm new to using Danfo.js, which is quite impressive. I'm hoping that I'm so new that I simply don't know how to do what I'm trying to do, but it appears that the functionality isn't there (after looking at the documentation). I'll elaborate below.

Is your feature request related to a problem? Please describe.
I'm trying to add functionality to a web app where the user can upload an excel file that bulk uploads data into the backend. To do this, I'm using Danfo and readExcel() to ready a DataFrame, manipulate the data, and then convert the df to JSON to send to the API. One of the columns contains date data, represented as strings (with '/' or '-' characters in it). The problem I'm having is that when the data is read in, Danfo doesn't seem to know how to interpret it, and decides to do so as a float32 (which I'm assuming is the default, see additional context for examples). This ends up making the data completely un-human-readable as a five-digit number instead of a date string, and it may also be losing information though the casting, but I can't tell.

Describe the solution you'd like
I would like to be able to tell readExcel to parse certain columns as certain data types (in this case, string), similar to how you can assign dtype in pandas. I see when I print the dataframes to console that $dtypes has this information stored, but through much trial and error and documentation combing I was unable to figure out how to pass params or otherwise set dtypes. I tried passing an object since you cannot pass params by name in JS (something like readExcel(file, { dtypes: ['int', 'int', string'] })), but that didn't work either. Perhaps this is how it can be implemented.

Describe alternatives you've considered
At the moment I've figured out how to do this through parsing CSV files, which Danfo has no problem with (same data). My users are not tech savvy, so even though this works I'd like to avoid making them export to .csv before they upload, but this can likely be done if it needs to be. I've also considered doing some math and trying to restore the date information from the float data, but I'd have to rack my brain over how the bits are being set and I'm not sure it's worth the effort. If there's an easy formula to recover the string data and reconvert it after it's been interpreted as float, I'm open to that as well. I've tried using df.asType('target', 'string') as well, but it looks like that just casts the float by converting to int then to string (returning a 5-digit number instead of reverting to the original chars).

Additional context
Treat this as an appendix. Here are the data values I'm passing in and what I'm getting out:

in (from excel file, not csv)
milestone_id | project_id | target
-- | -- | --
100 | 2003 | 2022-09-01
103 | 2003 | 2022-09-01
115 | 2003 | 2023-09-01
116 | 2003 | 2023-09-01
188 | 2003 | 2023-09-01
189 | 2003 | 2023-11-10

out (after let df = await dfd.readExcel(file) and then df.print())
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║ │ milestone_id │ project_id │ target ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0 │ 100 │ 2003 │ 44805 ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1 │ 103 │ 2003 │ 44805 ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2 │ 115 │ 2003 │ 45170 ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3 │ 116 │ 2003 │ 45170 ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4 │ 188 │ 2003 │ 45170 ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 5 │ 189 │ 2003 │ 45240 ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

@mbrv1434 mbrv1434 changed the title Determining how to parse Excel files into DataFrame Can't Parse Dates from Excel files into DataFrame Properly Jun 22, 2022
@sonalkr
Copy link

sonalkr commented Sep 18, 2022

This fix work for me
import { DataFrame } from 'danfojs-node';
import { readFile, utils } from 'xlsx';

const df = new DataFrame(excelToJson(file_path));

excelToJson(file_path: string) {
const ws = readFile(file_path);
const s = ws.Sheets[ws.SheetNames[0]];
return utils.sheet_to_json(s, { defval: null });
}

@risenW
Copy link
Member

risenW commented Sep 25, 2022

Hello, I'm new to using Danfo.js, which is quite impressive. I'm hoping that I'm so new that I simply don't know how to do what I'm trying to do, but it appears that the functionality isn't there (after looking at the documentation). I'll elaborate below.

Is your feature request related to a problem? Please describe. I'm trying to add functionality to a web app where the user can upload an excel file that bulk uploads data into the backend. To do this, I'm using Danfo and readExcel() to ready a DataFrame, manipulate the data, and then convert the df to JSON to send to the API. One of the columns contains date data, represented as strings (with '/' or '-' characters in it). The problem I'm having is that when the data is read in, Danfo doesn't seem to know how to interpret it, and decides to do so as a float32 (which I'm assuming is the default, see additional context for examples). This ends up making the data completely un-human-readable as a five-digit number instead of a date string, and it may also be losing information though the casting, but I can't tell.

Describe the solution you'd like I would like to be able to tell readExcel to parse certain columns as certain data types (in this case, string), similar to how you can assign dtype in pandas. I see when I print the dataframes to console that $dtypes has this information stored, but through much trial and error and documentation combing I was unable to figure out how to pass params or otherwise set dtypes. I tried passing an object since you cannot pass params by name in JS (something like readExcel(file, { dtypes: ['int', 'int', string'] })), but that didn't work either. Perhaps this is how it can be implemented.

Describe alternatives you've considered At the moment I've figured out how to do this through parsing CSV files, which Danfo has no problem with (same data). My users are not tech savvy, so even though this works I'd like to avoid making them export to .csv before they upload, but this can likely be done if it needs to be. I've also considered doing some math and trying to restore the date information from the float data, but I'd have to rack my brain over how the bits are being set and I'm not sure it's worth the effort. If there's an easy formula to recover the string data and reconvert it after it's been interpreted as float, I'm open to that as well. I've tried using df.asType('target', 'string') as well, but it looks like that just casts the float by converting to int then to string (returning a 5-digit number instead of reverting to the original chars).

Additional context Treat this as an appendix. Here are the data values I'm passing in and what I'm getting out:

in (from excel file, not csv)

milestone_id project_id target
100 2003 2022-09-01
103 2003 2022-09-01
115 2003 2023-09-01
116 2003 2023-09-01
188 2003 2023-09-01
189 2003 2023-11-10

out (after let df = await dfd.readExcel(file) and then df.print())
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║ │ milestone_id │ project_id │ target ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0 │ 100 │ 2003 │ 44805 ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1 │ 103 │ 2003 │ 44805 ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2 │ 115 │ 2003 │ 45170 ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3 │ 116 │ 2003 │ 45170 ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4 │ 188 │ 2003 │ 45170 ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 5 │ 189 │ 2003 │ 45240 ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

This is because of the default format of date columns when reading excel files with the xlsx package which we use internally. This can be solved by passing a couple of parsingOptions down to xlsx (see https://stackoverflow.com/questions/53163552/format-date-with-sheetjs).

I'm working on a fix so you can pass extra options down to xlsx read/export function. So you will be able to do this:

const df = await readExcel('test.xlsx', {
      parsingOptions: { cellDates: true}
   })

@risenW risenW linked a pull request Sep 25, 2022 that will close this issue
@risenW risenW self-assigned this Sep 25, 2022
@risenW risenW added the enhancement New feature or request label Sep 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants