-
-
Notifications
You must be signed in to change notification settings - Fork 227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding notebook example for converting Pandas code to Dask #68 #70
Conversation
Inplace reset_index save_csv
rename, reset_index, convert to date
from pandas to dask
Hi @sephib , thanks for the work here. It's clear and gives several good tips. However, I have two general concerns:
Thoughts? |
Sure, I'll be happy to incorporate any topics that you think represents a more general requirements. Unfortunately I don't have an audience to ask. |
Well, you could ask on a github issue and try to get people to respond there. You might ask also on the gitter channel. You could also review previous github issues to see what themes are common. As with most teaching, I think that most of the work here isn't in preparing the notebook, it's in preparing the content that goes into it. Making example notebooks is hard. |
Hi, |
Hi,
|
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### 1.1 Rename" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this one. I don't consider .rename(..., inplace=True)
to be a best practice, and there has been proposals to deprecate inlace in many places in pandas.
I would recommend df = df.rename(columns=...)
, which works for both pandas and dask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a value to show that if we use inplace=True
we get an error
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 1.2 Column munipilations \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: manipulations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - there were a few typos...
"source": [ | ||
"# Dask\n", | ||
"ddf = ddf.assign(Time=ddf.index)\n", | ||
"ddf['Time'] = ddf['Time'].dt.time\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why split this across multiple lines? Does the pandas version not work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work on a dask
dataframe
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Dask is in a development mode\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is very useful. It will go out of date when the bug is fixed (it actually seems to be fixed already).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
I'll update the cell without the workaround
1. rename 2. meta (including returning Series and DataFrame) 3. dateime conversion 4. dropna 5. read/write files
waiting for your feedback in order to iterate on the notebook |
It'll be a bit of time before I can go through in detail. |
OK, sure. Thx for all your input until now. |
ping @TomAugspurger here, in case this one slipped through |
Still on my list, but it'll be a few more days yet.
…On Tue, May 28, 2019 at 9:26 AM Martin Durant ***@***.***> wrote:
ping @TomAugspurger <https://github.com/TomAugspurger> here, in case this
one slipped through
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#70?email_source=notifications&email_token=AAKAOIQH45YYTAHW4CEE533PXU6J7A5CNFSM4HIDAYGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWMJT2Y#issuecomment-496540139>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOIUPK7JS4VPVCCIXN7DPXU6J7ANCNFSM4HIDAYGA>
.
|
Hi, here is an updated version which I presented @pyconil. https://github.com/sephib/dask_pyconil2019/blob/c660db3ce3e56a9241b49ca13e2163895bab3a94/dask_for_pandas-in_ETL.ipynb. Obviously I need to clean it up from the presentation style. |
@sephib , are you still planning on cleaning up your notebook? |
Yes! |
I haven't looked through in any detail, should all be good once you've responded to @TomAugspurger 's comments, although he may want another look. |
@TomAugspurger I've cleaned up the notebook and amended it (taking account your comments). Thx for all your work |
to enable running the entire notebook without errors
The build reports:
|
(perhaps the path during execution is not what you thought it was) |
The build error is :
I've checked my notebook again and it is running smoothly - not sure what I can do about it... |
Same as #85 ? |
I think it is something related to travis-ci. The notebook looks OK. will try and work on it from a different computer. |
@martindurant well it did the trick. |
I think it may fall under the description of a "flaky" test :) @TomAugspurger , I'm happy with how this looks, so can merge if you have no further comments. |
Thank you, @sephib |
Hi,
Following issue #68 the notebook has the following topics:
2.1 Rename
2.2 Column munipilations
2.3 Drop NA on column
2.4 Reset Index
3 Read/Save files
Please feel free to amend the notebook or suggest additional topics.