
Adding notebook example for converting Pandas code to Dask #68 #70

Merged
merged 15 commits into dask:master on Jun 25, 2019

Conversation

sephib
Contributor

@sephib sephib commented Apr 24, 2019

Hi,
Following issue #68, the notebook covers the following topics:

  1. Background
  2. Conceptual shift - from Update to Insert/Delete (see the sketch at the end of this comment)
    2.1 Rename
    2.2 Column manipulations
    2.3 Drop NA on column
    2.4 Reset Index
  3. Read/Save files
  4. Group By
  5. Consider using Persist / Debugging

Please feel free to amend the notebook or suggest additional topics.
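To make the "Conceptual shift" idea concrete, here is a minimal sketch (the column names and data are made up for illustration): instead of updating in place, assign the result of each operation back.

```python
import pandas as pd
import dask.dataframe as dd

# Made-up data, just to illustrate the pattern
pdf = pd.DataFrame({"a": [1, 2, None], "b": [4.0, 5.0, 6.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Pandas habit: update in place, e.g. pdf.rename(columns={"a": "A"}, inplace=True)
# Dask style: assign the result of each step back to the dataframe instead
ddf = ddf.rename(columns={"a": "A"})
ddf = ddf.dropna(subset=["A"])
ddf = ddf.reset_index(drop=True)

print(ddf.compute())
```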

sephib and others added 5 commits April 18, 2019 16:58
@mrocklin
Member

Hi @sephib , thanks for the work here. It's clear and gives several good tips.

However, I have two general concerns:

  1. In many cases these are bugs that could be fixed. I'm not sure I would want to solidify these bugs in user-facing examples, which should probably stay around for a while. Rather, I'd prefer that we just spend the time to fix them.
  2. Often you choose situations that I don't see come up often. For example in the section on "Convert index into Time column" I think I've seen this come up in an issue maybe once or twice over several years. From my perspective it's not one of the major differences between the two libraries. My guess is that you chose usability issues that you yourself ran into, which makes sense, but this may not be representative of the general experience. I think that to do this effectively we would need to survey a few people to get a sense of very common differences that trip people up.

Thoughts?

@sephib
Contributor Author

sephib commented Apr 24, 2019

Sure,
I never know if it is a bug or incorrect coding...

I'll be happy to incorporate any topics that you think represent more general requirements. Unfortunately I don't have an audience to ask.
Please point out specific sections that you would prefer to remove.

@mrocklin
Member

I'll be happy to incorporate any topics that you think represent more general requirements. Unfortunately I don't have an audience to ask.
Please point out specific sections that you would prefer to remove.

Well, you could ask on a github issue and try to get people to respond there. You might ask also on the gitter channel. You could also review previous github issues to see what themes are common.

As with most teaching, I think that most of the work here isn't in preparing the notebook, it's in preparing the content that goes into it. Making example notebooks is hard.

@sephib
Contributor Author

sephib commented Apr 28, 2019

Hi,
I'll try and get some information from data.stackexchange and will update when I have additional information.

@sephib
Contributor Author

sephib commented May 12, 2019

Hi,
I reviewed the Stack Overflow posts tagged dask and pandas with scores above 5 and came up with some additional issues. Please feel free to comment on any of them.

  1. Reading csv (see the sketch after this list)
    1.1. reading multiple csv files (with '*')
    1.2. reading using kwargs - all **kwargs are available, such as compression='gzip'
    1.3. reading directly from hdfs
  2. Create dataframe
    2.1. Use dd.from_pandas(..., npartitions=n)
  3. Conceptual shift - from Update to Insert/Delete
    3.1. Rename
  4. Data manipulations
    4.1. As is with Pandas - always try to vectorize
    4.2. Working with map_partition vs apply
    4.3. Understanding meta
    4.4. Using Masks /Where
    4.5. Drop NA axis=columns
  5. Understanding index
    5.1. Index per partition
    5.2. Set/Reset Index
  6. Save files
  7. Group By
  8. Consider using Persist
  9. Debugging
    9.1. dd.head() - only uses the first partition (not all partitions are loaded)
    9.2. errors due to corrupted DAG
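A self-contained sketch of a few of these patterns (items 1.1/1.2, 2.1, 4.2/4.3 and 9.1); the file names, column names, and the add_total helper are made up for illustration:

```python
from pathlib import Path
import pandas as pd
import dask.dataframe as dd

# Create a couple of small CSV files so the example is self-contained
# (the data/pd2dd path mirrors the one used in the notebook)
dir_path = Path("data/pd2dd")
dir_path.mkdir(parents=True, exist_ok=True)
for i in range(2):
    pd.DataFrame({"x": range(5), "y": range(5)}).to_csv(dir_path / f"file_{i}.csv", index=False)

# 1.1 / 1.2: read multiple files with '*'; any pandas read_csv kwarg can be
# passed through (e.g. compression='gzip' for gzipped files, not used here)
ddf = dd.read_csv("data/pd2dd/*.csv")

# 2.1: build a dask dataframe from an existing pandas dataframe
pdf = pd.DataFrame({"x": range(10), "y": range(10)})
ddf2 = dd.from_pandas(pdf, npartitions=2)

# 4.2 / 4.3: map_partitions applies a function to each underlying pandas
# DataFrame; meta describes the output so dask can build the graph lazily
def add_total(df):
    return df.assign(total=df.x + df.y)

ddf2 = ddf2.map_partitions(add_total, meta={"x": "i8", "y": "i8", "total": "i8"})

# 9.1: head() only looks at the first partition
print(ddf.head())
print(ddf2.head())
```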

"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 Rename"
Member

I would remove this one. I don't consider .rename(..., inplace=True) to be a best practice, and there have been proposals to deprecate inplace in many places in pandas.

I would recommend df = df.rename(columns=...), which works for both pandas and dask.
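A minimal sketch of the recommended form, with made-up column names; the commented-out inplace=True line reflects the behaviour discussed in the reply below and may differ between dask versions:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": [1, 2, 3]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Works the same way for both libraries: assign the result back
pdf = pdf.rename(columns={"a": "A"})
ddf = ddf.rename(columns={"a": "A"})

# ddf.rename(columns={"a": "A"}, inplace=True)  # not supported by dask and
#                                               # typically raises an error
```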

Contributor Author

I think there is value in showing that if we use inplace=True we get an error.

"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.2 Column munipilations \n",
Member

Typo: manipulations

Contributor Author

Sorry - there were a few typos...

"source": [
"# Dask\n",
"ddf = ddf.assign(Time=ddf.index)\n",
"ddf['Time'] = ddf['Time'].dt.time\n",
Member

Why split this across multiple lines? Does the pandas version not work?

Contributor Author

This doesn't work on a dask dataframe
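For reference, a self-contained sketch of the two-step dask version quoted above, next to the single-step form that works in pandas (the data here is made up, and the pandas one-liner is my assumption of what the original cell looked like):

```python
import pandas as pd
import dask.dataframe as dd

# A small datetime-indexed frame, just for illustration
pdf = pd.DataFrame(
    {"value": range(4)},
    index=pd.date_range("2019-01-01", periods=4, freq="6H"),
)
ddf = dd.from_pandas(pdf, npartitions=2)

# Pandas: the index exposes .time directly
pdf["Time"] = pdf.index.time

# Dask: move the index into a column first, then use the .dt accessor
ddf = ddf.assign(Time=ddf.index)
ddf["Time"] = ddf["Time"].dt.time

print(ddf.compute())
```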

"cell_type": "markdown",
"metadata": {},
"source": [
"Dask is in a development mode\n",
Member

I don't think this is very useful. It will go out of date when the bug is fixed (it actually seems to be fixed already).

Contributor Author

Great!
I'll update the cell without the workaround

sephib added 3 commits May 15, 2019 00:56
1. rename
2. meta (including returning Series and DataFrame)
3. datetime conversion
4. dropna
5. read/write files
@sephib
Contributor Author

sephib commented May 20, 2019

Waiting for your feedback in order to iterate on the notebook.

@TomAugspurger
Member

It'll be a bit of time before I can go through in detail.

@sephib
Contributor Author

sephib commented May 21, 2019

OK, sure. Thanks for all your input so far.

@martindurant
Member

ping @TomAugspurger here, in case this one slipped through

@TomAugspurger
Member

TomAugspurger commented May 28, 2019 via email

@sephib
Contributor Author

sephib commented Jun 4, 2019

Hi, here is an updated version which I presented at @pyconil: https://github.com/sephib/dask_pyconil2019/blob/c660db3ce3e56a9241b49ca13e2163895bab3a94/dask_for_pandas-in_ETL.ipynb. Obviously I need to clean it up from the presentation style.

@martindurant
Member

@sephib , are you still planning on cleaning up your notebook?

@sephib
Contributor Author

sephib commented Jun 18, 2019

Yes!
I will remove all the presentation cells.
Do you have any other inputs / issues that you would like me to address?

@martindurant
Member

I haven't looked through in any detail; it should all be good once you've responded to @TomAugspurger's comments, although he may want another look.

@sephib
Contributor Author

sephib commented Jun 22, 2019

@TomAugspurger I've cleaned up the notebook and amended it (taking your comments into account). Thanks for all your work.

to enable running the entire notebook without errors
@martindurant
Member

The build reports:

nbconvert.preprocessors.execute.CellExecutionError: An error occurred while executing the following cell:
------------------
%%time
# Pandas
dir_path = Path(r'data/pd2dd')
concat_df = pd.concat([pd.read_csv(f) 
                       for f in list(dir_path.glob('*.csv'))])
len(concat_df)
------------------
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<timed exec> in <module>
~/miniconda/envs/test/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    226                        keys=keys, levels=levels, names=names,
    227                        verify_integrity=verify_integrity,
--> 228                        copy=copy, sort=sort)
    229     return op.get_result()
    230 
~/miniconda/envs/test/lib/python3.7/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    260 
    261         if len(objs) == 0:
--> 262             raise ValueError('No objects to concatenate')
    263 
    264         if keys is None:
ValueError: No objects to concatenate

@martindurant
Member

(perhaps the path during execution is not what you thought it was)
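One way to guard against an empty glob (a sketch; the path matches the failing cell, and the early check is just a suggestion):

```python
from pathlib import Path
import pandas as pd

dir_path = Path("data/pd2dd")
csv_files = sorted(dir_path.glob("*.csv"))

# Fail early with a clear message if the glob matched nothing, instead of
# letting pd.concat raise "No objects to concatenate"
if not csv_files:
    raise FileNotFoundError(f"No CSV files found in {dir_path.resolve()}")

concat_df = pd.concat([pd.read_csv(f) for f in csv_files])
print(len(concat_df))
```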

@sephib
Contributor Author

sephib commented Jun 24, 2019

The build error is:

ChunkedEncodingError: ('Connection broken: OSError("(104, 'ECONNRESET')")', OSError("(104, 'ECONNRESET')"))
ChunkedEncodingError: ('Connection broken: OSError("(104, 'ECONNRESET')")', OSError("(104, 'ECONNRESET')"))
You can ignore this error by setting the following in conf.py:
nbsphinx_allow_errors = True
Notebook error:
CellExecutionError in applications/json-data-on-the-web.ipynb:

events.pluck('spec').frequencies(sort=True).take(20)

I've checked my notebook again and it is running smoothly - not sure what I can do about it...

@martindurant
Member

Same as #85 ?

@sephib
Contributor Author

sephib commented Jun 24, 2019

I think it is something related to travis-ci. The notebook looks OK. I will try and work on it from a different computer.

@sephib
Contributor Author

sephib commented Jun 24, 2019

@martindurant well, it did the trick.
@TomAugspurger please don't hesitate to amend the existing topics or suggest additional ideas for the notebook and I'll try to implement them.

@martindurant
Member

I think it may fall under the description of a "flaky" test :)

@TomAugspurger , I'm happy with how this looks, so can merge if you have no further comments.

@martindurant martindurant merged commit 18dd483 into dask:master Jun 25, 2019
@martindurant
Member

Thank you, @sephib
