[skip ci] Fix Broken Link in NYC Taxi Notebook #616

Merged 3 commits on Aug 4, 2022
105 changes: 65 additions & 40 deletions notebooks/nyc_taxi_years_correlation.ipynb
@@ -58,9 +58,9 @@
}
Member:

Why is there a big slowdown? Running on a different GPU?



isVoid (Contributor, Author) commented on Aug 2, 2022:

@taureandyernv, are you aware of how the notebook was run prior to this change? On what machine, with or without a memory pool, etc.?

Contributor:

You're saving the parquet file as a csv :)
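
For illustration (editor's sketch, not part of this PR): one way to make the extension match the format would be to download the 2017 file under a hypothetical `taxi2017.parquet` name and read it with the parquet reader. The URL is the one from the diff; the filename and the rest of the cell are assumptions.

```python
# Sketch only: "taxi2017.parquet" is a hypothetical rename so the file
# extension reflects the actual format; the URL comes from the PR diff.
import os
import urllib.request

import cudf

URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2017-01.parquet"

if not os.path.exists("taxi2017.parquet"):
    urllib.request.urlretrieve(URL, "taxi2017.parquet")

taxi2017 = cudf.read_parquet("taxi2017.parquet")
```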

Member:

@taureandyernv do you have an answer to the question about what hardware it was run on, memory pool, etc.? We need to make sure this is not a real regression.

Member:

In the future we should always store this information in the notebook.

Contributor:

Actually... this is about right. This should take 30-ish seconds on Tesla-based hardware and 3.5 minutes on a Quadro. @isVoid, what slowdown are you talking about?

Member:

If you look at the diff in this PR, you will see the slowdown I am talking about.

The notebook as checked in stores the time taken by the previous run. We should document in the notebook what hardware, RAPIDS and CUDA versions, etc. it was run on. Otherwise, when people run it and see a slower time, they are likely to be confused or surprised.

None of this "this is about right" guesswork. Let's be precise.
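
For illustration (editor's sketch, not from this PR): one way to record this would be a provenance cell at the top of the notebook that prints the GPU and library versions next to the timings. The exact cell contents below are an assumption.

```python
# Hypothetical provenance cell: record the hardware and software that
# produced the timings below, so future readers can compare fairly.
import subprocess

import cudf
import cuspatial

gpu_info = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
     "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()

print("GPU / driver / memory:", gpu_info)
print("cudf version:         ", cudf.__version__)
print("cuspatial version:    ", cuspatial.__version__)
```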

Member:

Here is the slowdown:

old:

      "CPU times: user 11.2 s, sys: 6.27 s, total: 17.5 s\n",
      "Wall time: 17.7 s\n"

new:

      "CPU times: user 22.6 s, sys: 14.6 s, total: 37.2 s\n",
      "Wall time: 37.2 s\n"

Contributor:

That is interesting, and I honestly don't recall seeing it run that fast. However, we can't yet compare these numbers and say there's a regression, though we can strongly suspect one. When we built this notebook for the 0.14 release, I did my work on RTX 8000s (here is the commit with the ~3.5 minute run), ran some quick tests on GV100s to see how my time could be so different, and only reran it on GV100s with this update. @thomcom used a different machine that was consistently faster with the point-in-polygon (PIP) algorithm. It was cuSpatial's PIP that made me realize how deep the per-algorithm performance discrepancies between classes of cards in the same architecture family can be. Looking back through the branches, the notebook hasn't actually been rerun since its merge, and the CI Jenkins jobs for each release may have already been deleted. This may end up being a @thomcom question, as those 17-second numbers were run on his machine, as per 6c3d853 and this commit (which I could only render). We may need him to rerun the notebook on that system to get any realistic understanding of whether there is a regression and where it could have happened. @thomcom, could you provide your system details for @harrism and, if you still have that system, rerun the notebook on it against the 22.08 nightlies and give us numbers?

A bit about the past:

While I can talk about what I recall observing then (and could be mistaken), the challenge you're facing comes from notebook regression testing and record keeping stopping right around 0.14; notebook testing didn't start back up until after 0.18, and wasn't nearly as thorough until very recently. Some libraries reestablished testing and record keeping more wholeheartedly than others, but cuGraph remains a shining example of what might help with system-spec tracking, with their table at the top of each notebook. For now, would it help if current and future cuSpatial notebooks included a version of cuGraph's tables? (A hypothetical example follows.)
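
For illustration (editor's sketch): a hypothetical cuGraph-style header table for this notebook; every value below is a placeholder, not a recorded run.

| Author | Last run | RAPIDS | CUDA | GPU | OS |
|--------|----------|--------|------|-----|----|
| taureandyernv | 2022-08-02 | 22.08 | 11.x | Quadro GV100 | Ubuntu 20.04 |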

Thoughts on the future:

I absolutely agree on precise record keeping with system specs, which is why it's a built-in portion of the outputs from that new beta notebook-testing script (although we currently don't test the cuSpatial notebook). We're only collecting data, not yet processing or analyzing it. We're in talks to get it integrated into Ops. Would you like me to start testing cuSpatial notebooks? Happy to share more info.

Contributor:

Hey @taureandyernv @harrism, I'll try to rerun the notebook this afternoon or tomorrow.

],
"source": [
"!if [ ! -f \"tzones_lonlat.json\" ]; then curl \"https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc?method=export&format=GeoJSON\" -o tzones_lonlat.json; else echo \"tzones_lonlat.json found\"; fi\n",
"!if [ ! -f \"taxi2016.csv\" ]; then curl https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv -o taxi2016.csv; else echo \"taxi2016.csv found\"; fi \n",
"!if [ ! -f \"taxi2017.csv\" ]; then curl https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-01.csv -o taxi2017.csv; else echo \"taxi2017.csv found\"; fi"
"!if [ ! -f \"tzones_lonlat.json\" ]; then curl 'https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc?method=export&format=GeoJSON' -o tzones_lonlat.json; else echo \"tzones_lonlat.json found\"; fi\n",
"!if [ ! -f \"taxi2016.csv\" ]; then curl 'https://storage.googleapis.com/anaconda-public-data/nyc-taxi/csv/2016/yellow_tripdata_2016-01.csv' -o taxi2016.csv; else echo \"taxi2016.csv found\"; fi\n",
"!if [ ! -f \"taxi2017.csv\" ]; then curl 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2017-01.parquet' -o taxi2017.csv; else echo \"taxi2017.csv found\"; fi"
isVoid marked this conversation as resolved.
]
},
{
@@ -78,7 +78,9 @@
"outputs": [],
"source": [
"taxi2016 = cudf.read_csv(\"taxi2016.csv\")\n",
"taxi2017 = cudf.read_csv(\"taxi2017.csv\")"
"taxi2016 = taxi2016.astype({\"tpep_pickup_datetime\": \"datetime64[us]\", \"tpep_dropoff_datetime\": \"datetime64[us]\"})\n",
"\n",
"taxi2017 = cudf.read_parquet(\"taxi2017.csv\")"
isVoid marked this conversation as resolved.
]
},
{
@@ -96,7 +98,7 @@
{
"data": {
"text/plain": [
"{'DOLocationID', 'PULocationID'}"
"{'DOLocationID', 'PULocationID', 'airport_fee', 'congestion_surcharge'}"
]
},
"execution_count": 4,
@@ -121,7 +123,16 @@
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_4526/2205681195.py:2: UserWarning: Column names longer than 10 characters will be truncated when saved to ESRI Shapefile.\n",
" tzones.to_file('cu_taxi_zones.shp')\n"
]
}
],
"source": [
"tzones = gpd.GeoDataFrame.from_file('tzones_lonlat.json')\n",
"tzones.to_file('cu_taxi_zones.shp')"
@@ -181,8 +192,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 11.2 s, sys: 6.27 s, total: 17.5 s\n",
"Wall time: 17.7 s\n"
"CPU times: user 22.6 s, sys: 14.6 s, total: 37.2 s\n",
"Wall time: 37.2 s\n"
Member:

This is the slowdown I'm talking about. More than 2x slower.

Contributor:

Oh thanks! I hadn't noticed that at all!

]
}
],
@@ -227,7 +238,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"0.10632302994054163\n"
"0.10632302994054166\n"
]
}
],
@@ -275,8 +286,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
"0.6013696841885884\n",
"0.5822576814239405\n"
"0.601369302655919\n",
"0.5822614630778491\n"
]
}
],
@@ -320,7 +331,7 @@
"output_type": "stream",
"text": [
" VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n",
"542479 1 2017-01-15 17:33:37 2017-01-15 17:35:32 1 \n",
"4487880 1 2017-01-15 17:33:37 2017-01-15 17:35:32 1 \n",
"159122 1 2016-01-01 04:19:02 2016-01-01 04:25:09 3 \n",
"180769 2 2016-01-01 06:40:21 2016-01-01 06:41:09 2 \n",
"4871707 1 2016-01-15 01:07:34 2016-01-15 01:11:50 1 \n",
@@ -329,42 +340,51 @@
"9259375 1 2016-01-23 10:58:21 2016-01-23 11:20:08 1 \n",
"\n",
" trip_distance RatecodeID store_and_fwd_flag PULocationID \\\n",
"542479 0.0 5 N 204 \n",
"4487880 0.0 5 N 204 \n",
"159122 1.5 1 N 204 \n",
"180769 0.0 5 N 204 \n",
"4871707 1.2 1 N 204 \n",
"7766271 0.0 5 N 204 \n",
"9039663 0.0 5 N 204 \n",
"9259375 4.8 1 N 204 \n",
"\n",
" DOLocationID payment_type ... extra mta_tax tip_amount \\\n",
"542479 204 1 ... 0.0 0.0 10.0 \n",
"159122 204 2 ... 0.5 0.5 0.0 \n",
"180769 204 1 ... 0.0 0.5 0.0 \n",
"4871707 117 1 ... 0.5 0.5 0.0 \n",
"7766271 204 1 ... 0.0 0.0 0.0 \n",
"9039663 204 1 ... 0.0 0.0 10.0 \n",
"9259375 77 2 ... 0.0 0.5 0.0 \n",
" DOLocationID payment_type ... tip_amount tolls_amount \\\n",
"4487880 204 1 ... 10.0 0.0 \n",
"159122 204 2 ... 0.0 0.0 \n",
"180769 204 1 ... 0.0 0.0 \n",
"4871707 117 1 ... 0.0 0.0 \n",
"7766271 204 1 ... 0.0 0.0 \n",
"9039663 204 1 ... 10.0 0.0 \n",
"9259375 77 2 ... 0.0 0.0 \n",
"\n",
" tolls_amount improvement_surcharge total_amount pickup_longitude \\\n",
"542479 0.0 0.3 103.22 null \n",
"159122 0.0 0.3 8.30 -73.83747864 \n",
"180769 0.0 0.3 70.80 -73.83963776 \n",
"4871707 0.0 0.3 7.30 -73.82751465 \n",
"7766271 0.0 0.3 91.10 -73.83338165 \n",
"9039663 0.0 0.3 94.14 -73.8551712 \n",
"9259375 0.0 0.3 19.30 -73.84205627 \n",
" improvement_surcharge total_amount congestion_surcharge \\\n",
"4487880 0.3 103.22 <NA> \n",
"159122 0.3 8.30 <NA> \n",
"180769 0.3 70.80 <NA> \n",
"4871707 0.3 7.30 <NA> \n",
"7766271 0.3 91.10 <NA> \n",
"9039663 0.3 94.14 <NA> \n",
"9259375 0.3 19.30 <NA> \n",
"\n",
" pickup_latitude dropoff_longitude dropoff_latitude \n",
"542479 null null null \n",
"159122 40.57971573 -73.85549164 40.57531738 \n",
"180769 40.57685089 -73.83963776 40.57685089 \n",
"4871707 40.58262253 -73.8052063 40.58834839 \n",
"7766271 40.58089828 -73.83340454 40.58086777 \n",
"9039663 40.57473373 -73.85518646 40.57474899 \n",
"9259375 40.57816315 -73.76178741 40.59595871 \n",
" airport_fee pickup_longitude pickup_latitude dropoff_longitude \\\n",
"4487880 <NA> <NA> <NA> <NA> \n",
"159122 <NA> -73.83747864 40.57971573 -73.85549164 \n",
"180769 <NA> -73.83963776 40.57685089 -73.83963776 \n",
"4871707 <NA> -73.82751465 40.58262253 -73.8052063 \n",
"7766271 <NA> -73.83338165 40.58089828 -73.83340454 \n",
"9039663 <NA> -73.8551712 40.57473373 -73.85518646 \n",
"9259375 <NA> -73.84205627 40.57816315 -73.76178741 \n",
"\n",
"[7 rows x 21 columns]\n"
" dropoff_latitude \n",
"4487880 <NA> \n",
"159122 40.57531738 \n",
"180769 40.57685089 \n",
"4871707 40.58834839 \n",
"7766271 40.58086777 \n",
"9039663 40.57474899 \n",
"9259375 40.59595871 \n",
"\n",
"[7 rows x 23 columns]\n"
]
}
],
@@ -390,7 +410,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -404,7 +424,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
"version": "3.8.13"
},
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,