This repository was archived by the owner on Jun 17, 2024. It is now read-only.

Commit d209bec

Merge pull request #96 from chennesy/jt14den-patch-1
adding date conversion part to tidy.md
2 parents 6244234 + f6080c7 commit d209bec

3 files changed, with 66 additions and 0 deletions

episodes/data/df_long.pkl

89.6 KB
Binary file not shown.

episodes/files/data.zip

1.45 KB
Binary file not shown.

episodes/tidy.md

Lines changed: 66 additions & 0 deletions
@@ -264,6 +264,72 @@ df_long.groupby(['branch', 'month'])['circulation'].agg(['sum', 'mean'])

<p>984 rows × 2 columns</p>

## Adding a Date Column

To plot this data over time in the data visualization episode, we need to do three things to prepare it. First, we combine the year and month columns into a single new column. Second, we convert the new date column to a [datetime](https://docs.python.org/3/library/datetime.html) object using the Pandas `to_datetime` function. Third, we assign the date column as the index of our data. These steps will set up our data for plotting.

``` python
# Join the year and month strings into one "year-month" string
df_long['date'] = df_long['year'] + '-' + df_long['month']
```

This creates a new column in our DataFrame by joining the year and month values with a `-` between them. This format is now sufficient for us to use `pd.to_datetime()` to convert the column to something Python and Pandas recognize as a date.
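
The `+` concatenation relies on both columns holding strings, which is what the `object` dtype of `year` and `month` indicates here. As a minimal sketch, assuming a hypothetical `toy` DataFrame (not part of the lesson data) in which `year` was read as integers instead, the column can be cast with `.astype(str)` before joining:

```python
import pandas as pd

# Hypothetical toy data in which year was read as an integer
toy = pd.DataFrame({'year': [2011, 2011], 'month': ['January', 'February']})

# Cast the integer column to strings before joining with '-'
toy['date'] = toy['year'].astype(str) + '-' + toy['month']
print(toy['date'].tolist())
```

Without the cast, adding an integer column to a string would raise an error rather than build the combined text.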

```python
df_long['date'] = pd.to_datetime(df_long['date'], format='%Y-%B')
```
`pd.to_datetime()` will do the conversion, but we need to tell it how our dates are formatted. In this case we have a four-digit year (`%Y`) followed by the full month name (`%B`), separated by a `-`. To find more format codes, see <https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior>.
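
To see the format string at work in isolation, here is a small sketch on a few hand-written strings. The `sample` values are hypothetical and are not taken from the lesson's dataset, and the example assumes an English locale so that `%B` matches English month names.

```python
import pandas as pd

# Hypothetical sample values, not taken from the lesson's dataset
sample = pd.Series(['2011-January', '2011-February', '2022-December'])

# %Y matches a four-digit year, %B matches a full month name
parsed = pd.to_datetime(sample, format='%Y-%B')
print(parsed)

# No day was supplied, so each timestamp falls on the first of the month
print(parsed.dt.day.tolist())
```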

If we take a look at the date column, we'll see that Pandas automatically sets the day to `01` because our input did not specify a day.

```python
df_long['date']
```

```output
0       2011-01-01
1       2011-01-01
2       2011-01-01
3       2011-01-01
4       2011-01-01
           ...
11551   2022-12-01
11552   2022-12-01
11553   2022-12-01
11554   2022-12-01
11555   2022-12-01
Name: date, Length: 11556, dtype: datetime64[ns]
```

``` python
df_long.info()
```

``` output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11556 entries, 0 to 11555
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   branch       11556 non-null  object
 1   address      7716 non-null   object
 2   city         7716 non-null   object
 3   zip code     7716 non-null   float64
 4   ytd          11556 non-null  int64
 5   year         11556 non-null  object
 6   month        11556 non-null  object
 7   circulation  11556 non-null  int64
 8   date         11556 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2), object(5)
memory usage: 812.7+ KB
```

That worked! Now we can make the datetime column the index of our DataFrame. In the Pandas episode we looked at the default numerical index that Pandas assigns, but we can also use `.set_index()` to declare a specific column as the index of our DataFrame. Using a datetime index will make it easier for us to plot the DataFrame over time. The first parameter of `.set_index()` is the column name, and the `inplace=True` parameter lets us modify the DataFrame directly without assigning the result to a new variable.

``` python
df_long.set_index('date', inplace=True)
```

If we look at the data again, we will see that the index is now set to `date`.
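
As a rough illustration of what a datetime index offers, the sketch below uses a hypothetical `toy` DataFrame standing in for `df_long` (the values are made up). It calls `.set_index()` without `inplace=True`, which returns a new DataFrame instead of modifying the original, and then selects rows by year with a partial date string.

```python
import pandas as pd

# Hypothetical toy data standing in for df_long; values are made up
toy = pd.DataFrame({
    'date': pd.to_datetime(['2011-01-01', '2011-02-01', '2012-01-01']),
    'circulation': [120, 95, 130],
})

# Without inplace=True, set_index returns a new DataFrame
# and leaves the original untouched
toy_indexed = toy.set_index('date')

# The index is now a DatetimeIndex rather than a numerical RangeIndex
print(type(toy_indexed.index).__name__)

# Partial string indexing: select every row from the year 2011
print(toy_indexed.loc['2011'])
```

Either style works; assigning the result to a new variable simply keeps the original DataFrame available if it is still needed.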

Let's save `df_long` to use in the next episode.
