---
title: "R advantages over python"
author: "Iyar Lin"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
github_document:
toc: true
toc_depth: 3
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = T, warning = F, message = F, cache = F, eval = F)
set.seed(1)
options(scipen = 999)
library(reticulate)
```
<br>
*"don't be fooled by the hype python's got*
*R still R still is the tool you want"*

# Motivation
R has many advantages over python that should be taken into consideration when
choosing which language to do DS with. When compiling them in this repo I try to
avoid:

1. Overly subjective comparisons, e.g. function indentation vs curly brace closure.
1. Issues one can get used to after a while, like python indexing (though the
fact that it starts from 0, or that object[0:2] returns only the first 2
elements, still throws me off once in a while).
## How to contribute
I'm adding examples to this repo as I encounter them. There are areas such as
dashboards and apps, advanced statistics or production environments with which I'm
less familiar. If you'd like to add examples, including ones where python has the
edge over R (there's a small [section](#python_better_than_r) for that), feel free
to add them to this README and open a pull request.
**I encourage the reader to point out cases
where I'm wrong** - for example when better ways to perform a task in python
exist. I appreciate the feedback.
Note this repo has a [discussions section](https://github.com/IyarLin/R-advantages-over-python/discussions)
so feel free to comment there.
# Working with dplyr is much faster than pandas
Many popularity comparisons point to stack overflow question counts as evidence that pandas'
popularity is growing much faster than dplyr's. I think at least some of the
increase in pandas questions has to do with how confusing pandas is.
I hope that by the end of this section you'll find merit in my hypothesis.
```{r, echo=FALSE, eval=T}
library(dplyr)
iris <- read.csv("stuff/iris.csv")
```
```{python, echo=FALSE, eval=T}
import pandas as pd
import numpy as np
iris = pd.read_csv("stuff/iris.csv")
```
<a name="aggregation"></a>
## Aggregation
This section will demonstrate that pandas syntax really
contrasts with the principle from the Zen of python: "There should be one-- and preferably
only one --obvious way to do it".
We'll start with a simple example: calculate the mean Sepal length within each species in the iris dataset.
In dplyr:
```{r}
iris %>%
  group_by(Species) %>%
  summarise(mean_length = mean(Sepal.Length))
```
A common way of doing the same in pandas would be using the *agg* method:
```{python}
(
    iris
    .groupby('Species')
    .agg({'Sepal.Length':'mean'})
    .rename({'Sepal.Length':'mean_length'}, axis = 1)
)
```
We can see that pandas requires an additional *rename* call.
We can avoid the additional *rename* by passing a tuple to *agg*:
```{python}
iris.groupby('Species').agg(mean_length = ('Sepal.Length', 'mean'))
```
When aggregating over a single column we could pass a string rather than a tuple like so:
```{python}
iris.groupby('Species')['Sepal.Length'].agg(mean_length = 'mean')
```
When aggregating over a single column with a function that is also a DataFrame method one could skip the *agg* method altogether using:
```{python}
iris.groupby('Species')['Sepal.Length'].mean()
```
The above demonstrates yet again how confusing pandas can be, with all those different ways of doing the same thing.
### Multiple input variables (weighted average example)
Now let's say we'd like to use a weighted average (with sepal width as weights).
In dplyr we'd use the weighted mean function with an additional argument:
```{r}
iris %>%
  group_by(Species) %>%
  summarize(weighted_mean_length = weighted.mean(Sepal.Length, Sepal.Width))
```
Pretty straightforward. In fact, it's so straightforward we can do the actual
weighted mean calculation on the fly:
```{r}
iris %>%
  group_by(Species) %>%
  summarize(weighted_mean_length = sum(Sepal.Length * Sepal.Width) / sum(Sepal.Width))
```
In pandas I found an [answer](https://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns/10964938#10964938) which got 104 upvotes on stack overflow:
```{python}
def weighted_mean(group):
    x = group['Sepal.Length']
    w = group['Sepal.Width']
    return (x * w).sum() / w.sum()

(
    iris
    .groupby('Species')
    .apply(weighted_mean)
)
```
Problem is - we get back a series, not a DataFrame. I've searched stack overflow
some more and combined the above with this [answer](https://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns/47103408#47103408)
(look how long it is!) to get the following:
```{python}
def weighted_mean(group):
    d = {}
    x = group['Sepal.Length']
    w = group['Sepal.Width']
    d['weighted_mean_length'] = (x * w).sum() / w.sum()
    return pd.Series(d, index=['weighted_mean_length'])

(
    iris
    .groupby('Species')
    .apply(weighted_mean)
)
```
We can see that:

1. We have to define a custom function, and it can't even work for general inputs
but rather has to have them hard-coded.
2. The syntax is super cumbersome and requires searching stack overflow extensively.
3. We need to use *apply* instead of the common *agg* method.
4. I'm pretty sure anyone who hasn't used the above code for a few weeks
would have to search stack overflow/their code base again to find the answer the next
time they need to do that calculation.
Following [Christophe's](https://medium.com/@christopherpynn) comment on my Medium
post I found out there's a much simpler solution:
```{python}
(
    iris
    .groupby('Species')
    .apply(lambda x: pd.Series({'weighted_mean_length':
        np.average(x['Sepal.Length'], weights = x['Sepal.Width'])}))
)
```
Quite surprising given the number of upvotes those stack overflow answers got.
Another comment by [Samuel Oranyeli](https://gist.github.com/samukweku/88272c539743e9507dd275e1d2d71018)
revealed one could actually use the *pipe* method like so:
```{python}
(
    iris
    .assign(weighted_sum = iris['Sepal.Length'].mul(iris['Sepal.Width']))
    .groupby('Species')
    .pipe(lambda df: df.weighted_sum.sum()/df['Sepal.Width'].sum())
)
```
So... we're using yet another new method: *pipe*, which quite frankly I have a hard
time understanding how it differs from *apply*. (There's an active stack
overflow [question](https://stackoverflow.com/questions/47226407/pandas-groupby-pipe-vs-apply#:~:text=1%20Answer&text=What%20pipe%20does%20is%20to,that%20was%20passed%20to%20apply%20.) on that).
### Multiple functions on multiple input variables
Let's say we'd like to calculate the mean and max sepal length, and the min sepal width.
In dplyr it's easy enough:
```{r, eval=TRUE}
iris %>% summarise(
  sepal_length_mean = mean(Sepal.Length),
  sepal_length_max = max(Sepal.Length),
  sepal_width_min = min(Sepal.Width))
```
If you were to group it by Species for example, you'd get the exact same data frame, only this time with a row per species, like you'd expect:
```{r, eval=TRUE}
iris %>%
  group_by(Species) %>%
  summarise(
    sepal_length_mean = mean(Sepal.Length),
    sepal_length_max = max(Sepal.Length),
    sepal_width_min = min(Sepal.Width))
```
In pandas you get very different behavior depending on whether the data frame is grouped or not. When it's ungrouped:
```{python, eval=T}
(
    iris
    .agg({'Sepal.Length':['mean', 'max'], 'Sepal.Width':'min'})
)
```
We can see that the function names were stored in the index while the variables they operated on are in the columns. This results in all those pesky NaNs.
If we were to apply those functions to a grouped data frame however:
<a name="multi_index"></a>
```{python, eval=T}
(
    iris
    .groupby('Species')
    .agg({'Sepal.Length':['mean', 'max'], 'Sepal.Width':'min'})
)
```
Now the grouping variable levels occupy the row index while the functions and variables were moved to a column multi-index. It prints nicely, but it also begs the question - one you'll have to go search stack overflow for - of how to rename those columns or select them by name.
What if you just want to perform several aggregations without grouping? The only way I found of doing that was:
```{python, eval=T}
(
    iris
    .groupby(lambda x: 1)
    .agg({'Sepal.Length':['mean', 'max'], 'Sepal.Width':'min'})
)
```
## Window functions
### Aggregation over a window
Let's say we'd like to calculate the mean of sepal length within each species
and append that to the original dataset (in SQL: AVG(Sepal.Length) OVER(PARTITION BY Species)). In pandas that would be:
```{python}
iris.assign(mean_sepal = lambda x: x.groupby('Species')['Sepal.Length'].transform(np.mean))
```
We can see that this requires a dedicated method (*transform*), compared with
dplyr which only requires adding a group_by:
```{r}
iris %>%
  group_by(Species) %>%
  mutate(mean_sepal = mean(Sepal.Length))
```
Now let's say we'd like to do the same with a function that takes 2 arguments.
In dplyr it's pretty straightforward and again,
just a minor and intuitive tweak of the previous code:
```{r}
iris %>%
  group_by(Species) %>%
  mutate(mean_sepal = weighted.mean(Sepal.Length, Sepal.Width))
```
Thanks to [Samuel Oranyeli](https://gist.github.com/samukweku/88272c539743e9507dd275e1d2d71018)
I now know this can be achieved in pandas using:
```{python}
(
    iris
    .assign(weighted_sum = iris['Sepal.Length'].mul(iris['Sepal.Width']),
            mean_sepal = lambda df: df.groupby('Species')
            .pipe(lambda df: df.weighted_sum.transform('sum') /
                  df['Sepal.Width'].transform('sum'))
            )
    .drop(columns = "weighted_sum")
)
```
You can judge for yourself how elegant or straightforward this is.
### Expanding windows
Now let's say we'd like to calculate an expanding sum of Sepal.Length over increasing
values of Sepal.Width within each species (in SQL: SUM(Sepal.Length)
OVER(PARTITION BY Species ORDER BY Sepal.Width)).
In dplyr it's pretty straightforward:
```{r}
iris %>%
  arrange(Species, Sepal.Width) %>%
  group_by(Species) %>%
  mutate(expanding_sepal_sum = cumsum(Sepal.Length))
```
Generally for a function `f` we can use sapply like so:
```{r}
f <- function(x) median(x) + 1
iris %>%
  arrange(Species, Sepal.Width) %>%
  group_by(Species) %>%
  mutate(expanding_median_plus_1 = sapply(1:n(), function(x) f(Sepal.Length[1:x])))
```
Notice we don't need to memorize any additional functions/methods. One finds
a solution using ubiquitous tools (e.g. sapply) and just plugs it into the dplyr
chain.
In pandas we'll have to search stack overflow to come up with the *expanding*
method:
```{python}
(
    iris
    .sort_values(['Species', 'Sepal.Width'])
    .groupby('Species')
    .expanding().agg({'Sepal.Length': 'sum'})
    .rename({'Sepal.Length':'expanding_sepal_sum'}, axis = 1)
)
```
Again, we need to use an additional *rename* call.
You might want to pass a tuple to *agg* like you're used to in order to avoid
the additional *rename*, but for some reason the following syntax just won't work:
```{python}
(
    iris
    .sort_values(['Species', 'Sepal.Width'])
    .groupby('Species')
    .expanding()
    .agg(expanding_sepal_sum = ('Sepal.Length', 'sum'))
)
```
You could also avoid the additional rename by using the following somewhat
cumbersome syntax:
```{python}
(
    iris
    .assign(expanding_sepal_sum = lambda x: x.sort_values(['Species', 'Sepal.Width'])
            .groupby('Species').expanding().agg({'Sepal.Length': 'sum'})
            .reset_index()['Sepal.Length'])
)
```
Yet another way of doing it (credit to [Samuel Oranyeli](https://gist.github.com/samukweku/88272c539743e9507dd275e1d2d71018))
would be:
```{python}
(
    iris
    .sort_values(["Species", "Sepal.Width"])
    .assign(expanding_sepal_sum = lambda df: df.groupby("Species")['Sepal.Length']
            .expanding()
            .sum()
            .array)
)
```
### Moving windows
Now let's say we'd like to calculate a moving central window mean (in SQL: AVG(Sepal.Length)
OVER(PARTITION BY Species ORDER BY Sepal.Width ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)).
As usual, in dplyr it's pretty straightforward:
```{r}
iris %>%
  arrange(Species, Sepal.Width) %>%
  group_by(Species) %>%
  mutate(moving_mean_sepal_length = sapply(
    1:n(),
    function(x) mean(Sepal.Length[max(x - 2, 1):min(x + 2, n())])
  ))
```
As in the other examples, all you have to do is find a solution using ubiquitous
tools and plug it into the dplyr chain.
In pandas we'd have to look up the *rolling* method, read its documentation and
come up with the following:
```{python}
(
    iris
    .sort_values(['Species', 'Sepal.Width'])
    .groupby('Species')
    .rolling(window = 5, center = True, min_periods = 1)
    .agg({'Sepal.Length': 'mean'})
    .rename({'Sepal.Length':'moving_mean_sepal_length'}, axis = 1)
)
```
Yet another way of doing it (credit to [Samuel Oranyeli](https://gist.github.com/samukweku/88272c539743e9507dd275e1d2d71018))
would be:
```{python}
(
    iris
    .sort_values(["Species", "Sepal.Width"])
    .assign(moving_mean_sepal_length = lambda df: df.groupby("Species")['Sepal.Length']
            .rolling(window=5, center=True, min_periods=1)
            .mean()
            .array)
)
```
## Case when
Below we can see an example of using a case when statement in dplyr:
```{r}
iris %>%
  mutate(nice = case_when(
    Species == "setosa" & Sepal.Length < 5 ~ "value C",
    Species == "versicolor" & Sepal.Length < 5 ~ "value A",
    Species == "virginica" & Sepal.Length < 5 ~ "value G",
    T ~ "high"
  ))
```
pandas on the other hand [has no dedicated case when function](https://github.com/pandas-dev/pandas/issues/39154).
It uses **numpy**'s *select* instead:
```{python}
iris.assign(wow = lambda x: np.select(
    [
        (x['Species'] == "setosa") & (x['Sepal.Length'] < 5),
        (x['Species'] == "versicolor") & (x['Sepal.Length'] < 5),
        (x['Species'] == "virginica") & (x['Sepal.Length'] < 5)
    ],
    [
        'value C',
        'value A',
        'value G'
    ],
    default='high'))
```
We can see that:
1. The syntax is way less readable. In order to match a condition with its outcome one
needs to index the conditions and outcomes in their head.
2. It can get quite messy to understand which value results from which condition
if the list of conditions becomes long.
## Coalesce
This function finds the first non-missing value at each position, similar to SQL's COALESCE.
For example, below we extract the first non-missing value from 2 fields and otherwise return 0.
In dplyr it's pretty easy and readable:
```{r, eval=T}
df <- tibble(s1=c(NA,NA,6,9,9),s2=c(NA,8,7,9,9))
df %>% mutate(s3=coalesce(s1,s2,0))
```
In pandas the solution is much less readable and concise:
```{python, eval =T}
df=pd.DataFrame({'s1':[np.nan,np.nan,6,9,9],'s2':[np.nan,8,7,9,9]})
df.assign(s3 = df.s1.combine_first(df.s2).fillna(0))
```
## pandas is missing variable autocompletion
In dplyr data masking enables you to have variable completion when writing
your code. See below what that looks like:

In pandas you pass strings when you select variables and when you use the *agg*, *sort_values*,
*query* and *groupby* methods. In all those cases you don't get variable
auto-completion.
## When you learn dplyr you can also leverage data.table, spark, postgres and many others
When you encounter use cases where pandas can't be used (e.g. big data and remote
databases) you'll have to learn another package to tackle them.
dplyr users however can use the same syntax to tackle many use cases with various
other languages as back-ends. Granted, for advanced usage one would probably have
to resort to more dedicated APIs or use the native language itself. But for the
usual use cases of filtering, selecting, aggregating etc. dplyr syntax
should work fine.
The [dbplyr](https://dbplyr.tidyverse.org/) package enables using remote database
tables as if they are in-memory data frames by automatically converting dplyr
code into SQL. Supported backends include Postgres, BigQuery, Impala and many others.
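As a quick illustration, here's a minimal sketch (assuming the DBI and RSQLite packages are also installed) in which the same dplyr pipeline from the [aggregation section](#aggregation) runs against a database table, with *show_query* revealing the SQL dbplyr writes for you:
```{r}
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # toy in-memory database
DBI::dbWriteTable(con, "iris", iris)

tbl(con, "iris") %>%
  group_by(Species) %>%
  summarise(mean_length = mean(Sepal.Length, na.rm = TRUE)) %>%
  show_query() # prints the generated SQL instead of collecting results
```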
The [sparklyr](https://spark.rstudio.com/) package enables connecting to spark
clusters and executing spark jobs by converting dplyr syntax to Spark syntax.
The [dtplyr](https://dtplyr.tidyverse.org/) package enables working fast
with large data frames that still fit in memory by using the data.table package
as backend. You can see a [small benchmarking](https://iyarlin.medium.com/data-table-speed-with-dplyr-syntax-yes-we-can-51ef9aaed585)
I did to see how effective it is.
## data.table is way faster than pandas
For most cases when working with data frames that fit in memory there will probably
be no difference in compute time between pandas and dplyr/data.table.
There are instances however, when working with medium to large datasets, where
pandas will significantly lag behind data.table. There's a whole [website](https://h2oai.github.io/db-benchmark/)
dedicated to benchmarking database-like operations across many languages. We
can see below an example where data.table is the fastest out there, leaving
pandas (among others) in the dust.

python users might rejoice over the fact that pandas seems much faster than dplyr.
The thing is, dplyr can use data.table as a backend and enjoy all the [syntax joy](#aggregation)
of dplyr [without losing much steam](https://iyarlin.medium.com/data-table-speed-with-dplyr-syntax-yes-we-can-51ef9aaed585).
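A minimal sketch of what that looks like with the dtplyr package (assuming dtplyr and data.table are installed) - the dplyr code is unchanged, only the input is wrapped with *lazy_dt*:
```{r}
library(dplyr)
library(dtplyr)

iris_dt <- lazy_dt(iris) # a data.table-backed lazy frame

iris_dt %>%
  group_by(Species) %>%
  summarise(mean_length = mean(Sepal.Length)) %>%
  as_tibble() # the dplyr verbs above get translated to data.table before executing
```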
## pandas index
This one might be a bit subjective, but for me the pandas index always seems like
much more of a nuisance than a useful feature. I encourage the reader to check how many
times they use the *reset_index* method in their code. Other than some niceties
relating to how the data frame is printed (e.g. multi-indexes, see [this](#multi_index)
for example) I can't think of a use case where an index is really needed.
# Rstudio IDE is way better than jupyter notebooks
When working with python one might opt to use the [spyder IDE](https://www.spyder-ide.org/).
Last time I checked it was a poor man's Rstudio so I restrict my comparison to
jupyter notebooks, which are by far the most popular IDE for DS python.
Working with Rmarkdown within Rstudio really feels like jupyter notebooks on
steroids. In addition to the ability to assemble code, plots and tables within
the same document you get tons of other useful features. Since there are so many,
below I'll list just a few notable features that make Rstudio so much better to
work with than jupyter notebooks.
## Code autocompletion
Below we can see a code completion example in Rstudio:

We can see that:
1. When calling functions you get documentation of arguments on the go, both an
overview and detailed documentation in pop-up windows.
1. Used arguments don't show up in the auto-completion (after setting x it
doesn't show up again).
1. When calling functions in a dplyr chain you have the data frame variables
listed and autocompleted (thanks to data masking).
In contrast, see below the auto-completion results for a pandas DataFrame in
jupyter notebooks:

We can see you get a long list of irrelevant stuff.
## Console
Rstudio has a console - this comes in handy whenever you want to execute short
pieces of code to check stuff without keeping it in your script. Not having this
in jupyter notebooks is a pain that is compounded by the absence of the
following feature:
## Running code line by line
Running code line by line can be especially important when you want to debug a
chunk of code. See an example below:
<a name="previous_example"></a>

In jupyter notebooks you'd have to copy each line separately to a new cell,
fix the indentation and run it. You can't even just copy it to a console!
In the above we not only see how easy it is to debug a function in Rstudio,
we also see a run status bar on the left showing exactly where the error
occurred when running the chunk as a whole (which is also useful when running
big chunks, to know where in the chunk your code is currently running).
Using Rstudio version 1.4.1106 on Mac I can also run python code line by line.
## Table viewing is interactive
When viewing tables with lots of rows it's useful to be able to scroll them
row-wise:

In jupyter you can only scroll column-wise.
## Variable explorer
In the [previous example](#previous_example) we could also see a variable explorer, which does not
exist in jupyter notebooks. It comes in handy especially when context switching or
when dynamically inspecting results (e.g. the number of rows in a filtered dataframe).
I'm using Rstudio version 1.4.1106 and when using the reticulate package to work
with python I have a variable explorer. At this rate using Rstudio would be
preferable over jupyter notebooks even when using python.

## Debugger
Rstudio has a full-fledged [visual debugger!](https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio) The Jupyter blog [announced](https://blog.jupyter.org/a-visual-debugger-for-jupyter-914e61716559)
a visual debugger back in March 2020 but I have yet to see it installed anywhere
in practice. It's possible that's due to the fact that it's still experimental and
available for only a specific kernel type:
*"The only kernel implementing this protocol, for now, is xeus-python a new
Jupyter kernel for the Python programming language"*
This naturally leads to the next point:
## Installation and dependency management
To take advantage of all the features in Rstudio you need only download the
latest version and install it. Jupyter notebooks are much more fragmented, with
many of the more advanced features installed separately, often working on only
a subset of machines or requiring you to manage all sorts of dependencies.
## Table of contents
When working on a very large document - like the one I'm writing right now
in Rstudio - it can become pretty hard to navigate the different document parts.
In Rstudio you get an automatic table of contents:

<a name="documentation_render"></a>
## Documentation rendering
Documentation in Rstudio is rendered very nicely in a separate window, supporting
all sorts of formatting. See for example:

In jupyter notebooks however doc strings are printed within the notebook, so if it's
long one has to scroll up and down to view both the doc and the code they want
to edit:

# R is just as capable for ML as python, if not better
One recurring theme is that R is good for statistics (which is true) while python
(namely scikit-learn) is preferable for ML.
scikit-learn is a meta-ML package - meaning it manages the entire ML pipeline
(cross validation, transformations, benchmarking etc.). R has several mature
meta-ML packages of its own - most prominently [mlr3](https://mlr3.mlr-org.com/)
and [tidymodels](https://www.tidymodels.org/).

You can read more about mlr3 and tidymodels in this great [blog post](https://medium.com/analytics-vidhya/meta-machine-learning-aggregator-packages-in-r-round-ii-71ee1ff68642).
While it's hard to compare sklearn with the R packages, it's safe to say R has
no shortage of high-quality meta-ML packages.
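To give a flavor, below is a minimal tidymodels sketch (assuming the tidymodels meta-package is installed; the split, recipe and model choices are arbitrary and only meant to show the pipeline structure):
```{r}
library(tidymodels)

split <- initial_split(iris, prop = 0.8) # train/test split
rec <- recipe(Sepal.Length ~ ., data = training(split)) # preprocessing recipe
mod <- linear_reg() %>% set_engine("lm") # model specification

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)

fit(wf, data = training(split)) %>%
  predict(new_data = testing(split)) %>%
  bind_cols(testing(split)) %>%
  metrics(truth = Sepal.Length, estimate = .pred) # benchmark on the test set
```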
## sklearn does not support categorical variables in decision trees
This one's actually pretty important. When constructing a tree in R on a dataset
that contains a categorical variable it can send several categories down each node.
To show that I'll use the iris dataset and create a "sub-species" variable which
just splits every species in half (so 6 levels overall).
We next fit a tree model:
```{r, eval=T}
library(rpart)
set.seed(1)
iris_for_tree <- iris %>%
  mutate(
    subspecies =
      paste0(Species, sample.int(2,
        size = n(),
        replace = T
      ))
  ) %>%
  select(Sepal.Length, subspecies)
rpart(Sepal.Length ~ subspecies, data = iris_for_tree, cp = 0.02)
```
We can see that on the first split 2 categories (setosa1, setosa2) went down the
left node while the rest went to the right - which makes sense as there's no
real difference between setosa1 and setosa2. In the end the tree split the nodes
exactly according to the original species!
sklearn however [does not support categorical data](https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart).
This means we have to one-hot encode those variables - effectively meaning you can only do
one-vs-the-rest splitting. In the example above setosa1 and setosa2 couldn't be sent
together to the same node in a single split, and you'd end up making bad splits with the same species
in different leaves. This can have serious implications, especially in cases
where categories with a large number of levels are present in the dataset.
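To get a feel for what that workaround implies, here's a hedged R sketch (not sklearn itself) that one-hot encodes the sub-species with *model.matrix* and fits the same rpart tree on the dummy columns only, so each split can separate just a single category from the rest:
```{r}
# dummy-code the categorical variable ourselves, mimicking what sklearn forces you to do
dummies <- as.data.frame(model.matrix(~ subspecies - 1, data = iris_for_tree))
dummies$Sepal.Length <- iris_for_tree$Sepal.Length
rpart(Sepal.Length ~ ., data = dummies, cp = 0.02)
```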
# Time series
While some effort is underway to close the gap (see for example the [pmdarima](https://github.com/alkaline-ml/pmdarima)
package, which aims to imitate R's auto.arima), python's packages for time series
analysis, especially in more classic contexts like exponential smoothing, are still
way less developed.
Of course, statsmodels has classical ARIMA models and models from the exponential smoothing (ETS) family, but they all require painstaking manual setting of the hyperparameters (p, d, q, P, D, Q) of each time series, or of the parameters of the exponential models.
The reality of forecasting in modern business is that there are many (very many) time series, they are all of different lengths, many of them are far from stationary and they often contain gaps and outliers.
A single framework is needed that can decide for itself, for each time series, which model is more appropriate (ARIMA, regression, theta, prophet or one of the ETS family), let alone configure the parameters of ARIMA/ETS/... for each series without manual tuning.
Such a framework should contain convenient tools for visualizing many time series and model prediction results, with code that stays concise and readable (something python lacks here).
It should contain convenient tools for cross-validation of time series and for building different models for each series when there are many of them.
It should have tools for hierarchical forecasting: the detailed (lower) level time series are aggregated up to different levels by time (week, month, year), by product (brand, category, etc.) or by customer (staff, sales channel, etc.), and the models trained at the levels above pass the patterns they find down to the lower-level models, which cannot see those patterns from where they stand.
It should also be able to extract characteristics from each time series (regardless of their length and nature) so that the series can later be clustered based on these characteristics.
Such a collection of packages exists in R. It's called [tidyverts](https://tidyverts.org/), and Rob Hyndman, well known among forecasting specialists, has written a book about it, [Forecasting: Principles and Practice](https://otexts.com/fpp3/), now in its 3rd edition to keep pace with the evolution of the packages.
(By the way, statsmodels likes to refer to this author in the help for its forecasting functions.)
The design of the tidyverts ecosystem mirrors that of the [tidyverse](https://www.tidyverse.org/) ecosystem, so everything is compatible between these two families, which gives great productivity and code readability.
Below is an example of forecasting a set of time series with a set of models based on the fable package:
```{r,message=FALSE,echo=T, results='hide', eval=TRUE}
pckgs <- c('fable','fable.prophet','tsibble','feasts')
# install.packages(pckgs) # install packages of 'tidyverts' ecosystem
lapply(X = c(pckgs,'tidyverse'), FUN = library, character.only = TRUE) # load 'tidyverts' + 'tidyverse' packages
```
```{r, eval=TRUE}
# extract and filter a multiple-timeseries data frame from the tsibble datasets
data(tourism, package = 'tsibble')
tourism_melb <- tourism %>% filter(Region == "Melbourne", State == "Victoria") %>% select(-Region, -State)
tourism_melb # note "Key: Purpose" in the header (4 time series groups: Business, Holiday, Visiting, Other); see the tsibble docs about creating a key
tourism_melb %>% autoplot(.vars = Trips) # plot multiple time series in one line of code
train <- tourism_melb %>% filter_index(~'2015 Q4') # filter time series from the start to Q4 2015
fit <- train %>% # create multiple auto-tuned models for each time series by tsibble key
  model(
    ets = ETS(Trips ~ trend("A")), # auto exponential smoothing models (Holt, Holt-Winters, ...etc)
    arima = ARIMA(Trips), # auto ARIMA
    snaive = SNAIVE(Trips), # seasonal naive
    # tslm = TSLM(log(Trips) ~ trend() + season()), # you may (un)comment any model without changing the code below (regression with trend + seasonality)
    theta = THETA(Trips), # theta model
    prophet = prophet(Trips ~ season("year", 4, type = "multiplicative"))
  ) %>%
  mutate(combine = (ets + arima + snaive + prophet + theta) / 5) # create a model ensemble
fit
```
```{r, eval=TRUE}
fc <- fit %>% forecast(h = "2 years") # create forecasts from each model for each timeseries
fc
```
```{r, eval=TRUE}
fc %>% autoplot(tourism_melb) # plot forecast and actual values in one line of code
```
```{r, eval=TRUE}
ac <- fc %>% accuracy(data = tourism_melb, measures = list(mape = MAPE, rmse = RMSE)) %>%
  relocate(Purpose, .before = .model) %>% arrange(Purpose, rmse) # evaluate model accuracy on the test period (> '2015 Q4')
ac
```
```{r, eval=TRUE}
best_ac <- ac %>% group_by(Purpose) %>% arrange(rmse) %>% slice(1) %>% ungroup() # get the best model for each time series by min(rmse)
best_ac # we can see that for 'Business' and 'Holiday' it's prophet, while for 'Other' and 'Visiting' it's ARIMA
```
```{r, eval=TRUE}
fc_bestmodels <- fc %>% inner_join(best_ac, by = c('Purpose', '.model')) %>% tibble() %>%
  select(Purpose, Quarter, frcst_val = .mean, .model) # get the best forecast values
fc_bestmodels
```
## Feature extraction for clustering or finding unusual time series
Below is a small example of how to use the [feasts package](https://feasts.tidyverts.org/) to extract features from time series for subsequent clustering or for finding unusual series after dimensionality reduction:
```{r, eval=TRUE}
# https://robjhyndman.com/hyndsight/fbtsa/
# Compute features
tourism # 304 timeseries (key[304] in table title)
tourism_features <- tourism %>% features(Trips, feature_set(pkgs="feasts"))
tourism_features # 50(!!!) features extracted for each time series by 1 line of code
# Unusual time series can be identified by first doing a principal components decomposition
pcs <- tourism_features %>% select(-State, -Region, -Purpose) %>% prcomp(scale=TRUE) %>% augment(tourism_features)
pcs %>% ggplot(aes(x=.fittedPC1, y=.fittedPC2, col=Purpose)) + geom_point() + theme(aspect.ratio=1)
```
```{r, eval=TRUE}
# We can then identify some unusual series.
unusual <- pcs %>% filter(.fittedPC1 > 10.5) %>% select(Region, State, Purpose, .fittedPC1, .fittedPC2)
# plot unusual time series by 1 line of code
tourism %>% semi_join(unusual) %>% autoplot()
```
Achieving the same with python would've taken much more code and effort.
# python has no list equivalent class
In R the list data structure lets one store arbitrary objects in a vector
which can be accessed by both element index and element name. For example:
```{r}
r_list <- list(
a_vector = c("a", "c", "R"), a_scalar = 3,
another_list = list("yay", c(1, 3))
)
## access by index:
r_list[[1]]
## access by names:
r_list$a_scalar
```
In python one can either store data in a list which can be accessed by element
index but has no names:
```{python}
python_list = [["a","c","R"], 3, ["yay", [1,3]]]
python_list[2]
```
or in a dictionary which allows accessing elements by name but not by positional
index (even though dictionaries preserve insertion order these days, you can't
subscript them by position):
```{python}
python_dict = {
'a_vector':["a","c","R"],
'a_scalar':3,
'another_list':["yay", [1,3]]}
python_dict['a_vector']
```
# Package management and distribution
In R every package that is hosted on CRAN needs to satisfy a list of requirements
that ensure it's well documented and usable.
One might argue that this stifles code sharing but package hosting and installing
from github is pretty straightforward in R. I do that myself.
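For instance, installing a package straight from GitHub boils down to one line. A small sketch (it assumes the remotes package is installed and uses the dplyr GitHub repo purely as an illustration):
```{r}
# install.packages("remotes") # if not already installed
remotes::install_github("tidyverse/dplyr") # any "user/repo" hosted on GitHub works the same way
```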
pypi on the other hand has no such standards and as a result, you never really
know what you get.

*credit to Adam Soffer for the brilliant reference*
Below I elaborate on a couple of issues that arise from this setup:
## Dependency management
R package versions rarely break your code - things that used to work just keep
working in new versions.
When you pick up a new library in python it's not rare to need to either downgrade
or look up very specific library versions that work with the one you're trying
to use. This problem is so big that using virtual environments becomes almost a
necessity in order to have any chance of managing all the dependencies across different
projects. The headache doesn't end there, as you'll need to make sense of which
virtual environment tool to use: there's venv, virtualenv, pyenv and many more!
You can see some of the confusion manifested in this [stack overflow](https://stackoverflow.com/questions/41573587/what-is-the-difference-between-venv-pyvenv-pyenv-virtualenv-virtualenvwrappe) question.
The *packrat* and *renv* packages in R fulfill the same functionality, but one rarely **has** to use them - they're more for maintaining reproducibility.
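When you do want that reproducibility, a minimal renv workflow (assuming the renv package is installed) is just a few calls:
```{r}
renv::init()     # create a project-local library and an renv.lock lockfile
renv::snapshot() # record the exact package versions currently in use
renv::restore()  # recreate that library elsewhere from renv.lock
```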
## Documentation
Documentation in R has a predetermined structure: every help page needs to have a title,
a description, a usage pattern, arguments, a details section etc., which makes reading
new documentation much easier because you know what to expect and where to look
for the information you're after.
R documentation is also [rendered very nicely in the Rstudio IDE](#documentation_render)
(not strictly an R feature by itself I admit).
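For illustration, here's a hypothetical function documented with roxygen2, showing the standard fields a help page is built from (the function and its documentation are made up for this sketch):
```{r}
#' Weighted mean of sepal length
#'
#' A made-up helper used only to illustrate the standard documentation fields.
#'
#' @param x A numeric vector of values.
#' @param w A numeric vector of weights, the same length as x.
#' @return A single numeric value.
#' @details Missing values are not handled - this is just a documentation sketch.
#' @examples
#' weighted_sepal_mean(iris$Sepal.Length, iris$Sepal.Width)
weighted_sepal_mean <- function(x, w) sum(x * w) / sum(w)
```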
# Vectorization
In R the basic objects are all vectors and thus vectorized operations are natural.
In python however the basic objects are lists, so one has to either use list
comprehensions, which are less readable and slower, or convert the lists to numpy
arrays to get the same vectorization capabilities.
For example, let's say we'd like to multiply all elements of a vector by 2. In R:
```{r vectorizedR, eval=TRUE}
myVector <- 1:5
myVector * 2
```
Doing the same for a list would look like this:
```{python, eval=TRUE}
myList = [1,2,3,4,5]
[i*2 for i in myList]
```
This can be really slow for very large vectors. We could import the numpy module
to get the same functionality:
```{python vectorizedPy, eval=TRUE}
import numpy as np
myVector = np.arange(1,6,1) # notice how the "to" argument is 6 rather than 5.
myVector * 2
```
Here's another example: Suppose we'd like to pick elements of a vector based
on another boolean/index vector. In R:
```{r, eval=TRUE}
indexVector <- 1:5
valueVector <- LETTERS[indexVector]
greaterThan2 <- indexVector > 2
valueVector[greaterThan2]
```
Using python lists isn't so straightforward. Based on this stack overflow
[answer](https://stackoverflow.com/a/3179119)
(which got 147 upvotes - to a question that got 113 upvotes as of 10.6.2021) you'd
need to use a list comprehension like so:
```{python, eval=TRUE}
indexVector = [1,2,3,4,5]
valueVector = [r.LETTERS[i - 1] for i in indexVector]
greaterThan2 = [indexVector[i] > 2 for i in range(len(indexVector))]
[val for is_good, val in zip(greaterThan2, valueVector) if is_good]
```
Way less elegant and possibly slow for large vectors.
It's so inelegant that a special function was written to accomplish the task:
```{python, eval=TRUE}
from itertools import compress
list(compress(valueVector, greaterThan2))
```
Even that solution is pretty cumbersome compared with the R native way of doing
it.