<!DOCTYPE html>
<html lang="en">
<head>
<!-- Google tag (gtag.js) -->
<script
async
src="https://www.googletagmanager.com/gtag/js?id=G-25389D1SR4"
></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag("js", new Date());
gtag("config", "G-25389D1SR4");
</script>
<meta charset="utf-8" />
<meta
name="viewport"
content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no"
/>
<meta name="robots" content="index,follow" />
<meta
name="keywords"
content="technology, workshop, data visualization, Python, data science, slides"
/>
<meta name="theme-color" content="#000" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:site" content="@StefanieMolin" />
<meta name="twitter:creator" content="@StefanieMolin" />
<meta property="og:type" content="website" />
<meta property="og:locale" content="en_US" />
<meta property="og:site_name" content="Stefanie Molin" />
<meta name="author" content="Stefanie Molin" />
<meta name="referrer" content="origin" />
<meta
property="og:url"
content="https://stefaniemolin.com/data-morph-talk/"
/>
<meta
property="og:title"
content="Data Morph: A Cautionary Tale of Summary Statistics | Stefanie Molin"
/>
<meta
name="description"
content="Relying solely on simple summary statistics like the mean, median, or standard deviation is not enough to describe complex data. Come and see why this is the case and learn what it takes to translate research into an open-source library."
/>
<meta
property="og:description"
content="Relying solely on simple summary statistics like the mean, median, or standard deviation is not enough to describe complex data. Come and see why this is the case and learn what it takes to translate research into an open-source library."
/>
<meta
property="og:image"
content="https://stefaniemolin.com/assets/articles/data-science/introducing-data-morph/panda-to-star.gif"
/>
<meta property="og:image:width" content="774" />
<meta property="og:image:height" content="379" />
<meta
property="og:image:alt"
content="Data Morph: A Cautionary Tale of Summary Statistics"
/>
<title>
Data Morph: A Cautionary Tale of Summary Statistics slides | Stefanie
Molin
</title>
<link rel="manifest" href="/favicon/site.webmanifest" />
<link rel="shortcut icon" type="image/x-icon" href="/favicon/favicon.ico" />
<link
rel="apple-touch-icon"
sizes="180x180"
href="/favicon/apple-touch-icon.png"
/>
<link
rel="icon"
type="image/png"
sizes="32x32"
href="/favicon/favicon-32x32.png"
/>
<link
rel="icon"
type="image/png"
sizes="16x16"
href="/favicon/favicon-16x16.png"
/>
<link
rel="icon"
type="image/png"
sizes="192x192"
href="/favicon/android-chrome-192x192.png"
/>
<link
rel="icon"
type="image/png"
sizes="512x512"
href="/favicon/android-chrome-512x512.png"
/>
<link
rel="stylesheet"
href="https://unpkg.com/reveal.js@5.1.0/dist/reset.css"
/>
<link
rel="stylesheet"
href="https://unpkg.com/reveal.js@5.1.0/dist/reveal.css"
/>
<link
rel="stylesheet"
href="https://unpkg.com/reveal.js@5.1.0/dist/theme/simple.css"
/>
<!--syntax highlighting in code snippets (light theme)-->
<link
rel="stylesheet"
href="https://unpkg.com/@highlightjs/cdn-assets@11.9.0/styles/stackoverflow-light.min.css"
/>
<!-- Font Awesome icons -->
<link
href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.1.1/css/all.min.css"
rel="stylesheet"
type="text/css"
/>
<style type="text/css">
:root {
--r-heading1-size: 2em;
--r-heading2-size: 1.2em;
--r-heading3-size: 1em;
--r-heading4-size: 0.85em;
}
.reveal .slides > section,
.reveal .pdf-page > section,
.reveal .slides > .scroll-page {
text-align: left !important;
}
img {
padding: 0 !important;
margin: 0 !important;
}
.footnotes {
text-indent: -5px;
padding-left: 20px;
width: 90%;
}
.footnote {
text-align: left;
}
.footnote::before {
content: "*";
}
h6 {
font-size: 0.8em;
}
ul,
ol,
p {
font-size: 0.65em;
}
ul.references {
font-size: 0.55em;
}
#bio ul {
font-size: 0.715em;
}
ul > li:not(:last-child),
ol > li:not(:last-child) {
margin-bottom: 0.5em;
}
.center {
text-align: center;
}
small {
font-size: 0.3em !important;
text-align: center;
}
.footer {
position: absolute;
bottom: 10px;
text-wrap: nowrap;
width: 100%;
text-align: center;
z-index: 100;
font-size: 4vmin;
}
.license {
font-size: 1.5vmin;
padding-top: 0.5vmin;
}
pre > code {
padding: 20px !important;
border-radius: 10px;
}
pre.code-wrapper {
border-radius: 10px;
}
:not(pre) > code {
background-color: #eee;
padding: 0.125rem 0.25rem;
border-radius: 5px;
}
.hide-line-numbers .hljs-ln-numbers {
display: none;
}
.r-stack-left {
justify-content: start;
}
.r-stack-left > p {
margin: 0 !important;
}
</style>
</head>
<body>
<div class="reveal">
<div id="footer-info" style="display: none">
<div class="footer">
<a href="https://stefaniemolin.com">stefaniemolin.com</a>
<div class="license">
License:
<a
href="http://creativecommons.org/licenses/by-nc-sa/4.0/"
style="z-index: 1"
target="_blank"
rel="noopener noreferrer"
>
CC BY-NC-SA 4.0
</a>
</div>
</div>
</div>
<div class="slides">
<section>
<h1 class="center">Data Morph</h1>
<h2 class="center">A Cautionary Tale of Summary Statistics</h2>
<br />
<br />
<h3 class="center">Stefanie Molin</h3>
</section>
<section id="bio">
<h2>Bio</h2>
<ul>
<li>👩🏻‍💻 Software engineer at Bloomberg in NYC</li>
<li>✨ Founding member of Bloomberg's Data Science Community</li>
<li>
✍ Author of "<a
href="https://stefaniemolin.com/books/Hands-On-Data-Analysis-with-Pandas-2nd-edition/"
>Hands-On Data Analysis with Pandas</a
>"
</li>
<li>
🎓 Bachelor's in operations research from Columbia University
</li>
<li>
🎓 Master's in computer science (ML specialization) from Georgia
Tech
</li>
</ul>
<aside class="notes">
<h2>Talk outline</h2>
<div>
<ul>
<li>Why summary statistics aren't enough</li>
<li>Introduction to Data Morph</li>
<li>How Data Morph works</li>
<li>Limitations and areas for future work</li>
<li>Lessons learned and challenges faced</li>
</ul>
</div>
</aside>
</section>
<section id="summary-statistics">
<h2 class="center">Summary statistics aren't enough</h2>
</section>
<section id="visually-different-datasets">
<p>These datasets are clearly different:</p>
<div class="center">
<img src="media/example_datasets.png" alt="example datasets" />
</div>
<hr align="left" style="width: 33%; margin-bottom: 5px" />
<div class="footnotes">
<small class="footnote"
>The Python logo is a
<a
href="https://www.python.org/psf/trademarks/"
target="_blank"
rel="noopener noreferrer"
>trademark of the Python Software Foundation (PSF)</a
>, used with permission from the Foundation.</small
>
</div>
</section>
<section id="visually-different-same-statistics">
<p>
However, we would not know that if we were to only look at the
summary statistics:
</p>
<div class="center">
<img src="media/stats.png" alt="summary statistics are the same" />
</div>
<hr align="left" style="width: 33%; margin-bottom: 5px" />
<div class="footnotes">
<small class="footnote"
>The Python logo is a
<a
href="https://www.python.org/psf/trademarks/"
target="_blank"
rel="noopener noreferrer"
>trademark of the Python Software Foundation (PSF)</a
>, used with permission from the Foundation.</small
>
</div>
</section>
<section id="moments">
<p>
What we call <em>summary statistics</em> summarize only part of the
distribution. We need many <b>moments</b> to describe the shape of a
distribution (and distinguish between these datasets):
</p>
<div class="center">
<img src="media/moments.png" alt="moments" />
</div>
<hr align="left" style="width: 33%; margin-bottom: 5px" />
<div class="footnotes">
<div>
<small class="footnote"
>The first moment is the center of mass of the distribution (the
mean). Here, we use central moments, which are independent of
translation, so the first moment is zero (we subtract the mean).
The second central moment is the variance. It is only at the third
moment (skewness) that we can differentiate between these datasets.
Further moments, like kurtosis (fourth), provide even more
information.</small
>
</div>
<div style="margin-top: -8px">
<small class="footnote"
>The Python logo is a
<a
href="https://www.python.org/psf/trademarks/"
target="_blank"
rel="noopener noreferrer"
>trademark of the Python Software Foundation (PSF)</a
>, used with permission from the Foundation.</small
>
</div>
</div>
</section>
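<section id="moments-sketch">
<p>
A minimal sketch of the idea, using made-up numbers rather than the
datasets pictured: central moments can be computed directly with
NumPy. The helper <code>central_moment()</code> and both samples are
illustrative assumptions, not part of Data Morph. The two samples
mirror each other around their shared mean, so they agree on the
first and second central moments but have third central moments of
opposite sign:
</p>
<pre>
<code data-trim class="language-python">
import numpy as np


def central_moment(values, k):
    """Return the k-th central moment: the mean of (x - mean) ** k."""
    values = np.asarray(values, dtype=float)
    return np.mean((values - values.mean()) ** k)


# Hypothetical samples: mirror images around a shared mean of 3
right_skewed = [1, 1, 2, 2, 9]
left_skewed = [5, 5, 4, 4, -3]

for k in (1, 2, 3):
    print(
        f"moment {k}:",
        central_moment(right_skewed, k),
        central_moment(left_skewed, k),
    )
</code>
</pre>
</section>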
<section id="marginal-distributions">
<p>
Adding in histograms for the marginal distributions, we can see the
distributions of both <em>x</em> and <em>y</em> are indeed quite
different across datasets. Some of these differences are captured in
the third moment (<b>skewness</b>) and the fourth moment
(<b>kurtosis</b>), which measure the asymmetry and weight in the
tails of the distribution, respectively:
</p>
<div class="center">
<img src="media/with_marginals.png" alt="marginal distributions" />
</div>
<hr align="left" style="width: 33%; margin-bottom: 5px" />
<div class="footnotes">
<small class="footnote"
>The Python logo is a
<a
href="https://www.python.org/psf/trademarks/"
target="_blank"
rel="noopener noreferrer"
>trademark of the Python Software Foundation (PSF)</a
>, used with permission from the Foundation.</small
>
</div>
</section>
<section id="correlation">
<p>
However, the moments aren't capturing the relationship between
<em>x</em> and <em>y</em>. If we suspect a linear relationship, we
may use the Pearson correlation coefficient, which is the same for
all three datasets below. Here, the visualization tells us a lot
more about the relationships between the variables:
</p>
<div class="center">
<img src="media/stats.png" alt="summary statistics static" />
</div>
<hr align="left" style="width: 33%; margin-bottom: 5px" />
<div class="footnotes">
<small class="footnote"
>The Python logo is a
<a
href="https://www.python.org/psf/trademarks/"
target="_blank"
rel="noopener noreferrer"
>trademark of the Python Software Foundation (PSF)</a
>, used with permission from the Foundation.</small
>
</div>
</section>
<section id="pearson-correlation-coefficient">
<p>
The Pearson correlation coefficient measures
<em>linear</em> correlation, so if we don't visualize our data, then
we have another problem: a high correlation (close in absolute value
to 1) does not mean the relationship is actually linear. Without a
visualization to contextualize the summary statistics, we do not
have an accurate understanding of the data.
</p>
</section>
<section id="anscombe-quartet">
<p>
For example, all four datasets in
<b>Anscombe's Quartet</b> (constructed in 1973) have strong
correlations, but only <b>I</b> and <b>III</b> have linear
relationships:
</p>
<div class="center">
<img src="media/anscombe.png" alt="Anscombe's Quartet" />
</div>
<div style="text-align: center">
<small
>This visual was created by Stefanie Molin using the Anscombe's
Quartet dataset as provided in
<a
href="https://github.com/mwaskom/seaborn"
target="_blank"
rel="noopener noreferrer"
>seaborn.</a
></small
>
</div>
</section>
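<section id="anscombe-quartet-sketch">
<p>
This can be checked numerically with a short sketch. It assumes the
published values for datasets <b>I</b> (linear) and
<b>II</b> (curved), typed in by hand here instead of being loaded
from seaborn; the variable names are illustrative. Both datasets
report the same mean and Pearson correlation coefficient when rounded
to two decimals:
</p>
<pre>
<code data-trim class="language-python">
import numpy as np

# Anscombe's Quartet: datasets I and II share the same x values
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y_linear = np.array(  # dataset I
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
)
y_curved = np.array(  # dataset II
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
)

for name, y in [("I", y_linear), ("II", y_curved)]:
    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
    print(f"Dataset {name}: mean(y) = {y.mean():.2f}, r = {r:.2f}")
</code>
</pre>
</section>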
<section id="visualization-is-essential">
<h3 class="center">
Visualization is an essential part of any data analysis.
</h3>
</section>
<section id="hypothesis-is-a-liability">
<p>
In their 2020 paper,
<a
href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02133-w"
target="_blank"
rel="noopener noreferrer"
><em>A hypothesis is a liability</em></a
>, researchers Yanai and Lercher argue that
<b
>simply approaching a dataset with a hypothesis may limit the
thoroughness with which the data is explored</b
>.
</p>
<p class="fragment">Let's take a look at their experiment.</p>
</section>
<section id="hypothesis-is-a-liability-experiment">
<h4>The experiment</h4>
<p>
Students in a statistical data analysis course were split into two
groups. One group was given the open-ended task of exploring the
data, while the other group was instructed to test the following
hypotheses:
</p>
<ol style="padding-left: 20px">
<li>
There is a difference in the mean number of steps between women
and men.
</li>
<li>
The correlation coefficient between steps and BMI is negative for
women.
</li>
<li>
The correlation coefficient between steps and BMI is positive for
men.
</li>
</ol>
<div style="text-align: right">
<p style="font-size: small">
<a
href="https://doi.org/10.1101/2020.07.30.228916"
rel="noopener noreferrer"
target="_blank"
>
(Yanai & Lercher, 2020)
</a>
</p>
</div>
</section>
<section id="hypothesis-is-a-liability-experiment-dataset">
<p>Here's what that dataset looked like:</p>
<div class="center">
<img
width="60%"
alt="Figure 1 from 'A hypothesis is a liability' by Itai Yanai & Martin Lercher"
src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1186%2Fs13059-020-02133-w/MediaObjects/13059_2020_2133_Fig1_HTML.png?as=webp"
/>
</div>
<div style="text-align: center">
<small
>Figure 1 from
<em
><a
href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02133-w"
target="_blank"
rel="noopener noreferrer"
>A hypothesis is a liability</a
></em
>
by Itai Yanai & Martin Lercher (<a
href="http://creativecommons.org/licenses/by/4.0/"
target="_blank"
rel="noopener noreferrer"
>Creative Commons Attribution 4.0 International License</a
>).</small
>
</div>
</section>
<section id="how-can-we-encourage-thoroughness">
<h4 class="center">
How can we encourage students and practitioners alike to be more
thorough in their analyses?
</h4>
</section>
<section id="teaching-aids">
<h3 class="center">Create more memorable teaching aids</h3>
</section>
<section id="datasaurus-dozen">
<p>
In 2017, Autodesk researchers created the <b>Datasaurus Dozen</b>,
building upon the idea of Anscombe's Quartet to make a more
impactful example:
</p>
<div class="center">
<img
src="media/datasaurus.png"
alt="Datasaurus Dozen"
width="500px"
style="margin: -10px auto"
/>
<br />
<div style="margin: auto 5%">
<small
>This visual was created by Stefanie Molin using the Datasaurus
Dozen dataset as provided by
<a
href="https://github.com/jmatejka/same-stats-different-graphs"
target="_blank"
rel="noopener noreferrer"
>jmatejka/same-stats-different-graphs.</a
></small
>
</div>
</div>
</section>
<section id="animation">
<p>
They also employed animation, which is even more impactful. Every
intermediate shape in the transition from the Datasaurus to the
circle shares the same summary statistics:
</p>
<div class="center">
<img
src="media/dino_to_circle.gif"
alt="Datasaurus to circle (Data Morph)"
/>
<br />
<small
>This visual was created by Stefanie Molin using Data
Morph.</small
>
</div>
</section>
<section id="but-no-we-have-a-problem">
<p class="center">But, now we have a new problem...</p>
</section>
<section id="what-is-special-about-the-datasaurus">
<h3 class="center">What's so special about the Datasaurus?</h3>
<h4 class="center fragment">NOTHING!</h4>
</section>
<section id="why-i-built-data-morph">
<p>
Since there was no easy way to do this for arbitrary datasets,
people assumed that this capability was a property of the Datasaurus
and were shocked to see it work with other shapes. The more examples
people see, and the more memorable those examples are, the better
this concept will stick – repetition is key to learning.
</p>
<p class="fragment">
This is why I built
<a
href="https://stefaniemolin.com/data-morph/stable/index.html"
target="_blank"
rel="noopener noreferrer"
>Data Morph</a
>.
</p>
</section>
<section id="Data-Morph-education-tool">
<h3>Data Morph is an educational tool</h3>
<p>It addresses the limitations of previous methods:</p>
<ul>
<li>
installable Python package that can be used without hacking at the
codebase
</li>
<li>animated results are provided automatically</li>
<li>possible to use additional datasets (built-in and custom)</li>
<li>
people can experiment with their own datasets and various target
shapes
</li>
<li>the number of possible examples is no longer frozen</li>
</ul>
</section>
<section id="introducing-data-morph">
<h2>Data Morph (2023)</h2>
<div class="center">
<img
src="media/Python_to_heart.gif"
alt="morphing the Python logo into a heart"
/>
</div>
<hr align="left" style="width: 33%; margin-bottom: 5px" />
<div class="footnotes">
<small class="footnote"
>The Python logo is a
<a
href="https://www.python.org/psf/trademarks/"
target="_blank"
rel="noopener noreferrer"
>trademark of the Python Software Foundation (PSF)</a
>, used with permission from the Foundation.</small
>
</div>
</section>
<section id="example-code">
<p>Here's the code to create that example:</p>
<pre>
<code data-trim class="language-shell">
$ python -m pip install data-morph-ai
$ data-morph --start-shape Python --target-shape heart
</code>
</pre>
</section>
<section id="behind-the-scenes">
<p>Here's what's going on behind the scenes:</p>
<pre>
<code data-trim class="language-python">
from data_morph.data.loader import DataLoader
from data_morph.morpher import DataMorpher
from data_morph.shapes.factory import ShapeFactory


dataset = DataLoader.load_dataset('Python')
target_shape = ShapeFactory(dataset).generate_shape('heart')

morpher = DataMorpher(decimals=2, in_notebook=False)
_ = morpher.morph(dataset, target_shape)
</code>
</pre>
</section>
<section id="how-it-works">
<h2 class="center">A high-level overview of how it works</h2>
</section>
<section id="select-a-starting-dataset">
<h3>1. Select a starting dataset</h3>
<pre>
<code data-trim class="language-python hide-line-numbers" data-line-numbers="1,6">
from data_morph.data.loader import DataLoader
from data_morph.morpher import DataMorpher
from data_morph.shapes.factory import ShapeFactory


dataset = DataLoader.load_dataset('Python')
target_shape = ShapeFactory(dataset).generate_shape('heart')

morpher = DataMorpher(decimals=2, in_notebook=False)
_ = morpher.morph(dataset, target_shape)
</code>
</pre>
</section>
<section id="bounds">
<h4>Automatically calculated bounds</h4>
<p>
Data Morph provides the <code>Dataset</code> class that wraps the
data (stored as a <code>pandas.DataFrame</code>) with information
about bounds for the data, the morphing process, and plotting. This
allows for the use of arbitrary datasets by providing a way to
calculate target shapes – no more hardcoded values.
</p>
<div class="center">
<img
src="media/bounds.png"
alt="automatically-calculated bounds"
width="400px"
/>
</div>
<hr align="left" style="width: 33%; margin-bottom: 5px" />
<div class="footnotes">
<small class="footnote"
>The Python logo is a
<a
href="https://www.python.org/psf/trademarks/"
target="_blank"
rel="noopener noreferrer"
>trademark of the Python Software Foundation (PSF)</a
>, used with permission from the Foundation.</small
>
</div>
</section>
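<section id="bounds-sketch">
<p>
One way such bounds could be derived from the data, sketched with
pandas. The helper <code>padded_bounds()</code> and its 10% padding
are assumptions for illustration; Data Morph's
<code>Dataset</code> class has its own logic, which is not shown
here:
</p>
<pre>
<code data-trim class="language-python">
import pandas as pd


def padded_bounds(df, padding=0.1):
    """Return {column: (low, high)}, widening each range by `padding` per side."""
    bounds = {}
    for column in ("x", "y"):
        low, high = df[column].min(), df[column].max()
        margin = (high - low) * padding
        bounds[column] = (low - margin, high + margin)
    return bounds


# Hypothetical data: the bounds follow from the values, nothing is hardcoded
points = pd.DataFrame({"x": [0.0, 5.0, 10.0], "y": [20.0, 30.0, 40.0]})
print(padded_bounds(points))
</code>
</pre>
</section>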
<section id="built-in-datasets">
<h4>Built-in datasets</h4>
<p>
To spark creativity, there are built-in datasets to inspire you:
</p>
<div style="text-align: center">
<img
src="media/available_datasets.png"
alt="built-in datasets"
width="450px"
/>
<br />
<small
>Note: Currently displaying what's available as of the v0.2.0
release. All logos are used with
<a
href="https://stefaniemolin.com/data-morph/stable/api/data_morph.data.loader.html#id1"
target="_blank"
rel="noopener noreferrer"
>permission</a
>.</small
>
</div>
</section>
<section id="generate-a-target-shape">
<h3>2. Generate a target shape based on the dataset</h3>
<pre>
<code data-trim class="language-python hide-line-numbers" data-line-numbers="3,7">
from data_morph.data.loader import DataLoader
from data_morph.morpher import DataMorpher
from data_morph.shapes.factory import ShapeFactory


dataset = DataLoader.load_dataset('Python')
target_shape = ShapeFactory(dataset).generate_shape('heart')

morpher = DataMorpher(decimals=2, in_notebook=False)
_ = morpher.morph(dataset, target_shape)
</code>
</pre>
</section>
<section id="scaling-and-translating-target-shapes">
<h4>Scaling and translating target shapes</h4>
<p>
Depending on the target shape, bounds and/or statistics from the
dataset are used to generate a custom target shape for the dataset
to morph into.
</p>
<div class="center">
<img
src="media/fitting_shapes.png"
alt="shapes are calculated based on input data"
width="80%"
/>
</div>
<hr align="left" style="width: 33%; margin-bottom: 5px" />
<div class="footnotes">
<small class="footnote"
>The Python logo is a
<a
href="https://www.python.org/psf/trademarks/"
target="_blank"
rel="noopener noreferrer"
>trademark of the Python Software Foundation (PSF)</a
>, used with permission from the Foundation.</small
>
</div>
</section>
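<section id="scaling-and-translating-target-shapes-sketch">
<p>
A rough sketch of fitting a shape to the data, assuming a circle
centered on the mean with a radius tied to the standard deviation.
The function <code>circle_for()</code> and the 1.5 multiplier are
hypothetical choices, not Data Morph's actual rule:
</p>
<pre>
<code data-trim class="language-python">
import numpy as np


def circle_for(points, scale=1.5):
    """Place a circle at the data's mean with a radius tied to its spread."""
    center_x, center_y = points.mean(axis=0)
    radius = points.std(axis=0).mean() * scale  # assumed sizing rule
    return center_x, center_y, radius


# The same rule yields a differently sized and placed circle per dataset
small = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
large = small * 10 + 50
print(circle_for(small))
print(circle_for(large))
</code>
</pre>
</section>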
<section id="built-in-target-shapes">
<h4>Built-in target shapes</h4>
<p>The following target shapes are currently available:</p>
<div class="center">
<img
src="media/available_shapes.png"
alt="built-in target shapes"
width="60%"
/>
<br />
<small
>Note: Currently displaying what's available as of the v0.2.0
release.</small
>
</div>
</section>
<section id="shape-class-hierarchy">
<h4>The <code>Shape</code> class hierarchy</h4>
<p>
In Data Morph, shapes are structured as a hierarchy of classes,
which must provide a <code>distance()</code> method. This makes them
interchangeable in the morphing logic.
</p>
<div class="center">
<img src="media/uml/shapes_uml.svg" alt="hierarchy of shapes" />
<br />
<small
>Note: The ... boxes represent classes omitted for space.</small
>
</div>
</section>
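<section id="shape-class-hierarchy-sketch">
<p>
The pattern can be sketched with an abstract base class. The classes
below are simplified stand-ins written for this example (the names
<code>Circle</code> and <code>Dot</code> and their constructors are
assumptions), not the classes in the diagram:
</p>
<pre>
<code data-trim class="language-python">
import math
from abc import ABC, abstractmethod


class Shape(ABC):
    """Anything that can report how far a point is from it."""

    @abstractmethod
    def distance(self, x, y):
        """Return the distance from the point (x, y) to the shape."""


class Circle(Shape):
    def __init__(self, cx, cy, radius):
        self.cx, self.cy, self.radius = cx, cy, radius

    def distance(self, x, y):
        # Distance to the circle's edge, not to its center
        return abs(math.hypot(x - self.cx, y - self.cy) - self.radius)


class Dot(Shape):
    def __init__(self, cx, cy):
        self.cx, self.cy = cx, cy

    def distance(self, x, y):
        return math.hypot(x - self.cx, y - self.cy)


# The caller relies only on distance(), so any Shape can be swapped in
for shape in (Circle(0, 0, 5), Dot(0, 0)):
    print(type(shape).__name__, shape.distance(3, 4))
</code>
</pre>
</section>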
<section id="morph">
<h3>3. Morph the dataset into the target shape</h3>
<pre>
<code data-trim class="language-python hide-line-numbers" data-line-numbers="2,9-10">
from data_morph.data.loader import DataLoader
from data_morph.morpher import DataMorpher
from data_morph.shapes.factory import ShapeFactory


dataset = DataLoader.load_dataset('Python')
target_shape = ShapeFactory(dataset).generate_shape('heart')

morpher = DataMorpher(decimals=2, in_notebook=False)
_ = morpher.morph(dataset, target_shape)
</code>
</pre>
</section>
<section id="simulated-annealing">
<h4>Simulated annealing</h4>
<p>
A point is selected at random (blue) and moved a small, random
amount to a new location (red), preserving summary statistics. This
part of the codebase comes from the Autodesk research and is mostly
unchanged:
</p>
<div class="center">
<img
src="media/simulated_annealing.gif"
alt="example point movement"
width="80%"
/>
</div>
<hr align="left" style="width: 33%; margin-bottom: 5px" />
<div class="footnotes">
<small class="footnote"
>The Python logo is a
<a
href="https://www.python.org/psf/trademarks/"
target="_blank"
rel="noopener noreferrer"
>trademark of the Python Software Foundation (PSF)</a
>, used with permission from the Foundation.</small
>
</div>
</section>
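<section id="simulated-annealing-sketch">
<p>
A simplified sketch of the statistics-preserving move, leaving out
the target shape entirely. It assumes that "preserving" means the
means, standard deviations, and Pearson correlation agree after
rounding to two decimals; the function names and the
<code>max_shift</code> value are illustrative, and this is not the
Autodesk or Data Morph code:
</p>
<pre>
<code data-trim class="language-python">
import numpy as np


def summary_stats(points, decimals=2):
    """Means, standard deviations, and Pearson correlation, rounded."""
    x, y = points[:, 0], points[:, 1]
    stats = [x.mean(), y.mean(), x.std(), y.std(), np.corrcoef(x, y)[0, 1]]
    return np.round(stats, decimals)


def try_move(points, target_stats, rng, max_shift=0.1):
    """Shift one random point; keep the move only if the stats still match."""
    candidate = points.copy()
    row = rng.integers(len(points))
    candidate[row] += rng.uniform(-max_shift, max_shift, size=2)
    if np.array_equal(summary_stats(candidate), target_stats):
        return candidate
    return points  # reject the move


rng = np.random.default_rng(0)
start = rng.normal(size=(100, 2))  # hypothetical starting dataset
target_stats = summary_stats(start)

moved = start
for _ in range(1_000):
    moved = try_move(moved, target_stats, rng)

# Points have moved, yet the rounded statistics are unchanged
print(np.array_equal(summary_stats(moved), target_stats))
</code>
</pre>
</section>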
<section id="avoiding-local-optima">
<h4>Avoiding local optima</h4>
<p>
Sometimes, the algorithm will move a point away from the target
shape, while still preserving summary statistics. This helps to
avoid getting stuck:
</p>
<div class="center">
<img
src="media/avoiding_local_optima.gif"
alt="example point movement"
width="80%"
/>
</div>
<hr align="left" style="width: 33%; margin-bottom: 5px" />
<div class="footnotes">
<small class="footnote"
>The Python logo is a
<a
href="https://www.python.org/psf/trademarks/"
target="_blank"
rel="noopener noreferrer"
>trademark of the Python Software Foundation (PSF)</a
>, used with permission from the Foundation.</small
>
</div>
</section>
<section id="temperature">
<p>
The likelihood of doing this decreases over time and is governed by
the <b>temperature</b> of the simulated annealing process:
</p>
<div class="center">
<img
src="media/temperature_over_time.png"
alt="temperature over time"
/>
<br />
<small
>The temperature falls to zero as we near the final iterations,
meaning we become stricter about moving toward the target shape
to finalize the output.</small
>
</div>
</section>
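<section id="temperature-sketch">
<p>
The acceptance rule can be sketched as follows. This assumes a
linear cooling schedule from 0.4 to 0 purely for illustration (the
curve pictured is not reproduced here), and the function names are
hypothetical:
</p>
<pre>
<code data-trim class="language-python">
import numpy as np


def temperature_at(iteration, total, max_temp=0.4, min_temp=0.0):
    """Cool linearly from max_temp to min_temp (an assumed schedule)."""
    progress = iteration / (total - 1)
    return max_temp + (min_temp - max_temp) * progress


def accept_move(old_distance, new_distance, temperature, rng):
    """Keep a move that gets closer to the target, or sometimes one that does not."""
    moved_closer = old_distance > new_distance
    lucky = temperature > rng.random()  # less likely as the temperature falls
    return moved_closer or lucky


rng = np.random.default_rng(42)
total = 1_000
for iteration in (0, 500, 999):
    temp = temperature_at(iteration, total)
    # Offer 10,000 moves that all end up farther from the target
    kept = sum(accept_move(1.0, 2.0, temp, rng) for _ in range(10_000))
    print(f"iteration {iteration}: temperature {temp:.2f}, kept {kept / 10_000:.0%}")
</code>
</pre>
</section>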
<section id="decreasing-point-movement-over-time">
<h4>Decreasing point movement over time</h4>
<p>
The maximum amount that a point can move at a given iteration
decreases over time for a better visual effect. This makes points
move faster when the morphing starts and slow down as we approach
the target shape:
</p>
<div class="center">
<img
src="media/Python_to_heart_forward_only.gif"
alt="morphing the Python logo into a heart"
/>
</div>
<hr align="left" style="width: 33%; margin-bottom: 5px" />
<div class="footnotes">
<div>
<small class="footnote"
>The Python logo is a
<a
href="https://www.python.org/psf/trademarks/"
target="_blank"
rel="noopener noreferrer"
>trademark of the Python Software Foundation (PSF)</a
>, used with permission from the Foundation.</small
>
</div>
<div style="margin-top: -35px">
<small class="footnote"
>Varying point movement over time is not part of the Autodesk
implementation.</small
>
</div>
</div>
<aside class="notes">
<p>
In simulated annealing, we are decreasing temperature over time,
so we can think of the earlier iterations as matter in a gaseous
state (the points are moving fast). As the temperature decreases,
we transition into a liquid and eventually a solid state, with the
point movement decreasing.
</p>
</aside>
</section>
<section id="point-movement-over-time">
<p>
Unlike temperature, we don't allow this value to fall to zero, since
we don't want to halt movement:
</p>
<div class="center">
<img
src="media/maximum_movement_over_time.png"
alt="easing movement over time"
/>
<br />
<small
>Maximum point movement decreases over time just as temperature
does.</small
>
</div>
</section>
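<section id="point-movement-over-time-sketch">
<p>
A sketch of easing the maximum movement toward a nonzero floor. The
linear easing, the start value of 1.0, and the floor of 0.3 are
placeholder assumptions, as is the name
<code>max_shift_at()</code>:
</p>
<pre>
<code data-trim class="language-python">
import numpy as np


def max_shift_at(iteration, total, start=1.0, floor=0.3):
    """Ease the largest allowed shift from `start` down to `floor`, never to zero."""
    progress = iteration / (total - 1)
    return start - (start - floor) * progress


rng = np.random.default_rng(1)
total = 100
for iteration in (0, 50, 99):
    limit = max_shift_at(iteration, total)
    dx, dy = rng.uniform(-limit, limit, size=2)  # shift for one point
    print(f"iteration {iteration}: shifts stay within +/-{limit:.2f}")
</code>
</pre>
</section>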
<section id="limitations-and-areas-for-future-work">
<h2 class="center">Limitations and areas for future work</h2>
</section>
<section id="bald-spots">
<h3>“Bald spots”</h3>
<p>
How do we encourage points to fill out the target shape and not just
clump together?
</p>
<div class="center">
<img src="media/bald_spots.png" alt="bald spots limitation" />
</div>
</section>
<section id="morphing-direction">
<h3>Morphing direction</h3>
<p>
Currently, we can only morph from dataset to shape (and shape to
dataset by playing the animation in reverse). I would like to
support dataset-to-dataset and shape-to-shape morphing, but there
are challenges to both:
</p>
<table style="font-size: 0.65em">
<thead>
<tr>
<th>Goal</th>
<th>Challenges</th>
</tr>
</thead>
<tbody>
<tr>
<td>shape→shape</td>
<td>
determining the initial sizing and possibly aligning scale
across the shapes, and solving the bald spot problem
</td>