Commit ce2bea8

DOCS-#2439: Add Documentation for Modin vs. pandas (#2487)
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
1 parent 24678d0 commit ce2bea8

4 files changed: +67 -2 lines changed

docs/comparisons/pandas.rst

Lines changed: 67 additions & 2 deletions
@@ -1,4 +1,69 @@
Modin vs. pandas
================

Modin exposes the pandas API through ``modin.pandas``, but it does not inherit the
pitfalls and design decisions that make pandas difficult to scale. This page discusses
how Modin's dataframe implementation differs from pandas, and how Modin scales pandas.

Scalability of implementation
-----------------------------

The pandas implementation is inherently single-threaded, so it can use only one of
your CPU cores at any given time. On a laptop, pandas looks something like this:

.. image:: /img/pandas_multicore.png
   :alt: pandas is single threaded!
   :align: center
   :scale: 80%

However, Modin's implementation enables you to use all of the cores on your machine, or
all of the cores in an entire cluster. On a laptop, it will look something like this:

.. image:: /img/modin_multicore.png
   :alt: modin uses all of the cores!
   :align: center
   :scale: 80%

The additional utilization leads to improved performance. If you scale out to an
entire cluster, Modin looks something like this:

.. image:: /img/modin_cluster.png
   :alt: modin works on a cluster too!
   :align: center
   :scale: 30%

Modin is able to efficiently make use of all of the hardware available to it!

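The idea of spreading work across cores can be illustrated with a small, conceptual
sketch. This is not Modin's implementation: the helpers ``split_into_partitions`` and
``partitioned_sum`` are hypothetical names, and only the Python standard library is
used. The data is split into partitions, each partition is reduced independently, and
the partial results are combined.

.. code-block:: python

    import os
    from concurrent.futures import ThreadPoolExecutor


    def split_into_partitions(values, n_partitions):
        """Split a list into at most n_partitions contiguous chunks."""
        chunk = max(1, -(-len(values) // n_partitions))  # ceiling division
        return [values[i:i + chunk] for i in range(0, len(values), chunk)]


    def partitioned_sum(values, n_partitions=None, executor_cls=ThreadPoolExecutor):
        """Sum values by reducing each partition independently, then combining."""
        n_partitions = n_partitions or os.cpu_count() or 1
        partitions = split_into_partitions(values, n_partitions)
        # Assumption: threads keep this sketch portable. CPU-bound work in CPython
        # generally needs processes or a distributed engine to use several cores.
        with executor_cls(max_workers=n_partitions) as pool:
            partial_sums = list(pool.map(sum, partitions))
        return sum(partial_sums)

Calling ``partitioned_sum(list(range(1000)), n_partitions=4)`` gives the same result
as ``sum(range(1000))``; only the way the work is divided changes.
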
Memory usage and immutability
-----------------------------

The pandas API contains many cases of "inplace" updates, which are known to be
controversial. This is due in part to the way pandas manages memory: the user may
think they are saving memory, but pandas usually copies the data whether or not the
operation is inplace.

Modin allows for inplace semantics, but unlike pandas, the underlying data structures
within Modin's implementation are immutable. This immutability lets Modin internally
chain operators and better manage memory layouts, because the underlying data will not
change. In many common cases this improves memory usage over pandas, because all
dataframes can share common memory blocks.

Modin provides inplace semantics by keeping a mutable pointer to the immutable
internal Modin dataframe. The pointer can change, but the underlying data cannot, so
when an inplace update is triggered, Modin treats it as if it were not inplace and
simply updates the pointer to the resulting Modin dataframe.

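The pointer-swap idea described above can be sketched in plain Python. This is an
illustration under stated assumptions, not Modin's code: ``ImmutableFrame``, ``Frame``
and ``fill_missing`` are hypothetical names, and columns are modeled as tuples.

.. code-block:: python

    from types import MappingProxyType


    class ImmutableFrame:
        """Hypothetical stand-in for an immutable internal dataframe."""

        def __init__(self, columns):
            # Keep each column as a tuple behind a read-only mapping.
            self._columns = MappingProxyType(
                {k: v if isinstance(v, tuple) else tuple(v) for k, v in columns.items()}
            )

        def column(self, name):
            return self._columns[name]

        def with_column(self, name, values):
            """Return a new frame; untouched columns are shared, not copied."""
            new_columns = dict(self._columns)
            new_columns[name] = tuple(values)
            return ImmutableFrame(new_columns)


    class Frame:
        """User-facing wrapper holding a mutable pointer to an ImmutableFrame."""

        def __init__(self, frame):
            self._frame = frame

        def fill_missing(self, name, value, inplace=False):
            filled = [value if x is None else x for x in self._frame.column(name)]
            new_frame = self._frame.with_column(name, filled)
            if inplace:
                self._frame = new_frame  # repoint only; the old data is untouched
                return None
            return Frame(new_frame)

With ``inplace=True`` the wrapper ends up pointing at a new ``ImmutableFrame``, the
previous one is left unchanged, and columns that were not modified are the same
objects in both.
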
API vs implementation
---------------------

It is well known that the pandas API contains many duplicate ways of performing the same
operation. Modin instead enforces that any one behavior has one and only one
implementation internally. This guarantee enables Modin to focus on and optimize a
smaller code footprint while still ensuring that it covers the entire pandas API.
Modin has an internal algebra of roughly 15 operators, narrowed down from the more
than 200 that exist in pandas. The algebra is grounded in both practical and
theoretical work. Learn more in our `VLDB 2020 paper`_. More information about this
algebra can be found in the :doc:`../developer/architecture` documentation.

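As a rough illustration of routing many API entry points to one internal
implementation, consider the sketch below. The class ``SketchFrame`` and its ``_map``
operator are hypothetical and are not taken from Modin's algebra; the public method
names mirror pandas aliases such as ``isna``/``isnull`` and ``mul``/``multiply``.

.. code-block:: python

    class SketchFrame:
        """Hypothetical frame where several API entry points share one operator."""

        def __init__(self, values):
            self._values = tuple(values)

        def _map(self, func):
            """The single internal operator: apply func to every element."""
            return SketchFrame(func(v) for v in self._values)

        def to_list(self):
            return list(self._values)

        def isna(self):
            return self._map(lambda v: v is None)

        isnull = isna  # alias: same behavior, same implementation

        def mul(self, other):
            return self._map(lambda v: v * other)

        multiply = mul  # alias
        __mul__ = mul   # the ``*`` operator takes the same path

Here ``SketchFrame([1, 2, 3]).multiply(2)`` and ``SketchFrame([1, 2, 3]) * 2`` both go
through ``_map``, so an optimization made to that one operator benefits every alias.
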
.. _VLDB 2020 paper: https://arxiv.org/abs/2001.00888

docs/img/modin_cluster.png (213 KB)
docs/img/modin_multicore.png (22.7 KB)
docs/img/pandas_multicore.png (22.3 KB)
