|
1 | | -Modin vs. Pandas |
| 1 | +Modin vs. pandas |
2 | 2 | ================ |
3 | 3 |
|
4 | | -Coming Soon... |
| 4 | +Modin exposes the pandas API through ``modin.pandas``, but it does not inherit the |
| 5 | +pitfalls and design decisions that make pandas difficult to scale. This page discusses |
| 6 | +how Modin's dataframe implementation differs from pandas, and how Modin scales pandas. |
| 7 | + |
| 8 | +Scalability of implementation |
| 9 | +----------------------------- |
| 10 | + |
| 11 | +The pandas implementation is inherently single-threaded. This means that only one of |
| 12 | +your CPU cores can be utilized at any given time. On a laptop, core utilization with |
| 13 | +pandas looks something like this: |
| 14 | + |
| 15 | +.. image:: /img/pandas_multicore.png |
| 16 | + :alt: pandas is single threaded! |
| 17 | + :align: center |
| 18 | + :scale: 80% |
| 19 | + |
| 20 | +However, Modin's implementation enables you to use all of the cores on your machine, or |
| 21 | +all of the cores in an entire cluster. On a laptop, it will look something like this: |
| 22 | + |
| 23 | +.. image:: /img/modin_multicore.png |
| 24 | + :alt: modin uses all of the cores! |
| 25 | + :align: center |
| 26 | + :scale: 80% |
| 27 | + |
| 28 | +The additional utilization leads to improved performance. If you scale out to an |
| 29 | +entire cluster, Modin looks something like this: |
| 30 | + |
| 31 | +.. image:: /img/modin_cluster.png |
| 32 | + :alt: modin works on a cluster too! |
| 33 | + :align: center |
| 34 | + :scale: 30% |
| 35 | + |
| 36 | +Modin is able to efficiently make use of all of the hardware available to it! |
| 37 | + |
| 38 | +Memory usage and immutability |
| 39 | +----------------------------- |
| 40 | + |
| 41 | +The pandas API contains many cases of "inplace" updates, which are known to be |
| 42 | +controversial. This is due in part to the way pandas manages memory: the user may |
| 43 | +think they are saving memory, but pandas usually copies the data whether or not an |
| 44 | +operation is inplace. |
| 45 | + |
| 46 | +Modin allows for inplace semantics, but unlike pandas, the underlying data structures |
| 47 | +within Modin's implementation are immutable. This immutability lets Modin chain |
| 48 | +operators internally and better manage memory layouts, because the underlying data will |
| 49 | +never change. In many common cases this reduces memory usage relative to pandas, because |
| 50 | +all dataframes can share common memory blocks. |
| 51 | + |
| 52 | +Modin provides the inplace semantics by having a mutable pointer to the immutable |
| 53 | +internal Modin dataframe. This pointer can change, but the underlying data cannot, so |
| 54 | +when an inplace update is triggered, Modin will treat it as if it were not inplace and |
| 55 | +just update the pointer to the resulting Modin dataframe. |
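
The pointer-swap idea can be illustrated with a small pure-Python sketch. This is not
Modin's actual code: ``FrozenFrame`` and ``ToyDataFrame`` are hypothetical names invented
for illustration, and the sketch only assumes that the internal frame never mutates while
the user-facing object may re-point to a new one.

```python
from types import MappingProxyType


class FrozenFrame:
    """Hypothetical stand-in for an immutable internal dataframe."""

    def __init__(self, columns):
        # Tuples inside a read-only mapping keep the stored data immutable.
        self._columns = MappingProxyType({k: tuple(v) for k, v in columns.items()})

    def fillna(self, value):
        # Every operation returns a new frame; the receiver is never modified.
        return FrozenFrame(
            {k: [value if x is None else x for x in v] for k, v in self._columns.items()}
        )

    def to_dict(self):
        return {k: list(v) for k, v in self._columns.items()}


class ToyDataFrame:
    """Hypothetical user-facing wrapper that holds a mutable pointer."""

    def __init__(self, columns):
        self._frame = FrozenFrame(columns)

    def fillna(self, value, inplace=False):
        new_frame = self._frame.fillna(value)
        if inplace:
            # "Inplace" only re-points the wrapper at the new immutable frame.
            self._frame = new_frame
            return None
        result = ToyDataFrame({})
        result._frame = new_frame
        return result

    def to_dict(self):
        return self._frame.to_dict()


df = ToyDataFrame({"a": [1, None, 3]})
before = df._frame              # keep a handle on the original internal frame
df.fillna(0, inplace=True)      # the wrapper now points at a different frame
```

After the call, ``df`` reflects the filled values while ``before`` still holds the original
data, which is why other objects that share the old frame are unaffected.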
| 56 | + |
| 57 | +API vs implementation |
| 58 | +--------------------- |
| 59 | + |
| 60 | +It is well known that the pandas API contains many duplicate ways of performing the same |
| 61 | +operation. Modin instead enforces that any one behavior have one and only one |
| 62 | +implementation internally. This guarantee enables Modin to focus on and optimize a |
| 63 | +smaller code footprint while still guaranteeing that it covers the entire pandas API. |
| 64 | +Modin has an internal algebra of roughly 15 operators, narrowed down from the more |
| 65 | +than 200 operators that exist in pandas. The algebra is grounded in both practical and |
| 66 | +theoretical work. Learn more in our `VLDB 2020 paper`_. More information about this |
| 67 | +algebra can be found in the :doc:`../developer/architecture` documentation. |
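
As a rough illustration of "many API entry points, one implementation", the sketch below
routes several user-facing methods through a single elementwise ``map`` operator. The
classes ``ToyCore`` and ``ToyFrame`` are hypothetical and are not Modin's internal classes
or its actual operator set; the only pandas facts assumed are that ``isnull`` is an alias
of ``isna`` and that both are elementwise.

```python
class ToyCore:
    """Hypothetical internal layer exposing one elementwise operator."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def map(self, func):
        # The single implementation that all elementwise behavior goes through.
        return ToyCore([[func(x) for x in row] for row in self.rows])


class ToyFrame:
    """Hypothetical API layer: several methods, one internal operator."""

    def __init__(self, rows):
        self._core = ToyCore(rows)

    @classmethod
    def _from_core(cls, core):
        # Wrap an existing core without copying it again.
        obj = cls.__new__(cls)
        obj._core = core
        return obj

    def isna(self):
        return self._from_core(self._core.map(lambda x: x is None))

    # Duplicate API name, same behavior, same single implementation.
    isnull = isna

    def applymap(self, func):
        return self._from_core(self._core.map(func))

    def to_rows(self):
        return [list(row) for row in self._core.rows]


frame = ToyFrame([[1, None], [None, 4]])
mask_a = frame.isna().to_rows()
mask_b = frame.isnull().to_rows()
```

Optimizing ``ToyCore.map`` once would benefit every method built on it, which is the
motivation the paragraph above gives for a small internal algebra.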
| 68 | + |
| 69 | +.. _VLDB 2020 paper: https://arxiv.org/abs/2001.00888 |