DOCS-#3904: Improving Modin README (#3929)
Co-authored-by: Mahesh Vashishtha <mvashishtha@users.noreply.github.com>
Co-authored-by: Doris Lee <dorisjunglinlee@gmail.com>
Signed-off-by: Naren Krishna <naren@ponder.io>
3 people authored Jan 25, 2022
1 parent 9d1a334 commit 5d84042
Showing 7 changed files with 139 additions and 137 deletions.
214 changes: 101 additions & 113 deletions README.md

Large diffs are not rendered by default.

46 changes: 30 additions & 16 deletions docs/getting_started/faq.rst
@@ -21,7 +21,7 @@ The :py:class:`~modin.pandas.dataframe.DataFrame` is a highly
scalable, parallel DataFrame. Modin transparently distributes the data and computation so
that you can continue using the same pandas API while being able to work with more data faster.
Modin lets you use all the CPU cores on your machine, and because it is lightweight, it
often has less memory overhead than pandas. See this :doc:`page </getting_started/pandas>` to
often has less memory overhead than pandas. See this :doc:`page </getting_started/why_modin/pandas>` to
learn more about how Modin is different from pandas.

Why not just improve pandas?
@@ -54,7 +54,14 @@ with dataframes that don't fit into the available memory. As a result, pandas wo
for prototyping on a few MBs of data do not scale to tens or hundreds of GBs (depending on the size
of your machine). Modin supports operating on data that does not fit in memory, so that you can comfortably
work with hundreds of GBs without worrying about substantial slowdown or memory errors. For more information,
see :doc:`out-of-memory support <getting_started/out_of_core.rst>` for Modin.
see :doc:`out-of-memory support </getting_started/why_modin/out_of_core>` for Modin.

How does Modin compare to Dask DataFrame and Koalas?
""""""""""""""""""""""""""""""""""""""""""""""""""""

TLDR: Modin has better coverage of the pandas API, a flexible backend, and better ordering semantics,
and it supports both row- and column-parallel operations.
Check out this :doc:`page </getting_started/why_modin/modin_vs_dask_vs_koalas>` detailing the differences!

How does Modin work under the hood?
"""""""""""""""""""""""""""""""""""
@@ -96,14 +103,12 @@ import with Modin import:
Which execution engine (Ray or Dask) should I use for Modin?
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Whichever one you want! Modin supports Ray_ and Dask_ execution engines to provide an effortless way
to speed up your pandas workflows. The best thing is that you don't need to know
anything about Ray and Dask in order to use Modin and Modin will automatically
detect which engine you have
installed and use that for scheduling computation. If you don't have a preference, we recommend
starting with Modin's default Ray engine. If you want to use a specific
compute engine, you can set the environment variable ``MODIN_ENGINE`` and
Modin will do computation with that engine:
Modin lets you effortlessly speed up your pandas workflows with either Ray_'s or Dask_'s execution engine.
You don't need to know anything about either engine in order to use it with Modin. If you have only one engine
installed, Modin automatically detects it and uses it for scheduling computation.
If you don't have a preference, we recommend starting with Modin's default Ray engine.
To use a specific compute engine, set the environment variable ``MODIN_ENGINE``
and Modin will do computation with that engine:

.. code-block:: bash
@@ -113,6 +118,15 @@ Modin will do computation with that engine:
pip install "modin[dask]" # Install Modin dependencies and Dask to run on Dask
export MODIN_ENGINE=dask # Modin will use Dask
This can also be done with:

.. code-block:: python

    from modin.config import Engine

    Engine.put("ray")  # Modin will use Ray
    Engine.put("dask")  # Modin will use Dask

Modin also has an experimental OmniSciDB-based engine, which you can read about :doc:`here </development/using_omnisci>`.
We plan to support more execution engines in the future. If you have a specific request,
please post on the #feature-requests channel on our Slack_ community.
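
The selection behavior described above (an explicit ``MODIN_ENGINE`` setting wins, otherwise an installed engine is detected, with Ray as the default) can be illustrated with a small standalone sketch. This is not Modin's actual implementation: ``choose_engine``, ``is_installed``, and ``ENGINE_MODULES`` are hypothetical names, and the module names and preference order are assumptions made for the example.

```python
import importlib.util
import os

# Hypothetical (engine name, importable module) pairs; Ray is tried first
# because the FAQ describes it as the default engine.
ENGINE_MODULES = (("ray", "ray"), ("dask", "dask"))


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def choose_engine(env=None, installed=is_installed):
    """Pick an engine name: an explicit MODIN_ENGINE wins, else the first installed."""
    env = os.environ if env is None else env
    requested = env.get("MODIN_ENGINE")
    if requested:
        return requested.lower()
    for engine, module in ENGINE_MODULES:
        if installed(module):
            return engine
    raise ImportError("No engine found; install one, e.g. pip install 'modin[ray]'")
```

Passing ``env`` and ``installed`` explicitly keeps the sketch easy to exercise without changing the real environment, for example ``choose_engine(env={}, installed=lambda m: m == "dask")``.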
@@ -158,16 +172,16 @@ How can I contribute to Modin?

**Modin is currently under active development. Requests and contributions are welcome!**

If you are interested in contributing please check out the :doc:`Getting Started</getting_started/index>`
guide then refer to the :doc:`Development Documentation</development/index>` section,
If you are interested in contributing please check out the :doc:`Contributing Guide</development/contributing>`
and then refer to the :doc:`Development Documentation</development/index>`,
where you can find system architecture, internal implementation details, and other useful information.
Also check out the `Github`_ to view open issues and make contributions.

.. _issue: https://github.com/modin-project/modin/issues
.. _Slack: https://modin.org/slack.html
.. _Slack: https://join.slack.com/t/modin-project/shared_invite/zt-yvk5hr3b-f08p_ulbuRWsAfg9rMY3uA
.. _Github: https://github.com/modin-project/modin
.. _Ray: https://github.com/ray-project/ray/
.. _Dask: https://dask.org/
.. _papers: https://arxiv.org/abs/2001.00888
.. _guide: https://modin.readthedocs.io/en/stable/installation.html?#installing-on-google-colab
.. _Dask: https://github.com/dask/dask
.. _papers: https://people.eecs.berkeley.edu/~totemtang/paper/Modin.pdf
.. _guide: https://modin.readthedocs.io/en/latest/getting_started/installation.html#installing-on-google-colab
.. _tutorial: https://github.com/modin-project/modin/tree/master/examples/tutorial
6 changes: 3 additions & 3 deletions docs/getting_started/installation.rst
@@ -24,7 +24,7 @@ To install the most recent stable release run the following:
pip install -U modin # -U for upgrade in case you have an older version
Modin can be used with :doc:`Ray</developer/using_pandas_on_ray>`, :doc:`Dask</developer/using_pandas_on_dask>`, or :doc:`OmniSci</developer/using_omnisci>` engines. If you don't have Ray_ or Dask_ installed, you will need to install Modin with one of the targets:
Modin can be used with :doc:`Ray</development/using_pandas_on_ray>`, :doc:`Dask</development/using_pandas_on_dask>`, or :doc:`OmniSci</development/using_omnisci>` engines. If you don't have Ray_ or Dask_ installed, you will need to install Modin with one of the targets:

.. code-block:: bash
@@ -147,8 +147,8 @@ that these changes have not made it into a release and may not be completely sta
Windows
-------

All Modin engines except :doc:`OmniSci</developer/using_omnisci>` are available both on Windows and Linux as mentioned above.
Default engine on Windows is :doc:`Ray</developer/using_pandas_on_ray>`.
All Modin engines except :doc:`OmniSci</development/using_omnisci>` are available both on Windows and Linux as mentioned above.
The default engine on Windows is :doc:`Ray</development/using_pandas_on_ray>`.
It is also possible to use Windows Subsystem For Linux (WSL_), but this is generally
not recommended due to the limitations and poor performance of Ray on WSL, which is roughly
2-3x worse than on native Windows.
2 changes: 1 addition & 1 deletion docs/getting_started/troubleshooting.rst
@@ -20,7 +20,7 @@ Please note, that while Modin covers a large portion of the pandas API, not all
UserWarning: `DataFrame.asfreq` defaulting to pandas implementation.
To understand which functions will lead to this warning, we have compiled a list of :doc:`currently supported methods </supported_apis/index>`. When you see this warning, Modin defaults to pandas by converting the Modin dataframe to pandas to perform the operation. Once the operation is complete in pandas, it is converted back to a Modin dataframe. These operations will have a high overhead due to the communication involved and will take longer than pandas. When this is happening, a warning will be given to the user to inform them that this operation will take longer than usual. You can learn more about this :doc:`here <supported_apis/defaulting_to_pandas>`.
To understand which functions will lead to this warning, we have compiled a list of :doc:`currently supported methods </supported_apis/index>`. When you see this warning, Modin defaults to pandas by converting the Modin dataframe to pandas to perform the operation. Once the operation is complete in pandas, it is converted back to a Modin dataframe. These operations will have a high overhead due to the communication involved and will take longer than pandas. When this is happening, a warning will be given to the user to inform them that this operation will take longer than usual. You can learn more about this :doc:`here </supported_apis/defaulting_to_pandas>`.
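
To find out which calls in a workflow fall back to pandas, one option is to record warnings with the standard ``warnings`` module and count those matching the message shown above. The following is a minimal sketch, assuming the fallback is reported as a ``UserWarning`` containing the text "defaulting to pandas"; ``count_default_to_pandas`` and ``fake_operation`` are hypothetical helpers, and ``fake_operation`` only simulates the warning so the example runs without Modin installed.

```python
import warnings

# Message text copied from the warning shown in this section.
MESSAGE = "`DataFrame.asfreq` defaulting to pandas implementation."


def count_default_to_pandas(func):
    """Run func and count the 'defaulting to pandas' UserWarnings it emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        func()
    return sum(
        issubclass(w.category, UserWarning)
        and "defaulting to pandas" in str(w.message)
        for w in caught
    )


def fake_operation():
    """Stand-in for a Modin call that falls back to pandas."""
    warnings.warn(MESSAGE, UserWarning)
```

In real use, ``func`` would wrap the Modin operations under investigation instead of ``fake_operation``.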

If you would like to request a particular method be implemented, feel free to `open an
issue`_. Before you open an issue please make sure that someone else has not already
4 changes: 2 additions & 2 deletions docs/getting_started/using_modin/using_modin_locally.rst
@@ -67,7 +67,7 @@ cluster for you:
Finally, if you already have a Ray or Dask engine initialized, Modin will
automatically attach to whichever engine is available. If you are interested in using
Modin with OmniSci engine, please refer to :doc:`these instructions </developer/using_omnisci>`. For additional information on other settings you can configure, see
Modin with the OmniSci engine, please refer to :doc:`these instructions </development/using_omnisci>`. For additional information on other settings you can configure, see
:doc:`this page </flow/modin/config>`.

Advanced: Configuring the resources Modin uses
@@ -116,4 +116,4 @@ specify more processors than you have available on your machine; however this wi
improve the performance (and might end up hurting the performance of the system).

.. note::
Make sure to update the ``MODIN_CPUS`` configuration and initialize your preferred engine before you start working with the first operation using Modin! Otherwise, Modin will opt for the default setting.
Make sure to update the ``MODIN_CPUS`` configuration and initialize your preferred engine before you start working with the first operation using Modin! Otherwise, Modin will opt for the default setting.
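
One way to follow the note above is to set the configuration through environment variables at the very top of a script, before Modin is imported or used. This is a minimal sketch: the values ``4`` and ``dask`` are arbitrary example choices, and the Modin import is left as a comment so the snippet runs even where Modin is not installed.

```python
import os

# Set Modin's configuration before the first Modin operation.
# The values below are example choices, not recommendations.
os.environ["MODIN_CPUS"] = "4"
os.environ["MODIN_ENGINE"] = "dask"

# Only after this point import and use Modin, for example:
# import modin.pandas as pd
```

Environment variables are always strings, so the CPU count is written as ``"4"`` rather than ``4``.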
2 changes: 1 addition & 1 deletion docs/getting_started/why_modin/modin_vs_dask_vs_koalas.rst
@@ -84,4 +84,4 @@ Performance Comparison
**Modin provides substantial speedups even on operators not supported by other systems.** Thanks to its flexible partitioning schemes, which enable it to support the vast majority of pandas operations (row-, column-, or cell-oriented), Modin provides benefits on operations such as ``join``, ``median``, and ``infer_types``. While Koalas performs ``join`` slower than pandas, Dask failed to support ``join`` on more than 20M rows, likely due to poor support for `shuffles <https://coiled.io/blog/better-shuffling-in-dask-a-proof-of-concept/>`_. Details of the benchmark and additional join experiments can be found in `our paper <https://people.eecs.berkeley.edu/~totemtang/paper/Modin.pdf>`_.

.. _documentation: http://docs.dask.org/en/latest/DataFrame.html#design.
.. _Modin's documentation: https://modin.readthedocs.io/en/latest/developer/architecture.html
.. _Modin's documentation: https://modin.readthedocs.io/en/latest/development/architecture.html
2 changes: 1 addition & 1 deletion docs/getting_started/why_modin/pandas.rst
@@ -66,6 +66,6 @@ smaller code footprint while still guaranteeing that it covers the entire pandas
Modin has an internal algebra, which is roughly 15 operators, narrowed down from the
original >200 that exist in pandas. The algebra is grounded in both practical and
theoretical work. Learn more in our `VLDB 2020 paper`_. More information about this
algebra can be found in the :doc:`../development/architecture` documentation.
algebra can be found in the :doc:`architecture </development/architecture>` documentation.

.. _VLDB 2020 paper: https://arxiv.org/abs/2001.00888
