Improve memory use of read_csv #44

pschafhalter · 2018-07-17T19:48:18Z

read_csv falls back to the Pandas implementation in certain situations. This is expensive in terms of memory because the data is duplicated: we first create a Pandas dataframe using pandas.read_csv and then convert it to a Modin dataframe.

The following cases fall back to pandas.read_csv and should be made more memory efficient:

file does not exist on disk (e.g. located in S3).
filepath_or_buffer is not an instance of str, py.path.local or pathlib.Path
file is compressed (high priority)
as_recarray is True
chunksize is not None
skiprows is list-like or callable
nrows is not None
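
As a rough illustration of the list above, the conditions can be written as a single check. This is only a sketch, not the code in io.py: the function name `fallback_reasons`, the `COMPRESSED_SUFFIXES` tuple, and the reason strings are made up for this example, compression is guessed from the file extension, and py.path.local is left out so that only the standard library is needed.

```python
import os
import pathlib

# Hypothetical helper, not part of Modin: it only mirrors the list above.
COMPRESSED_SUFFIXES = (".gz", ".bz2", ".zip", ".xz")


def fallback_reasons(filepath_or_buffer, compression="infer", as_recarray=False,
                     chunksize=None, skiprows=None, nrows=None):
    """Return the listed conditions that a read_csv call would trigger."""
    reasons = []
    # py.path.local is omitted so the sketch has no third-party dependency.
    if not isinstance(filepath_or_buffer, (str, pathlib.Path)):
        reasons.append("filepath_or_buffer is not a path")
    else:
        path = str(filepath_or_buffer)
        if not os.path.exists(path):
            reasons.append("file does not exist on disk")
        # Assumption: with "infer", compression is guessed from the extension.
        if compression == "infer":
            compressed = path.lower().endswith(COMPRESSED_SUFFIXES)
        else:
            compressed = compression is not None
        if compressed:
            reasons.append("file is compressed")
    if as_recarray:
        reasons.append("as_recarray is True")
    if chunksize is not None:
        reasons.append("chunksize is not None")
    if callable(skiprows) or isinstance(skiprows, (list, tuple, set, range)):
        reasons.append("skiprows is list-like or callable")
    if nrows is not None:
        reasons.append("nrows is not None")
    return reasons


print(fallback_reasons("s3://bucket/data.csv", nrows=100))
# ['file does not exist on disk', 'nrows is not None']
```

An empty result would mean that none of the listed conditions apply to the call.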

Most changes need to be done in io.py.

Thanks to @Bidek56 for reporting!


pschafhalter · 2018-07-17T19:49:37Z

I've assigned myself to this issue for now, but contributions are welcome!

chvsp · 2018-08-06T13:23:57Z

Hi @pschafhalter, I am a newbie here but I love what pandas on ray is doing and want to contribute. I would like to take this issue up to get a better understanding of the internals. I have been going through the modin.pandas code for some time, and it seems that solving the remote file issue will be the easiest.

It would be helpful if you could give a few pointers to get me started. Thanks :)

devin-petersohn · 2018-08-06T17:56:55Z

Hi @chvsp! The best starter parameter from the list above is nrows. The first step would be to make sure you fully understand the behavior of the pandas implementation. This will be a great performance improvement, as nrows is a very commonly used parameter.
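
For reference, nrows in pandas counts data rows and does not include the header line. A minimal standard-library sketch of that idea, reading the header plus at most nrows records without loading the rest of the file, could look like the following. The helper name `read_first_rows` is hypothetical, and this is neither how pandas nor how Modin implements nrows.

```python
import csv
import io
from itertools import islice


def read_first_rows(buffer, nrows):
    """Read the header and at most `nrows` data rows from a CSV buffer."""
    reader = csv.reader(buffer)
    header = next(reader)               # the header is not counted in nrows
    rows = list(islice(reader, nrows))  # stop pulling records after nrows
    return header, rows


data = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
header, rows = read_first_rows(data, nrows=2)
print(header, rows)
# ['a', 'b'] [['1', '2'], ['3', '4']]
```

The sketch assumes the input has a header line; an empty buffer would raise StopIteration at `next(reader)`.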

When you feel you are ready to start implementing it, io.py is where this change will go. If you run into problems or questions along the way, feel free to email the dev group. Thanks again for looking into this!

devin-petersohn · 2020-06-01T14:35:26Z

Closing this. Feel free to reopen if the discussion should continue or if the issue was not resolved.




pschafhalter added enhancement good first issue 🔰 Good for newcomers Performance 🚀 Performance related issues and pull requests. labels Jul 17, 2018

pschafhalter self-assigned this Jul 17, 2018

devin-petersohn removed the enhancement label Sep 28, 2018

simon-mo removed the good first issue 🔰 Good for newcomers label Oct 10, 2018

devin-petersohn closed this as completed Jun 1, 2020

dchigarev pushed a commit to dchigarev/modin that referenced this issue Aug 25, 2020

Merge pull request modin-project#44 from intel-go/ienkovich/fillna-bool

3c7f79a

support fillna for bool columns

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this issue Feb 27, 2023

Fix str_cat/fullmatch/removeprefix/removesuffix/translate/wrap (modin…

2487278

…-project#44) Co-authored-by: Mahesh Vashishtha <mvashishtha@users.noreply.github.com>

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this issue Mar 16, 2023

Fix str_cat/fullmatch/removeprefix/removesuffix/translate/wrap (modin…

b09e026

…-project#44) Co-authored-by: Mahesh Vashishtha <mvashishtha@users.noreply.github.com>

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this issue Mar 16, 2023

Fix str_cat/fullmatch/removeprefix/removesuffix/translate/wrap (modin…

7be9038

…-project#44) Co-authored-by: Mahesh Vashishtha <mvashishtha@users.noreply.github.com>

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this issue Mar 16, 2023

Fix str_cat/fullmatch/removeprefix/removesuffix/translate/wrap (modin…

dd058c7

…-project#44) Co-authored-by: Mahesh Vashishtha <mvashishtha@users.noreply.github.com>

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this issue Mar 16, 2023

Fix str_cat/fullmatch/removeprefix/removesuffix/translate/wrap (modin…

ebc509d

…-project#44) Co-authored-by: Mahesh Vashishtha <mvashishtha@users.noreply.github.com>

RehanSD pushed a commit to RehanSD/modin that referenced this issue May 23, 2023

Revert "PERF: Remove columnarize [upstream] (modin-project#23)" (modi…

941a49b

…n-project#44) This reverts commit abf176c. It's no longer useful for performance.
