Dataset with categorical features causes memory error even on tiny dataset. #1384

boris-kogan · 2023-07-16T11:52:15Z

Current Behaviour

Profiling a dataset with categorical features raises a memory error, even when the dataset is tiny.

File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/profile_report.py", line 439, in _render_json
description = self.description_set
File "/usr/local/lib/python3.9/dist-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/profile_report.py", line 245, in description_set
self._description_set = describe_df(
File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/model/describe.py", line 151, in describe
metrics, duplicates = progress(get_duplicates, pbar, "Detecting duplicates")(
File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/utils/progress_bar.py", line 11, in inner
ret = fn(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/multimethod/__init__.py", line 315, in __call__
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/model/pandas/duplicates_pandas.py", line 37, in pandas_get_duplicates
df[duplicated_rows]
File "/usr/local/lib/python3.9/dist-packages/pandas/core/groupby/groupby.py", line 2411, in size
return self._reindex_output(result, fill_value=0)
File "/usr/local/lib/python3.9/dist-packages/pandas/core/groupby/groupby.py", line 4143, in _reindex_output
index, _ = MultiIndex.from_product(levels_list, names=names).sortlevel()
File "/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/multi.py", line 643, in from_product
codes = cartesian_product(codes)
File "/usr/local/lib/python3.9/dist-packages/pandas/core/reshape/util.py", line 60, in cartesian_product
return [
File "/usr/local/lib/python3.9/dist-packages/pandas/core/reshape/util.py", line 62, in <listcomp>
np.repeat(x, b[i]),
File "<__array_function__ internals>", line 180, in repeat
File "/usr/local/lib/python3.9/dist-packages/numpy/core/fromnumeric.py", line 479, in repeat
return _wrapfunc(a, 'repeat', repeats, axis=axis)
File "/usr/local/lib/python3.9/dist-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
return bound(*args, **kwds)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 389. GiB for an array with shape (418190868480,) and data type int8
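
For reference, the size in the last line follows from the array shape alone, since each int8 element occupies one byte. A quick check of that arithmetic (not part of the original report):

```python
# Shape reported in the error message above.
n_elements = 418_190_868_480

# int8 uses exactly one byte per element.
bytes_per_element = 1

# Convert bytes to GiB (2**30 bytes per GiB).
size_gib = n_elements * bytes_per_element / 2**30
print(f"{size_gib:.2f} GiB")
```

The result lands between 389 and 390 GiB, which matches the "389. GiB" that numpy reports.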

Expected Behaviour

Current ydata-profile code:
duplicated_rows = (
    df[duplicated_rows]
    .groupby(supported_columns, dropna=False)
    .size()
    .reset_index(name=duplicates_key)
)

Should be:
duplicated_rows = (
    df[duplicated_rows]
    .groupby(supported_columns, dropna=False, observed=True)
    .size()
    .reset_index(name=duplicates_key)
)

This pandas issue explains why observed=True is the solution:
https://github.com/pandas-dev/pandas/issues/30552
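
The effect of observed=True can be seen in isolation with plain pandas. The following is a minimal sketch, not the ydata-profiling code itself; the column names "a" and "b" and their values are made up for illustration. With observed=False, grouping by categorical columns yields one row per combination of declared categories, whether or not that combination occurs in the data.

```python
import pandas as pd

# Two categorical columns with 3 declared categories each (illustrative data).
df = pd.DataFrame(
    {
        "a": pd.Categorical(["x", "y", "x"], categories=["x", "y", "z"]),
        "b": pd.Categorical(["p", "q", "p"], categories=["p", "q", "r"]),
    }
)

# observed=False: the result is reindexed to the full cartesian product
# of categories, so 3 * 3 = 9 groups, most of them with size 0.
all_combos = df.groupby(["a", "b"], dropna=False, observed=False).size()

# observed=True: only combinations that actually appear in the rows are kept.
seen_only = df.groupby(["a", "b"], dropna=False, observed=True).size()

print(len(all_combos))  # 9
print(len(seen_only))   # 2
```

The number of groups in the first case is the product of the category counts of all grouping columns, so it grows multiplicatively with every categorical column added. That is how a tiny frame can request hundreds of GiB.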

Data Description

A 10x10 pandas DataFrame of random strings, with the dtype of every column set to category.

Code that reproduces the bug

import pandas as pd
from ydata_profiling import ProfileReport
df = pd.DataFrame(data=pd.util.testing.rands_array(10, size=(10, 10)), dtype="category")

report = ProfileReport(df, title="Profiling Report")

display(report)
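
pd.util.testing is deprecated and may be unavailable in newer pandas releases. A sketch that builds comparable data with the standard library only, and estimates how many groups an unobserved groupby over all columns would try to materialize, could look like the following. The helper random_string and the column names are hypothetical and not part of the original report.

```python
import math
import random
import string

import pandas as pd

random.seed(0)  # fixed seed so the frame is reproducible


def random_string(length: int = 10) -> str:
    """Hypothetical helper: one random ASCII-letter string."""
    return "".join(random.choices(string.ascii_letters, k=length))


# 10 rows x 10 columns of random strings, then cast every column to category.
data = {f"col{i}": [random_string() for _ in range(10)] for i in range(10)}
df = pd.DataFrame(data).astype("category")

# Each column has at most 10 categories; the unobserved groupby result has
# one entry per combination of categories, i.e. the product of these counts.
n_categories = [len(df[col].cat.categories) for col in df.columns]
n_combinations = math.prod(n_categories)

print(n_categories)
print(n_combinations)
```

With 10 columns of up to 10 categories each, the product can reach 10**10 combinations, even though the frame holds only 100 cells.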

pandas-profiling version

v4.0.0

Dependencies

pandas==1.3.4

OS

ubuntu

Checklist

There is not yet another bug report for this issue in the issue tracker
The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
The issue has not been resolved by the entries listed under Common Issues.

alexbarros · 2023-08-08T19:43:07Z

I was able to replicate even for pandas >= 2. Since you not only found the issue but also the solution, would you like to submit the PR with the fix @boris-kogan?

boris-kogan · 2023-08-10T08:29:51Z

Sure, I will create a PR.

azory-ydata added the needs-triage label Jul 16, 2023

alexbarros added bug 🐛 Something isn't working and removed needs-triage labels Aug 8, 2023

boris-kogan added a commit to boris-kogan/ydata-profiling that referenced this issue Aug 15, 2023

Update duplicates_pandas.py

f0a9840

Fixing Bug Report ydataai#1384 Dataset with categorical features causes memory error even on tiny dataset.

boris-kogan mentioned this issue Aug 15, 2023

fix: update duplicates_pandas.py #1427

Merged

alexbarros pushed a commit that referenced this issue Aug 21, 2023

Update duplicates_pandas.py (#1427)

07d5819

Fixing Bug Report #1384 Dataset with categorical features causes memory error even on tiny dataset.

fabclmnt added this to YData-profiling roadmap Aug 24, 2023

fabclmnt moved this to In Progress in YData-profiling roadmap Aug 24, 2023

fabclmnt moved this from In Progress to Approval in YData-profiling roadmap Aug 24, 2023

fabclmnt changed the title ~~Bug Report~~ Dataset with categorical features causes memory error even on tiny dataset. Aug 24, 2023

aquemy pushed a commit that referenced this issue Oct 10, 2023

Update duplicates_pandas.py (#1427)

d015ead

Fixing Bug Report #1384 Dataset with categorical features causes memory error even on tiny dataset.

aquemy pushed a commit that referenced this issue Oct 10, 2023

fix: update duplicates_pandas.py (#1427)

56a6641

Fixing Bug Report #1384 Dataset with categorical features causes memory error even on tiny dataset.
