Dataset with categorical features causes memory error even on tiny dataset. #1384

Open
3 tasks done
boris-kogan opened this issue Jul 16, 2023 · 2 comments
Labels
bug 🐛 Something isn't working

Comments

@boris-kogan

Current Behaviour

Dataset with categorical features causes memory error even on tiny dataset.

  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/profile_report.py", line 439, in _render_json
    description = self.description_set
  File "/usr/local/lib/python3.9/dist-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/profile_report.py", line 245, in description_set
    self._description_set = describe_df(
  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/model/describe.py", line 151, in describe
    metrics, duplicates = progress(get_duplicates, pbar, "Detecting duplicates")(
  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/utils/progress_bar.py", line 11, in inner
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/model/pandas/duplicates_pandas.py", line 37, in pandas_get_duplicates
    df[duplicated_rows]
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/groupby/groupby.py", line 2411, in size
    return self._reindex_output(result, fill_value=0)
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/groupby/groupby.py", line 4143, in _reindex_output
    index, _ = MultiIndex.from_product(levels_list, names=names).sortlevel()
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/multi.py", line 643, in from_product
    codes = cartesian_product(codes)
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/reshape/util.py", line 60, in cartesian_product
    return [
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/reshape/util.py", line 62, in <listcomp>
    np.repeat(x, b[i]),
  File "<__array_function__ internals>", line 180, in repeat
  File "/usr/local/lib/python3.9/dist-packages/numpy/core/fromnumeric.py", line 479, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
  File "/usr/local/lib/python3.9/dist-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return bound(*args, **kwds)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 389. GiB for an array with shape (418190868480,) and data type int8
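
As a quick sanity check on the numbers in that last line, the requested allocation can be recomputed from the reported shape and dtype. This is only a small arithmetic sketch; the variable names are made up for illustration.

```python
# Element count and dtype are copied from the error message above.
n_elements = 418_190_868_480
bytes_per_element = 1  # int8 takes one byte per element

# NumPy reports sizes in binary units (1 GiB = 2**30 bytes).
gib = n_elements * bytes_per_element / 2**30
print(f"{gib:.2f} GiB")  # 389.47 GiB
```

The frames `cartesian_product` and `MultiIndex.from_product` in the traceback suggest the element count is the number of category combinations pandas is trying to enumerate, not the number of rows in the DataFrame.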

Expected Behaviour

Current ydata-profiling code:
duplicated_rows = (
    df[duplicated_rows]
    .groupby(supported_columns, dropna=False)
    .size()
    .reset_index(name=duplicates_key)
)

Should be:
duplicated_rows = (
    df[duplicated_rows]
    .groupby(supported_columns, dropna=False, observed=True)
    .size()
    .reset_index(name=duplicates_key)
)

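See this pandas issue, which explains the solution. With categorical grouping columns and observed=False, groupby builds an entry for every combination of categories, including combinations that never occur in the data: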
https://github.com/pandas-dev/pandas/issues/30552
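
The effect of the `observed` flag can be seen on a toy frame without ydata-profiling at all. The snippet below is a minimal sketch: the column names and values are invented for illustration, and only the `groupby(...).size()` call mirrors the code quoted above.

```python
import pandas as pd

# Two categorical columns with 3 categories each; only 3 combinations occur.
df = pd.DataFrame(
    {
        "a": pd.Categorical(["x", "y", "z"]),
        "b": pd.Categorical(["p", "q", "r"]),
    }
)

# observed=False: one entry per combination of categories (3 * 3),
# with a count of 0 for combinations that never appear.
full = df.groupby(["a", "b"], dropna=False, observed=False).size()

# observed=True: only combinations actually present in the data.
seen = df.groupby(["a", "b"], dropna=False, observed=True).size()

print(len(full))  # 9
print(len(seen))  # 3
```

With two columns the difference is harmless, but the size of the `observed=False` result is multiplied by the category count of every additional grouping column.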

Data Description

A 10x10 pandas DataFrame of random strings, with every column cast to the category dtype.

Code that reproduces the bug

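import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport

# 10x10 random 10-character strings, every column categorical.
# (The original report used pd.util.testing.rands_array, which recent pandas releases no longer ship.)
letters = list("abcdefghijklmnopqrstuvwxyz")
data = np.random.default_rng(0).choice(letters, size=(10, 10, 10))
df = pd.DataFrame([["".join(w) for w in row] for row in data]).astype("category")

report = ProfileReport(df, title="Profiling Report")

display(report)  # run in a Jupyter notebook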
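
To see why such a small frame is enough to trigger the error, it may help to count the category combinations directly. The sketch below uses a deterministic stand-in for the random data (the `r{i}c{j}` strings are invented so that every column has 10 distinct values), and it assumes the duplicated-row mask is computed with something like `df.duplicated(keep=False)`; neither detail is taken from the ydata-profiling source.

```python
import math

import pandas as pd

# Stand-in for the 10x10 random frame: 10 distinct strings per column.
df = pd.DataFrame(
    {f"c{j}": [f"r{i}c{j}" for i in range(10)] for j in range(10)}
).astype("category")

# Selecting only duplicated rows leaves an empty frame here, but boolean
# indexing keeps the full category list on every column.
dupes = df[df.duplicated(keep=False)]
categories_kept = len(dupes["c0"].cat.categories)

# Number of group keys enumerated when observed=False:
# the product of the per-column category counts.
cartesian_rows = math.prod(len(dupes[col].cat.categories) for col in dupes.columns)

print(len(dupes))        # 0
print(categories_kept)   # 10
print(cartesian_rows)    # 10000000000
```

So even with no duplicated rows at all, an `observed=False` groupby over these ten columns would have to lay out 10^10 keys, which is why passing `observed=True` avoids the allocation.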

pandas-profiling version

v4.0.0

Dependencies

pandas==1.3.4

OS

ubuntu

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.
alexbarros added the bug 🐛 (Something isn't working) label and removed the needs-triage label on Aug 8, 2023
@alexbarros

I was able to replicate this even with pandas >= 2. Since you found not only the issue but also the solution, would you like to submit a PR with the fix, @boris-kogan?

@boris-kogan

boris-kogan commented Aug 10, 2023 via email

boris-kogan added a commit to boris-kogan/ydata-profiling that referenced this issue Aug 15, 2023
Fixing Bug Report ydataai#1384
Dataset with categorical features causes memory error even on tiny dataset.
alexbarros pushed a commit that referenced this issue Aug 21, 2023
Fixing Bug Report #1384
Dataset with categorical features causes memory error even on tiny dataset.
fabclmnt moved this to In Progress in YData-profiling roadmap Aug 24, 2023
fabclmnt moved this from In Progress to Approval in YData-profiling roadmap Aug 24, 2023
fabclmnt changed the title from "Bug Report" to "Dataset with categorical features causes memory error even on tiny dataset." Aug 24, 2023
aquemy pushed a commit that referenced this issue Oct 10, 2023
Fixing Bug Report #1384
Dataset with categorical features causes memory error even on tiny dataset.
fabclmnt added a commit that referenced this issue Dec 7, 2023
* Update duplicates_pandas.py (#1427)

Fixing Bug Report #1384
Dataset with categorical features causes memory error even on tiny dataset.

* chore(actions): update sonarsource/sonarqube-scan-action action to v2.0.1

* chore(actions): update actions/checkout action to v4

* docs: setup new docs with mkdocs (#1418)

* chore(actions): update actions/checkout action to v4

* fix: remove the duplicated cardinality threshold under categorical and text settings

* fix: fixate matplotlib upper version

* docs: change from `zap` to `sparkles` (#1447)

Co-authored-by: Fabiana <30911746+fabclmnt@users.noreply.github.com>

* fix: template {{ file_name }} error in HTML wrapper (#1380)

* Update javascript.html

* Update style.html

* feat: add density histogram (#1458)

* feat: add histogram density option

* test: add unit test

* fix: discard weights if exceed max_bins

* docs: update README.html (#1461)

Update url of use cases, main integrations, and common issues.

* fix: bug when creating a new report (#1440)

* fix: gen wordcloud only for non-empty cols (#1459)

* fix: table template ignoring text format (#1462)

* fix: table template ignoring text format

* fix: timeseries unit test

* fix(linting): code formatting

---------

Co-authored-by: Azory YData Bot <azory@ydata.ai>

* fix: to_category misshandling pd.NA (#1464)

* docs: add 📊 for Key features (#1451)

See also #1445 (comment)

* docs: fix hyperlink - related to package name change (#1457)

Co-authored-by: Martin Mokry <martin-kokos@users.noreply.github.com>

* chore(deps): increase numpy upper limit (#1467)

* chore(deps): increase numpy upper limit

* chore(deps): fixate numpy version for spark

* chore(deps): fix numba package version, and filter warns (#1468)

* chore: fix numba package version, and filter warns

* fix: skip isort linter on init

* chore(deps): update dependency typeguard to v4 (#1324)

* chore(deps): update dependency typeguard to v4

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Maciej Bukczynski <maciej@darkhorseanalytics.com>

* docs: update docs with advent of code

* docs: update links for fabric

---------

Co-authored-by: boris-kogan <139680785+boris-kogan@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Vasco Ramos <vasco.ramos@ydata.ai>
Co-authored-by: ricardodcpereira <ricardo.pereira@ydata.ai>
Co-authored-by: Anselm Hahn <Anselm.Hahn@gmail.com>
Co-authored-by: Joge <87136119+jogecodes@users.noreply.github.com>
Co-authored-by: Alex Barros <alexbarros@users.noreply.github.com>
Co-authored-by: Miriam Seoane Santos <68821478+miriamspsantos@users.noreply.github.com>
Co-authored-by: Chris Mahoney <44449504+chrimaho@users.noreply.github.com>
Co-authored-by: Azory YData Bot <azory@ydata.ai>
Co-authored-by: martin-kokos <4807476+martin-kokos@users.noreply.github.com>
Co-authored-by: Martin Mokry <martin-kokos@users.noreply.github.com>
Co-authored-by: Maciej Bukczynski <maciej@darkhorseanalytics.com>
Co-authored-by: Fabiana Clemente <fabianaclemente@Fabianas-MacBook-Air.local>