Dataset with categorical features causes memory error even on tiny dataset. #1384
Labels: bug 🐛 (Something isn't working)
Comments
alexbarros: I was able to replicate even for pandas >= 2. Since you not only found the issue but also the solution, would you like to submit the PR with the fix @boris-kogan?
boris-kogan: Sure. I will create a PR.
boris-kogan added a commit to boris-kogan/ydata-profiling that referenced this issue (Aug 15, 2023): Fixing Bug Report ydataai#1384 Dataset with categorical features causes memory error even on tiny dataset.
alexbarros pushed a commit that referenced this issue (Aug 21, 2023): Fixing Bug Report #1384 Dataset with categorical features causes memory error even on tiny dataset.
fabclmnt changed the title from "Bug Report" to "Dataset with categorical features causes memory error even on tiny dataset." (Aug 24, 2023)
aquemy pushed a commit that referenced this issue (Oct 10, 2023): Fixing Bug Report #1384 Dataset with categorical features causes memory error even on tiny dataset.
aquemy pushed a commit that referenced this issue (Oct 10, 2023): Fixing Bug Report #1384 Dataset with categorical features causes memory error even on tiny dataset.
fabclmnt added a commit that referenced this issue (Dec 7, 2023): Update duplicates_pandas.py (#1427), fixing Bug Report #1384 Dataset with categorical features causes memory error even on tiny dataset (squash-merged together with unrelated docs, CI, and dependency updates).
fabclmnt added a commit that referenced this issue (Dec 7, 2023): Update duplicates_pandas.py (#1427), fixing Bug Report #1384 Dataset with categorical features causes memory error even on tiny dataset (squash-merged together with unrelated docs, CI, dependency, and PII documentation updates).
Current Behaviour
Dataset with categorical features causes memory error even on tiny dataset.
File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/profile_report.py", line 439, in _render_json
description = self.description_set
File "/usr/local/lib/python3.9/dist-packages/typeguard/init.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/profile_report.py", line 245, in description_set
self._description_set = describe_df(
File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/model/describe.py", line 151, in describe
metrics, duplicates = progress(get_duplicates, pbar, "Detecting duplicates")(
File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/utils/progress_bar.py", line 11, in inner
ret = fn(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/multimethod/init.py", line 315, in call
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/model/pandas/duplicates_pandas.py", line 37, in pandas_get_duplicates
df[duplicated_rows]
File "/usr/local/lib/python3.9/dist-packages/pandas/core/groupby/groupby.py", line 2411, in size
return self._reindex_output(result, fill_value=0)
File "/usr/local/lib/python3.9/dist-packages/pandas/core/groupby/groupby.py", line 4143, in _reindex_output
index, _ = MultiIndex.from_product(levels_list, names=names).sortlevel()
File "/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/multi.py", line 643, in from_product
codes = cartesian_product(codes)
File "/usr/local/lib/python3.9/dist-packages/pandas/core/reshape/util.py", line 60, in cartesian_product
return [
File "/usr/local/lib/python3.9/dist-packages/pandas/core/reshape/util.py", line 62, in
np.repeat(x, b[i]),
File "<array_function internals>", line 180, in repeat
File "/usr/local/lib/python3.9/dist-packages/numpy/core/fromnumeric.py", line 479, in repeat
return _wrapfunc(a, 'repeat', repeats, axis=axis)
File "/usr/local/lib/python3.9/dist-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
return bound(*args, **kwds)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 389. GiB for an array with shape (418190868480,) and data type int8
Expected Behaviour
Current ydata-profiling code:
duplicated_rows = (
    df[duplicated_rows]
    .groupby(supported_columns, dropna=False)
    .size()
    .reset_index(name=duplicates_key)
)
Should be:
duplicated_rows = (
    df[duplicated_rows]
    .groupby(supported_columns, dropna=False, observed=True)
    .size()
    .reset_index(name=duplicates_key)
)
See this pandas issue, which explains the solution:
https://github.com/pandas-dev/pandas/issues/30552
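For context, a minimal sketch of the failure mode (column names and sizes below are illustrative, not taken from this report): when grouping by multiple categorical columns with the default observed=False, pandas reindexes the result over the full cartesian product of all declared category levels, even if only a handful of combinations actually occur in the data; that cartesian product is the allocation that fails in the traceback above. Passing observed=True keeps only the observed combinations.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
levels = [f"s{i}" for i in range(1000)]  # 1,000 distinct category levels per column

# Tiny frame: 10 rows, 4 categorical columns, each column carrying 1,000 declared levels
df = pd.DataFrame(
    {f"c{j}": pd.Categorical(rng.choice(levels, size=10), categories=levels) for j in range(4)}
)

# observed=False (the default) would reindex size() over 1000**4 = 1e12 level
# combinations; observed=True keeps only the combinations present in df (at most 10 here).
sizes = df.groupby(list(df.columns), dropna=False, observed=True).size()
print(sizes.reset_index(name="count"))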
Data Description
A 10x10 pandas DataFrame of random strings, with each column's dtype specified as category.
Code that reproduces the bug
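A minimal reproduction sketch, reconstructed from the data description above (10x10 random strings typed as category); the helper random_string and the report title are illustrative assumptions rather than the original code:
import random
import string

import pandas as pd
from ydata_profiling import ProfileReport

def random_string(length: int = 8) -> str:
    # Illustrative helper: one random lowercase string
    return "".join(random.choices(string.ascii_lowercase, k=length))

# 10x10 DataFrame of random strings, with every column cast to the category dtype
df = pd.DataFrame(
    {f"col_{j}": [random_string() for _ in range(10)] for j in range(10)}
).astype("category")

# On the affected versions this kind of tiny categorical dataset triggered the
# MemoryError shown above during "Detecting duplicates"; with observed=True in
# duplicates_pandas.py the report completes normally.
report = ProfileReport(df, title="Categorical repro (illustrative)")
report.to_file("report.html")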
pandas-profiling version
v4.0.0
Dependencies
OS
Ubuntu
Checklist