Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TST (string dtype): clean-up xpasssing tests with future string dtype #59323

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Jul 26, 2024

Follow-up PR on #59320, doing some superficial clean-up of the tests (largely removing some xfails that were actually passing in the meantime)

xref #54792

@jorisvandenbossche jorisvandenbossche added Testing pandas testing functions or related to the test suite Strings String extension data type and string data labels Jul 26, 2024
@jorisvandenbossche jorisvandenbossche force-pushed the string-dtype-tests-initial-cleanup branch from b3b65a3 to dd174ef Compare July 26, 2024 18:17
@jorisvandenbossche jorisvandenbossche changed the title TST / string dtype: clean-up xpasssing tests with future string dtype TST (string dtype): clean-up xpasssing tests with future string dtype Jul 26, 2024
def test_unique_bad_unicode(index_or_series):
# regression test for #34550
uval = "\ud83d" # smiley emoji

obj = index_or_series([uval] * 2)
obj = index_or_series([uval] * 2, dtype=object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the solution here to add dtype=object? Shouldn't this just work naturally with the inferred string type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work with a string dtype because this test is about "bad unicode". And an actual string dtype cannot represent invalid unicode (at least when using pyarrow under the hood. I assume that our object-dtype based one will be able to hold it).

To keep the spirit of the test (ensure our unique implementation can work with bad unicode in object dtype), I made it explicitly used object dtype.

See also the "Invalid unicode input" section in #59328 (that issue I started yesterday to start record breaking changes / things that are no longer supported with the string dtype)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense - also makes me think how we can leverage a BinaryDtype in the future, though that is a different topic for a different day

@@ -360,7 +361,7 @@ def test_info_memory_usage():
df = DataFrame(data)
df.columns = dtypes

df_with_object_index = DataFrame({"a": [1]}, index=["foo"])
df_with_object_index = DataFrame({"a": [1]}, index=Index(["foo"], dtype=object))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment on all of these - might be overlooking something simple but unsure why dtype=object is the solution

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because what we are testing here is that if you have object dtype, the full memory usage is not known (and you get this "+"):

In [1]: df = DataFrame({"a": ["a", "b"]})

In [2]: df.info()
<class 'pandas.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       2 non-null      object
dtypes: object(1)
memory usage: 148.0+ bytes

In [3]: pd.options.future.infer_string = True

In [4]: df = DataFrame({"a": ["a", "b"]})

In [5]: df.info()
<class 'pandas.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       2 non-null      string
dtypes: string(1)
memory usage: 150.0 bytes

So 148.0+ bytes vs 150.0 bytes.

Of course I could also update the expected result to be accurate instead of an estimate, but that would 1) complicate the test (since we still have to account for both current and future behaviour), and 2) we still need to test the case of object dtype explicitly anyway.

We should probably add a test specifically for string dtype, though, where we can assert that if you have a proper string dtype the memory is now always the full number and not a lower estimate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool - yea would be nice to add a test for exact memory representation with the StringDtype + pyarrow, though can be done separately

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@jorisvandenbossche
Copy link
Member Author

The failures with python-dev / numpy-dev seem unrelated.

@jorisvandenbossche jorisvandenbossche merged commit 9b375be into pandas-dev:main Jul 27, 2024
40 of 46 checks passed
@jorisvandenbossche jorisvandenbossche deleted the string-dtype-tests-initial-cleanup branch July 27, 2024 15:14
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Aug 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported Strings String extension data type and string data Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants