[#19431] Regression in make_block_same_class (tests failing for new fastparquet release) #19434
Thanks for working on this!
Can you add a test for the internal itself? You can take the small example I gave here (#19431 (comment)) and put it somewhere in tests/internals/test_internals.py (probably best in the TestBlock class).
pandas/tests/io/test_parquet.py
Outdated
         # warns on the coercion
         with catch_warnings(record=True):
-            check_round_trip(df, fp, expected=df.astype('datetime64[ns]'))
+            check_round_trip(df, fp)
You will need to do this dependent on the fastparquet version, as older versions still return non-tz data (or we could skip the test for older versions)
or probably better: check how it is done for Arrow (test_basic in TestParquetPyarrow); it adds a tz column to the tested dataframe depending on the version.
Codecov Report
@@ Coverage Diff @@
## master #19434 +/- ##
=========================================
Coverage ? 91.62%
=========================================
Files ? 150
Lines ? 48724
Branches ? 0
=========================================
Hits ? 44642
Misses ? 4082
Partials ? 0
pandas/core/internals.py
Outdated
@@ -206,7 +206,7 @@ def array_dtype(self):
         """
         return self.dtype

-    def make_block(self, values, placement=None, ndim=None):
+    def make_block(self, values, placement=None, ndim=None, dtype=None):
you can add this back, but it needs to be deprecated anyhow.
Then please answer in the original issue how to achieve the same result (see the code example I posted there, to create a new DatetimeTZ block from a numpy array).
But I don't see why this needs to be deprecated. The dtype argument was actually doing something, in contrast to fastpath (which we therefore deprecated).
fp should be using make_block and not make_block_same_class
pandas/tests/io/test_parquet.py
Outdated
        df = pd.DataFrame({'dt': pd.date_range('20130101', periods=3)})

        # fastparquet supports timezone since 0.1.4, and not before
        import fastparquet
you need to do this in a separate test. Have 1 for fp <= 0.1.3 and 1 for fp > 0.1.3 (essentially duplicate the test, with 1 working for the old version and 1 for the new)
need to add 1 build with fp pinned to 0.1.3, so pin it in requirements-2.7.sh
@minggli we can indeed keep a test for the old behaviour (so only run for older versions), but can you move the new behaviour to test_basic of this class (so we use the same pattern as in the pyarrow test)? That is what I meant with my previous comment.
> need to add 1 build with fp pinned to 0.1.3, so pin it in requirements-2.7.sh

@jreback We have one build for fastparquet 0.1.1 (that's the reason the tests were failing here in the original PR), so I would say that is enough?
for sure.
I am going to merge #19435 before this to fix the CI for now. You should remove the pins (all except the 2.7 one), and this should still pass (after fixes above).
pandas/tests/io/test_parquet.py
Outdated
@@ -483,10 +488,12 @@ def test_categorical(self, fp):
        check_round_trip(df, fp)

    def test_datetime_tz(self, fp):
        # doesn't preserve tz
        if LooseVersion(fastparquet.__version__) > LooseVersion('0.1.3'):
            pytest.skip("timezone not supported for older fp")
maybe rather "timezones supported in newer versions of fp"
pandas/tests/io/test_parquet.py
Outdated
@@ -448,6 +448,11 @@ class TestParquetFastParquet(Base):
    def test_basic(self, fp, df_full):
        df = df_full

        # additional supported types for fastparquet >= 0.1.4
pls follow the existing style. see how we have a fixture for pa_ge_070. you need to make similar for fp
this is the existing style (see how it is for pyarrow) and what I asked for, so please leave it like this (in this PR).
For the old test we can use a fixture like fp_lt_014, however.
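The version gate behind a fixture such as fp_lt_014 can be sketched without pytest. The helper names below (`parse_version`, `fp_lt_014_applies`) are hypothetical, and the tuple parser is a simplified stand-in for `LooseVersion` that only understands plain `major.minor.patch` strings.

```python
import re


def parse_version(text):
    """Parse a plain 'major.minor.patch' string into a tuple of ints.

    Simplified stand-in for LooseVersion (an assumption for this sketch);
    anything after the third numeric component is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("unsupported version string: %r" % text)
    return tuple(int(part) for part in match.groups())


def fp_lt_014_applies(installed_version):
    """Hypothetical gate: True when fastparquet is older than 0.1.4."""
    return parse_version(installed_version) < (0, 1, 4)


# Tuples compare numerically, so 0.1.10 sorts after 0.1.4,
# which a plain string comparison would get wrong.
print(fp_lt_014_applies("0.1.3"))   # older release: tz is not preserved
print(fp_lt_014_applies("0.1.10"))  # newer release: tz round-trips
```

In a pytest fixture, the body would call `pytest.skip(...)` when the gate does not apply, so every test requesting the fixture is skipped in one place instead of repeating the comparison inside each test.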
good point!
little confused. But I agree it's better to match existing style in test_basic to avoid unnecessary style variation.
this is a totally incorrect style ATM. So please see what you can fix. skipping INSIDE a test function is not correct, rather a fixture should be used. This must have slipped in.
Jeff, please leave this. I asked @minggli to do this, and it is the current style. You can do a follow-up PR if you want to change this in all places.
fix the 2.7 build (in requirements.sh) to use fastparquet=0.1.3
pandas/tests/io/test_parquet.py
Outdated
@pytest.fixture
def fp_ge_014():
I don't think you are already using this one?
correct. removed.
pandas/tests/io/test_parquet.py
Outdated
@@ -448,9 +466,11 @@ class TestParquetFastParquet(Base):
    def test_basic(self, fp, df_full):
make 2 tests here, one for < 0.1.4 and one for >= 0.1.4
@@ -482,14 +502,15 @@ def test_categorical(self, fp):
        df = pd.DataFrame({'a': pd.Categorical(list('abc'))})
        check_round_trip(df, fp)

    def test_datetime_tz(self, fp):
        # doesn't preserve tz
same need 2 tests (1 with each fixture)
pyarrow is using similar style. there are already two tests for datetime_tz, one < 0.1.4 one >.
add a whatsnew entry in other api changes for this. (say compat with fp >= 0.1.4)
     def make_block_scalar(self, values):
         """
         Create a ScalarBlock
         """
         return ScalarBlock(values)

-    def make_block_same_class(self, values, placement=None, ndim=None):
+    def make_block_same_class(self, values, placement=None, ndim=None,
+                              dtype=None):
deprecate dtype here (add as a FutureWarning)
done.
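For readers unfamiliar with the pattern, a minimal sketch of deprecating a keyword argument while still honoring it might look like the following. The function `make_block_like` is a hypothetical stand-in, not the pandas method; only the warning message is taken from this PR.

```python
import warnings


def make_block_like(values, dtype=None):
    """Hypothetical stand-in for a method whose ``dtype`` argument is deprecated."""
    if dtype is not None:
        # stacklevel=2 attributes the warning to the caller's line,
        # which is where the deprecated argument is actually passed.
        warnings.warn("dtype argument is deprecated, will be removed "
                      "in a future release.", FutureWarning, stacklevel=2)
    # The argument is still honored for now; this sketch simply returns it.
    return list(values), dtype


# Record warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = make_block_like([1, 2, 3], dtype="int64")

print(result)
print([w.category.__name__ for w in caught])
```

Callers that omit `dtype` see no warning, so only code relying on the deprecated argument is notified.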
pandas/tests/io/test_parquet.py
Outdated
@@ -448,9 +466,11 @@ class TestParquetFastParquet(Base):
     def test_basic(self, fp, df_full):
         df = df_full

-        # additional supported types for fastparquet
+        # additional supported types for fastparquet>=0.1.4
+        if LooseVersion(pyarrow.__version__) >= LooseVersion('0.1.4'):
this is wrong, you mean fastparquet, yes? This is why the fixtures are FAR preferred
Hi Jeff, I am fine with either; however, Joris suggested keeping it in line with the pyarrow style for now, as per #19434 (comment). I feel it's good to keep it consistent and follow up with another PR for styling for both PyArrow and FP.
@minggli only the 'pyarrow' -> 'fastparquet' on this line you need to change anyway (copy paste error)
fixed. 👍
I suppose it's ok. The problem is, this will be inconsistent with ALL the rest of pandas. You can make a simple change here to fix it. I would appreciate this being done here, or it is ok in a followup. We spend an enormous amount of time making things consistent. Adding inconsistent code is not great, even for expediency.
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -312,6 +312,7 @@ Other API Changes
 - :func:`DatetimeIndex.shift` and :func:`TimedeltaIndex.shift` will now raise ``NullFrequencyError`` (which subclasses ``ValueError``, which was raised in older versions) when the index object frequency is ``None`` (:issue:`19147`)
 - Addition and subtraction of ``NaN`` from a :class:`Series` with ``dtype='timedelta64[ns]'`` will raise a ``TypeError` instead of treating the ``NaN`` as ``NaT`` (:issue:`19274`)
 - Set operations (union, difference...) on :class:`IntervalIndex` with incompatible index types will now raise a ``TypeError`` rather than a ``ValueError`` (:issue:`19329`)
+- Compatibility in :func:`pandas.date_range` with fastparquet==0.1.4 which now supports timezone e.g. ``tz``='US/Eastern'`` (:issue:`19431`)
Did you mean read_parquet instead of date_range?
But, we didn't do anything for this, fastparquet 0.1.4 already works nicely with pandas 0.22, so I don't think this needs a whatsnew notice.
correct, I am not sure how this whatsnew notice proposed relates to API catalogue. Do you want me to remove this?
well, we are fixing something that we changed in the unreleased version of pandas, so users should not be aware of this. So I would remove this, but let's wait for what @jreback says
We should, however, fix the note in io.rst (the note in http://pandas-docs.github.io/pandas-docs-travis/io.html#parquet) about fastparquet not supporting timezones.
off to bed now. thanks guys. hope CI will work normally after this PR.
@jorisvandenbossche @jreback could you please give feedback on this PR?
Can you add a test for the make_block_same_class deprecation? (one that assures the warning is raised, and that it is still working correctly)
        if dtype is not None:
            # issue 19431 fastparquet is passing this
            warnings.warn("dtype argument is deprecated, will be removed "
                          "in a future release.", FutureWarning)
Let's make this a DeprecationWarning instead of FutureWarning, as it is typically only something developers need to see, not users.
post an issue on fastparquet to fix this as well
let’s leave this as a FutureWarning
it will encourage fp to fix this as it will be visible
We can use the issue I opened yesterday: dask/fastparquet#297
FutureWarning will just be annoying for users in this case, and I am confident fastparquet will change that
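The visibility difference behind this disagreement can be sketched with the standard `warnings` module. The filters installed below only approximate Python's default behaviour for code outside `__main__` (where `DeprecationWarning` is ignored); `visible_categories` is a hypothetical helper name.

```python
import warnings


def visible_categories():
    """Return the names of the warning categories that pass the filters."""
    with warnings.catch_warnings(record=True) as caught:
        # Assumption: approximate the default filters for library code.
        # Later simplefilter calls take precedence over earlier ones.
        warnings.simplefilter("always")
        warnings.simplefilter("ignore", DeprecationWarning)
        warnings.warn("dtype argument is deprecated", DeprecationWarning)
        warnings.warn("dtype argument is deprecated", FutureWarning)
    return [w.category.__name__ for w in caught]


print(visible_categories())
```

Under these filters only the `FutureWarning` is recorded, which is why it reaches end users while a `DeprecationWarning` is mostly seen by developers who enable it, for example in a test run.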
if it’s changed then we can simply move the bar on fp min version and this is no problem
we are not keeping private API around for external packages
this is way too much work
gents, which is it? FutureWarning or DeprecationWarning? it seems to me that DeprecationWarning is meant to be used here.
pls leave it as FutureWarning
ok
pandas/tests/io/test_parquet.py
Outdated
@@ -448,9 +457,11 @@ class TestParquetFastParquet(Base):
     def test_basic(self, fp, df_full):
         df = df_full

-        # additional supported types for fastparquet
+        # additional supported types for fastparquet>=0.1.4
you can leave out the ">=0.1.4" as the timedelta is not specific for the newer versions
pandas/core/internals.py
Outdated
@@ -216,20 +216,25 @@ def make_block(self, values, placement=None, ndim=None):
         if ndim is None:
             ndim = self.ndim

-        return make_block(values, placement=placement, ndim=ndim)
+        return make_block(values, placement=placement, ndim=ndim, dtype=dtype)
maybe remove dtype here? As fastparquet was not using it (and I don't think anybody else will)
Or otherwise add the same deprecation warning as you did for make_block_same_class
doc/source/io.rst
Outdated
@@ -4537,7 +4537,7 @@ See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and
 .. note::

    These engines are very similar and should read/write nearly identical parquet format files.
-   Currently ``pyarrow`` does not support timedelta data, and ``fastparquet`` does not support timezone aware datetimes (they are coerced to UTC).
+   Currently ``pyarrow`` does not support timedelta data, ``fastparquet`` now supports timezone aware datetimes.
this needs to change. You can't say 'now supports'; there is no context here. Simply remove the fp statement (or qualify it with fp >= 0.1.4)
done
        with pytest.warns(FutureWarning):
            copy = block.make_block_same_class(block.values,
                                               dtype=block.values.dtype)
        assert block.dtype == copy.dtype
you don't need these asserts, just running this is enough
done
@@ -285,6 +285,14 @@ def test_delete(self):
        with pytest.raises(Exception):
            newb.delete(3)

    def test_make_block_same_class(self):
        block = create_block('M8[ns, US/Eastern]', [3])
add the issue number as a comment
done
    def test_make_block_same_class(self):
        # issue 19431
        block = create_block('M8[ns, US/Eastern]', [3])
        with pytest.warns(FutureWarning):
use tm.assert_produces_warning instead. We catch this usage of pytest.warns in the linter, and it is not standard for the pandas codebase.
good point.
done
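As a rough illustration of what a warning-asserting context manager does, here is a standard-library sketch. `expect_warning` is a hypothetical, much simplified stand-in and not the implementation of `tm.assert_produces_warning`.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def expect_warning(category):
    """Fail with AssertionError unless the block emits a warning of ``category``."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, regardless of the surrounding filters.
        warnings.simplefilter("always")
        yield caught
    if not any(issubclass(w.category, category) for w in caught):
        raise AssertionError("%s was not raised" % category.__name__)


# Passes silently because the expected category is emitted inside the block.
with expect_warning(FutureWarning):
    warnings.warn("dtype argument is deprecated", FutureWarning)
```

A block that emits no matching warning makes the context manager raise `AssertionError` on exit, which is what turns a missing deprecation into a test failure.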
@minggli Thanks a lot! (and sorry for the extensive back and forth)
@jorisvandenbossche no problem. happy to help!
git diff upstream/master -u -- "*.py" | flake8 --diff
dtype still seems to be in use at: https://github.com/minggli/pandas/blob/2f4fc0790a5e5c51eb80bbfadf4c80f0bb424c56/pandas/core/internals.py#L2911