Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CFTimeIndex Resampling #2593

Merged
merged 72 commits into from
Feb 3, 2019
Merged

CFTimeIndex Resampling #2593

merged 72 commits into from
Feb 3, 2019

Conversation

jwenfai
Copy link
Contributor

@jwenfai jwenfai commented Dec 5, 2018

I would appreciate some feedback on why this implementation for CFTimeIndex resampling doesn't match pandas' output 100%.

  • Tentative attempt at addressing Adding resample functionality to CFTimeIndex #2191 (resampling CFTimeIndex). Downsampling produces results that match pandas' results in all the tests done thus far. However, upsampling has trouble assigning the right values to the right bins.
  • Tests (test_cftimeindex_resample.py) created for standard calendars but not for non-standard calendars (360 days etc.). Contents from test_cftimeindex_resample.py will be inserted into test_dataarray.py and test_cftimeindex_resample.py will be deleted once resampling implementation is finalized. Files found in the tests/temp folder is meant to highlight resampling discrepancies between pandas and this implementaion. Both the files and the folder will be removed once resampling implementation is finalized.
  • Not fully documented. Docstrings have yet to be added to resample_cftime.py. There were no doctrings from https://github.com/pandas-dev/pandas/blob/master/pandas/core/resample.py (which is where the codes were ported from) that could be conveniently copied.

@pep8speaks
Copy link

pep8speaks commented Dec 5, 2018

Hello @jwenfai! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 02, 2019 at 20:46 Hours UTC

list(itertools.product(
['left', 'right'],
['left', 'right'],
['2MS', '2M', '3MS', '3M', '7MS', '7M'])))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brief note on the test approach - overall looks great - on these items, you can put each set of params in its own parametrize decorator, and avoid making the products yourself

(no need to change this time though)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment, I made the change, much better than the torturous code I was writing before!

Copy link
Member

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jwenfai I'm excited to see some progress on this! I may be a bit slow these next few days, but I'll try to provide some more extensive feedback soon.

xarray/tests/temp/cftime_resample_pandas_comparison.py Outdated Show resolved Hide resolved
Copy link
Member

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jwenfai thanks for your patience; I was away at a conference all of last week. Here are some initial comments on this PR, which is a very good start.

I'll probably need a few more passes on this, particularly to understand the issues with the upsampling portion.


def _get_time_bins(index, freq, closed, label, base):
# This portion of code comes from TimeGrouper __init__ #
end_types = {'M', 'A'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it might be cleaner to separate some of this logic into another method. For example define:

def _default_closed_or_label(freq):
    if freq._freq in {'M', 'A'}:
        return 'right'
    else:
        return 'left'

Then within _get_time_bins you could use something like this:

if closed is None:
    closed = _default_closed_or_label(freq)

if label is None:
    label = _default_closed_or_label(freq)


def _adjust_bin_edges(binner, ax_values, freq):
# Some hacks for > daily data, see #1471, #1458, #1483
if freq._freq not in ['D', 'H', 'T', 'min', 'S']:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth adding a CFTIME_TICKS variable in cftime_offsets.py that we could import and use for these instance checks.

# pandas defines these offsets as "Tick" objects, which for instance have 
# distinct behavior from monthly or longer frequencies in resample.
CFTIME_TICKS = (Day, Hour, Minute, Second)

Then here and in other methods this check could just be something like if not isinstance(freq, CFTIME_TICKS) or if isinstance(freq, CFTIME_TICKS).

return fresult, lresult


def _offset_timedelta(offset):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to add an as_timedelta method (where applicable) to the cftime offset objects. For example:

class Day(BaseCFTimeOffset):
    _freq = 'D'

    def as_timedelta(self):
        return timedelta(days=self.n)

    def __apply__(self, other):
        return other + self.as_timedelta()

That way we would not need this method to do the translation.

base = base % offset.n
start_day = normalize_date(first)
base_td = datetime.timedelta(0)
if offset._freq == 'D':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we defined an as_timedelta method for cftime offsets (as described below), we could instead replace this conditional block with one line:

base_td = type(offset)(n=base).as_timedelta()

return binner, labels


def _adjust_bin_edges(binner, ax_values, freq):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method does not appear to be used in this implementation. Should it be?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I also noticed I've been too slavishly copying pandas logic. This and some other unnecessary code have been/will be removed.

['left', 'right'],
['2MS', '2M', '3MS', '3M', '7MS', '7M'])))
def test_downsampler(closed, label, freq):
downsamp_series = series(pd_index()).resample(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like things might be more apples to apples in this test if we compared the results of resampling a DataArray indexed using a DatetimeIndex to the results of resampling a DataArray indexed using a CFTimeIndex (with a standard calendar type).

This would also allow us to make use of xarray's testing methods, like xarray.testing.assert_equal, which checks the equivalence of two DataArrays (including the equivalence of their coordinates, and NaN placement).

# from ..coding.cftimeindex import CFTimeIndex
import cftime as cf
import numpy as np
if isinstance(self._obj[self._dim].values[0], cf.datetime):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think something like if isinstance(self._obj.indexes[self._dim], CFTimeIndex) might be cleaner here.

import numpy as np
if isinstance(self._obj[self._dim].values[0], cf.datetime):
t = self._obj[self._dim]
x = np.insert([td.total_seconds() for td in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just make use of xarray.core.utils.datetime_to_numeric here, e.g.

x = datetime_to_numeric(t, datetime_unit='s')

def test_downsampler(closed, label, freq):
downsamp_series = series(pd_index()).resample(
freq, closed=closed, label=label).mean().dropna()
downsamp_da = da(xr_index()).resample(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytest fixtures are typically provided as arguments to the test functions, rather than called directly.


@pytest.fixture()
def xr_index():
return xr.cftime_range('2000-01-01', periods=30, freq='MS', tz='UTC')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cftime_range does not support a tz option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove that option but someone needs to fix this misleading doc

http://xarray.pydata.org/en/stable/generated/xarray.cftime_range.html
xarray.cftime_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None, calendar='standard')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that! Indeed that signature is misleading; see #2613 for a fix.

@jwenfai
Copy link
Contributor Author

jwenfai commented Dec 17, 2018

@spencerkclark Thanks for the detailed review! I'll fix up my code over the next few days.

I haven't completely solved the upsampling issue yet but I think I might have some clues as to what's happening. Timedelta operations on cftime.datetime does not always return correct values. Sometimes, they are a few microseconds or one second off.

The issue can be sidestepped by shifting the the bins 1 second forward for closed=='right' and 1 second back for closed=='left' in groupby.py, but this obviously introduces issues for resampling operations at the second and microsecond resolution. This workaround doesn't pass all the tests. An extra time bin is still sometimes created. You'll see what I mean when I make a new commit sometime next week.

… 2018-12-05 @max-sixty GitHub reviews for resample-v2-clean pull request). Not cleaned.
Get PEP8 changes from Ouranosinc.
…-v2-clean

# Conflicts:
#	xarray/tests/test_cftimeindex_resample.py
… 2018-12-05 @max-sixty GitHub reviews for resample-v2-clean pull request). Cleaned.
# Conflicts:
#	xarray/core/resample.py
#	xarray/tests/test_cftimeindex_resample.py
@spencerkclark
Copy link
Member

@jwenfai thanks for the updates. It looks like there are some merge conflicts that are preventing our CI from running. Could you please resolve those when you get chance, so we can see those results?

 Moved full_index and first_items generation logic to a helper function
@fmaussion
Copy link
Member

I can't diagnose what's wrong from the error message (something to do with conda it seems)

Some connection error. I restarted Travis, let's see if this happens again.

@shoyer
Copy link
Member

shoyer commented Feb 1, 2019

It looks like tests are passing now. I'm going to give this another look over and then (probably) merge

@@ -259,11 +259,8 @@ def __init__(self, obj, group, squeeze=False, grouper=None, bins=None,
# TODO: sort instead of raising an error
raise ValueError('index must be monotonic for resampling')
s = pd.Series(np.arange(index.size), index)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make this object s inside the helper function instead? It's not needed outside here

@@ -1,6 +1,8 @@
# -*- coding: utf-8 -*-
from textwrap import dedent

from textwrap import dedent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like changing from another PR have leaked in here? Let's try to figure that out...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to have accidentally crept in when I merged changes from pydata/master into my local repo 4~5 days back. Here's what I managed to trace (from latest to earliest instance of test_formatting.py being changed):
Ouranosinc#14
jwenfai@31ccebf
jwenfai@9fbb016

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it’s just a bad merge — these tests are now duplicated twice. You can simply delete the redundant code and push a new commit.

@@ -189,6 +189,53 @@ def test_attribute_repr(self):
assert '\n' not in newlines
assert '\t' not in tabs

def test_diff_dataset_repr(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still need to delete this repeated method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that, didn't know how I missed it.

xarray/tests/test_dataarray.py Outdated Show resolved Hide resolved
if isinstance(grouper, CFTimeGrouper):
first_items = grouper.first_items(index)
full_index = first_items.index
if first_items.isnull().any():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you merge in master against, you could switch this block back to using Series.dropna().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@shoyer
Copy link
Member

shoyer commented Feb 2, 2019

I'm going to merge this when tests pass

@spencerkclark
Copy link
Member

Sounds good @shoyer, thanks for bringing this to the finish line.

@jwenfai
Copy link
Contributor Author

jwenfai commented Feb 2, 2019

All tests passed. Thanks, @spencerkclark and @shoyer, for all the help!

@shoyer shoyer merged commit d8ff079 into pydata:master Feb 3, 2019
@shoyer
Copy link
Member

shoyer commented Feb 3, 2019

thanks @jwenfai and @spencerkclark !

dcherian pushed a commit to yohai/xarray that referenced this pull request Feb 4, 2019
* master:
  remove xfail from test_cross_engine_read_write_netcdf4 (pydata#2741)
  Reenable cross engine read write netCDF test (pydata#2739)
  remove bottleneck dev build from travis, this test env was failing to build (pydata#2736)
  CFTimeIndex Resampling (pydata#2593)
  add tests for handling of empty pandas objects in constructors (pydata#2735)
  dropna() for a Series indexed by a CFTimeIndex (pydata#2734)
  deprecate compat & encoding (pydata#2703)
  Implement integrate (pydata#2653)
  ENH: resample methods with tolerance (pydata#2716)
  improve error message for invalid encoding (pydata#2730)
  silence a couple of warnings (pydata#2727)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants