-
Notifications
You must be signed in to change notification settings - Fork 62
Series combine #821
base: master
Are you sure you want to change the base?
Series combine #821
Conversation
|
||
len_val = max(len(self), len(other)) | ||
result = numpy.empty(len_val, self._data.dtype) | ||
for ind in range(len_val): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we parallel the method based on chunks?
if fill_value is None: | ||
fill_value = numpy.nan | ||
|
||
len_val = max(len(self), len(other)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And what if all indexes are different? I think we should use sdc_join_series_indexes to find len of result series
|
||
len_val = max(len(self), len(other)) | ||
result = numpy.empty(len_val, self._data.dtype) | ||
for ind in range(len_val): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is case for non-indexes series. Also, it should rewrite with prange
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+
usage of chunks to predict scalability
examples/series/series_combine.py
Outdated
|
||
|
||
@njit | ||
def series_copy(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong name function
if fill_value is None: | ||
fill_value = numpy.nan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will make fill_value type undefined at compile time. You can probably use the same approach as in operators:
sdc/sdc/sdc_function_templates.py
Line 144 in e87095a
_fill_value = numpy.nan if fill_value_is_none == True else fill_value # noqa |
fill_value = numpy.nan | ||
|
||
len_val = max(len(self), len(other)) | ||
result = numpy.empty(len_val, self._data.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually wrong, result dtype should be common dtype for result dtype of func(a, b) where a,b are series values and dtype of _fill_value. Provided tests do not cover this, but e.g. this (where fill_value is float and series are integers) won't pass:
def test_series_combine_integer_new(self):
def test_impl(S1, S2):
return S1.combine(S2, lambda a, b: 2 * a + b, 16.2)
hpat_func = self.jit(test_impl)
S1 = pd.Series([1, 2, 3, 4, 5])
S2 = pd.Series([6, 21, 3, 5])
result = hpat_func(S1, S2)
result_ref = test_impl(S1, S2)
print(f"DEBUG: result:\n{result},\nresult_ref:\n{result_ref}")
pd.testing.assert_series_equal(result, result_ref)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add deducing result dtype and add parallelization.
sdc/tests/test_series.py
Outdated
pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2)) | ||
|
||
@skip_numba_jit | ||
@unittest.expectedFailure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comment why the test is skipped.
@unittest.expectedFailure # ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rubtsowa No need to skip the test if that's how impl is intended to work. Use check_dtype=False in assert_series_equal and add a comment just before this check to refer to SDC Limitation.
if self_indexes[j] == -1: | ||
val_self = _fill_value | ||
else: | ||
ind_self = self_indexes[j] | ||
val_self = self[ind_self]._data[0] | ||
|
||
if other_indexes[j] == -1: | ||
val_other = _fill_value | ||
else: | ||
ind_other = other_indexes[j] | ||
val_other = other[ind_other]._data[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if self_indexes[j] == -1: | |
val_self = _fill_value | |
else: | |
ind_self = self_indexes[j] | |
val_self = self[ind_self]._data[0] | |
if other_indexes[j] == -1: | |
val_other = _fill_value | |
else: | |
ind_other = other_indexes[j] | |
val_other = other[ind_other]._data[0] | |
self_idx = self_indexes[j] | |
if self_idx == -1: | |
val_self = _fill_value | |
else: | |
val_self = self[self_idx]._data[0] | |
other_idx = other_indexes[j] | |
if other_idx == -1: | |
val_other = _fill_value | |
else: | |
val_other = other[other_idx]._data[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or
if self_indexes[j] == -1: | |
val_self = _fill_value | |
else: | |
ind_self = self_indexes[j] | |
val_self = self[ind_self]._data[0] | |
if other_indexes[j] == -1: | |
val_other = _fill_value | |
else: | |
ind_other = other_indexes[j] | |
val_other = other[ind_other]._data[0] | |
self_idx = self_indexes[j] | |
val_self = _fill_value if self_idx == -1 else self[self_idx]._data[0] | |
other_idx = other_indexes[j] | |
val_other = _fill_value if other_idx == -1 else other[other_idx]._data[0] |
min_periods = 2 | ||
|
||
if min_periods < 2: | ||
min_periods = 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if fill_value is not None: | |
_fill_value = numpy.nan if fill_value is None else fill_value: |
|
||
if not isinstance(fill_value, (types.Omitted, types.NoneType, types.Number)) and fill_value is not None: | ||
ty_checker.raise_exc(fill_value, 'number', 'fill_value') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self_idx (and other_idx) is position in the Series, not the index, so instead of using getitem on a Series, that performs index lookup and returns a Series, so that you have to take _data[0] from it, you can just write:
val_self = self[self_idx]._data[0] | |
val_self = self._data[self_idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks OK, but I would reorganize and extend existing tests a bit.
Limitations | ||
----------- | ||
- Only supports the case when data in series of the same type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is not correct - impl handles all cases. For the next line we need exact definition of difference to pandas, e.g:
- Only supports the case when data in series of the same type. | |
- Resulting series dtype may be wider than in pandas due to type-stability requirements and depends on fill_value dtype and result of series indexes alignment. |
sdc/tests/test_series.py
Outdated
pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2)) | ||
|
||
@skip_numba_jit | ||
def test_series_combine_float3264(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test has incorrect code, which should be corrected probably:
S1 = pd.Series([np.float64(1), np.float64(2),
np.float64(3), np.float64(4), np.float64(5)])
S2 = pd.Series([np.float32(1), np.float32(2),
np.float32(3), np.float32(4), np.float32(5)])
S2.dtype will be float64 on Win, not float32. Moreover, series dtype should be specified this way:
S1 = pd.Series([1, 2, 3, 4, 5], dtype=np.int64)
S2 = pd.Series([1, 2, 3, 4, 5], dtype=np.int32)
chunks = parallel_chunks(len_val) | ||
for i in prange(len(chunks)): | ||
chunk = chunks[i] | ||
for j in range(chunk.start, chunk.stop): | ||
self_idx = self_indexes[j] | ||
val_self = _fill_value if self_idx == -1 else self._data[self_idx] | ||
|
||
other_idx = other_indexes[j] | ||
val_other = _fill_value if other_idx == -1 else other._data[other_idx] | ||
|
||
result[j] = func(val_self, val_other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not?
chunks = parallel_chunks(len_val) | |
for i in prange(len(chunks)): | |
chunk = chunks[i] | |
for j in range(chunk.start, chunk.stop): | |
self_idx = self_indexes[j] | |
val_self = _fill_value if self_idx == -1 else self._data[self_idx] | |
other_idx = other_indexes[j] | |
val_other = _fill_value if other_idx == -1 else other._data[other_idx] | |
result[j] = func(val_self, val_other) | |
result = numpy.empty(len_val, res_dtype) | |
for i in prange(len(result)): | |
self_idx, other_idx = self_indexes[i], other_indexes[i] | |
val_self = _fill_value if self_idx == -1 else self._data[self_idx] | |
val_other = _fill_value if other_idx == -1 else other._data[other_idx] | |
result[i] = func(val_self, val_other) |
sdc/tests/test_series.py
Outdated
S2 = pd.Series([6., 21., 3., 5.]) | ||
with self.assertRaises(AssertionError): | ||
hpat_func(S1, S2) | ||
pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment for tests: not all combinations of input series dtypes and fill_value are tested e.g. the one I mentioned before - where float fill_value is assigned to otherwise int series. There are no tests with series with non-default indexes (we refer to samelen, but it's not fully correct - series may have same len, but not same indexes), and no tests for checking func impact on result dtype, so it's hard to see from such tests what's really tested and what is not. So the suggestion is to organize tests in a different manner:
- product of diff series dtypes (default int, int64, float64),
same series indexes (but not same series sizes),
fill_value is specified and of different dtypes (None, np.nan, 4, 4.2)
Covers: test_series_combine_value_samelen - product of diff series dtypes (default int, int64, float64),
same series indexes (but not same series sizes),
with fill_value is omitted
Covers: test_series_combine_float3264, test_series_combine_integer_samelen, test_series_combine_samelen, test_series_combine_different_types - product of diff series dtypes (default int, int64, float64),
series indexes that align with and without -1 in indexers
fill_value is specified and of different dtypes (None, np.nan, 4, 4.2)
Covers: test_series_combine_integer, test_series_combine_value - product of diff series dtypes (default int, int64, float64),
series indexes that align with and without -1 in indexers
fill_value is omitted
Covers: test_series_combine, test_series_combine_assert1, test_series_combine_assert2, test_series_combine_different_types
New test:
5. (for testing func changes dtype properly)
product of diff series dtypes (default int, int64, float64),
same series indexes (but not same series sizes),
fill_value = 0
with diff functions (chaning and not chaning res dtype e.g. preserving int domain, e.g. ** and + and not, e.g. /)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, test 1 can look like this (it can also be split into two: one when we use check_dtype=False and one when we don't):
def test_series_combine_same_index_fill_value(self):
def test_impl(S1, S2):
return S1.combine(S2, lambda a, b: 2 * a + b)
hpat_func = self.jit(test_impl)
n = 11
np.random.seed(0)
A = np.random.randint(-100, 100, n)
B = np.arange(n) * 2 + 1
series_index = 1 + np.arange(n)
series_dtypes = [None, np.int64, np.float64]
fill_values = [None, np.nan, 4, 4.2]
for dtype1, dtype2, fill_value in product(series_dtypes, series_dtypes, fill_values):
S1 = pd.Series(A, index=series_index, dtype=dtype1)
S2 = pd.Series(B, index=series_index, dtype=dtype2)
with self.subTest(S1_dtype=dtype1, S2_dtype=dtype2, fill_value=fill_value):
result = hpat_func(S1, S2)
result_ref = test_impl(S1, S2)
# check_dtype=False due to difference to pandas in some cases
pd.testing.assert_series_equal(result, result_ref, check_dtype=False)
_func_name = 'Method isin().' | ||
|
||
ty_checker = TypeChecker(_func_name) | ||
ty_checker.check(self, SeriesType) | ||
|
||
if not isinstance(values, (types.Set, types.List)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rubtsowa What? This sounds like a bug...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kozlov-alexey This 'bug' in function sdc_join_series_indexes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rubtsowa Then it should be fixed (please create a JIRA case with a reproducer) before this is merged. Adapting to a bug is no way. @AlexanderKalistratov correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very old PR. It is better to close it.
No description provided.