Series combine by Rubtsowa · Pull Request #821 · IntelPython/sdc

Rubtsowa · 2020-04-28T10:37:17Z

No description provided.

examples/series/series_combine.py

densmirn · 2020-05-13T09:30:33Z

sdc/datatypes/hpat_pandas_series_functions.py

+
+        len_val = max(len(self), len(other))
+        result = numpy.empty(len_val, self._data.dtype)
+        for ind in range(len_val):


Can we parallelize this method using chunks?

1e-to · 2020-05-13T09:56:34Z

sdc/datatypes/hpat_pandas_series_functions.py

+        if fill_value is None:
+            fill_value = numpy.nan
+
+        len_val = max(len(self), len(other))


And what if all the indexes are different? I think we should use sdc_join_series_indexes to find the length of the resulting series.
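To illustrate the concern, here is a minimal sketch using plain NumPy arrays as stand-ins for series index labels. It is not the sdc_join_series_indexes implementation; the arrays are made up, and unique labels are assumed.

```python
import numpy as np

# Index labels of two hypothetical series that share no labels at all.
index_a = np.array([0, 1, 2])
index_b = np.array([10, 11])

# Length assumed by the reviewed code: the longer of the two inputs.
naive_len = max(len(index_a), len(index_b))

# Length after label alignment: one slot per distinct label.
aligned_len = len(np.union1d(index_a, index_b))

print(naive_len, aligned_len)
```

With disjoint labels the aligned result has five slots, while max(len(self), len(other)) only allocates three.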

1e-to · 2020-05-13T10:20:50Z

sdc/datatypes/hpat_pandas_series_functions.py

+
+        len_val = max(len(self), len(other))
+        result = numpy.empty(len_val, self._data.dtype)
+        for ind in range(len_val):


This only covers series without explicit indexes. Also, it should be rewritten with prange, plus chunking, so that scalability is predictable.

1e-to · 2020-05-13T10:22:02Z

examples/series/series_combine.py

+
+
+@njit
+def series_copy():


Wrong function name.

kozlov-alexey · 2020-05-21T13:18:51Z

sdc/datatypes/hpat_pandas_series_functions.py

+        if fill_value is None:
+            fill_value = numpy.nan


This will leave the type of fill_value undefined at compile time. You can probably use the same approach as in the operators:

sdc/sdc/sdc_function_templates.py

Line 144 in e87095a

_fill_value = numpy.nan if fill_value_is_none == True else fill_value # noqa

kozlov-alexey · 2020-05-21T13:24:59Z

sdc/datatypes/hpat_pandas_series_functions.py

+            fill_value = numpy.nan
+
+        len_val = max(len(self), len(other))
+        result = numpy.empty(len_val, self._data.dtype)


This is actually wrong: the result dtype should be the common dtype of the result of func(a, b), where a and b are series values, and the dtype of _fill_value. The provided tests do not cover this, but e.g. this one (where fill_value is a float and the series are integers) won't pass:

    def test_series_combine_integer_new(self):
        def test_impl(S1, S2):
            return S1.combine(S2, lambda a, b: 2 * a + b, 16.2)
        hpat_func = self.jit(test_impl)

        S1 = pd.Series([1, 2, 3, 4, 5])
        S2 = pd.Series([6, 21, 3, 5])
        result = hpat_func(S1, S2)
        result_ref = test_impl(S1, S2)
        print(f"DEBUG: result:\n{result},\nresult_ref:\n{result_ref}")
        pd.testing.assert_series_equal(result, result_ref)
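The dtype problem can be reproduced with NumPy alone. The sketch below is only an illustration with made-up variable names, not the SDC code: it shows what happens when a float fill value is written into a buffer typed after the integer input, and how a common dtype avoids it.

```python
import numpy as np

int_data = np.array([1, 2, 3], dtype=np.int64)
fill_value = 16.2

# Buffer typed after the input series only: the fractional part is lost.
narrow = np.empty(1, dtype=int_data.dtype)
narrow[0] = fill_value

# Buffer typed after the common dtype of the data and the fill value.
common = np.result_type(int_data.dtype, np.float64)
wide = np.empty(1, dtype=common)
wide[0] = fill_value

print(narrow[0], common, wide[0])
```

In the real implementation the result type of func(a, b) would also have to take part in the promotion, as the comment above points out.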

kozlov-alexey

The result dtype needs to be deduced, and parallelization needs to be added.

densmirn · 2020-05-26T12:59:01Z

sdc/tests/test_series.py

        pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2))

-    @skip_numba_jit
+    @unittest.expectedFailure


Please add a comment explaining why the test is skipped.
@unittest.expectedFailure # ...

@Rubtsowa No need to skip the test if that's how impl is intended to work. Use check_dtype=False in assert_series_equal and add a comment just before this check to refer to SDC Limitation.

densmirn · 2020-05-26T13:07:34Z

sdc/datatypes/hpat_pandas_series_functions.py

+                if self_indexes[j] == -1:
+                    val_self = _fill_value
+                else:
+                    ind_self = self_indexes[j]
+                    val_self = self[ind_self]._data[0]
+
+                if other_indexes[j] == -1:
+                    val_other = _fill_value
+                else:
+                    ind_other = other_indexes[j]
+                    val_other = other[ind_other]._data[0]


Suggested change

-                if self_indexes[j] == -1:
-                    val_self = _fill_value
-                else:
-                    ind_self = self_indexes[j]
-                    val_self = self[ind_self]._data[0]
-
-                if other_indexes[j] == -1:
-                    val_other = _fill_value
-                else:
-                    ind_other = other_indexes[j]
-                    val_other = other[ind_other]._data[0]
+                self_idx = self_indexes[j]
+                if self_idx == -1:
+                    val_self = _fill_value
+                else:
+                    val_self = self[self_idx]._data[0]
+
+                other_idx = other_indexes[j]
+                if other_idx == -1:
+                    val_other = _fill_value
+                else:
+                    val_other = other[other_idx]._data[0]

or

Suggested change

-                if self_indexes[j] == -1:
-                    val_self = _fill_value
-                else:
-                    ind_self = self_indexes[j]
-                    val_self = self[ind_self]._data[0]
-
-                if other_indexes[j] == -1:
-                    val_other = _fill_value
-                else:
-                    ind_other = other_indexes[j]
-                    val_other = other[ind_other]._data[0]
+                self_idx = self_indexes[j]
+                val_self = _fill_value if self_idx == -1 else self[self_idx]._data[0]
+
+                other_idx = other_indexes[j]
+                val_other = _fill_value if other_idx == -1 else other[other_idx]._data[0]

kozlov-alexey · 2020-05-27T08:46:40Z

sdc/datatypes/hpat_pandas_series_functions.py

Suggested change

-        if fill_value is not None:
+        _fill_value = numpy.nan if fill_value is None else fill_value

kozlov-alexey · 2020-05-27T08:57:21Z

sdc/datatypes/hpat_pandas_series_functions.py

@@ -4930,22 +4932,43 @@ def sdc_pandas_series_combine(self, other, func, fill_value=None):
    if not isinstance(fill_value, (types.Omitted, types.NoneType, types.Number)) and fill_value is not None:
        ty_checker.raise_exc(fill_value, 'number', 'fill_value')



self_idx (and other_idx) is a position in the Series, not an index label. Series getitem performs an index lookup and returns a Series, which is why you then have to take _data[0] from it. Instead, you can just write:

Suggested change

-                val_self = self[self_idx]._data[0]
+                val_self = self._data[self_idx]
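The difference between a position and an index label can be shown with two plain NumPy arrays standing in for a series' index and data. This is a simplified model with hypothetical names, not the SDC Series type.

```python
import numpy as np

labels = np.array([10, 20, 30])      # index labels
data = np.array([1.5, 2.5, 3.5])     # underlying values

position = 1

# Positional access: a direct read from the data buffer.
by_position = data[position]

# Label lookup: search the index for the key first, then read the value.
# Treating the position 1 as a label finds nothing, because no label equals 1.
matches_for_position = np.flatnonzero(labels == position)
matches_for_label = np.flatnonzero(labels == 20)
by_label = data[matches_for_label[0]]

print(by_position, len(matches_for_position), by_label)
```

Since the indexers produced by the alignment step already hold positions, reading from the data buffer directly skips the search and cannot mistake a position for a label.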

kozlov-alexey

Overall looks OK, but I would reorganize and extend existing tests a bit.

kozlov-alexey · 2020-05-27T10:15:54Z

sdc/datatypes/hpat_pandas_series_functions.py

+
+    Limitations
+    -----------
+    - Only supports the case when data in series of the same type.


This line is not correct: the impl handles all cases. For the next line we need an exact definition of the difference from pandas, e.g.:

Suggested change

-    - Only supports the case when data in series of the same type.
+    - Resulting series dtype may be wider than in pandas due to type-stability requirements and depends on fill_value dtype and result of series indexes alignment.

kozlov-alexey · 2020-05-27T10:25:40Z

sdc/tests/test_series.py

        pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2))

-    @skip_numba_jit
    def test_series_combine_float3264(self):


This test has incorrect code, which should probably be corrected:

    S1 = pd.Series([np.float64(1), np.float64(2), np.float64(3), np.float64(4), np.float64(5)])
    S2 = pd.Series([np.float32(1), np.float32(2), np.float32(3), np.float32(4), np.float32(5)])

S2.dtype will be float64 on Windows, not float32. Moreover, the series dtype should be specified this way:

    S1 = pd.Series([1, 2, 3, 4, 5], dtype=np.int64)
    S2 = pd.Series([1, 2, 3, 4, 5], dtype=np.int32)
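The same point about explicit dtypes can be seen with NumPy arrays, used here as a stand-in for pandas Series (an assumption made to keep the sketch minimal). Without a dtype argument, the integer type is whatever the platform and NumPy version choose by default; with one, it is fixed.

```python
import numpy as np

values = [1, 2, 3, 4, 5]

# dtype left to NumPy: the default integer type can differ between platforms.
inferred = np.array(values)

# dtype pinned explicitly: identical everywhere.
explicit64 = np.array(values, dtype=np.int64)
explicit32 = np.array(values, dtype=np.int32)

print(inferred.dtype, explicit64.dtype, explicit32.dtype)
```

A test that depends on a specific pair of dtypes is only reproducible across platforms when both are pinned this way.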

kozlov-alexey · 2020-05-27T12:08:59Z

sdc/datatypes/hpat_pandas_series_functions.py

+        chunks = parallel_chunks(len_val)
+        for i in prange(len(chunks)):
+            chunk = chunks[i]
+            for j in range(chunk.start, chunk.stop):
+                self_idx = self_indexes[j]
+                val_self = _fill_value if self_idx == -1 else self._data[self_idx]
+
+                other_idx = other_indexes[j]
+                val_other = _fill_value if other_idx == -1 else other._data[other_idx]
+
+                result[j] = func(val_self, val_other)


Why not?

Suggested change

-        chunks = parallel_chunks(len_val)
-        for i in prange(len(chunks)):
-            chunk = chunks[i]
-            for j in range(chunk.start, chunk.stop):
-                self_idx = self_indexes[j]
-                val_self = _fill_value if self_idx == -1 else self._data[self_idx]
-
-                other_idx = other_indexes[j]
-                val_other = _fill_value if other_idx == -1 else other._data[other_idx]
-
-                result[j] = func(val_self, val_other)
+        result = numpy.empty(len_val, res_dtype)
+        for i in prange(len(result)):
+            self_idx, other_idx = self_indexes[i], other_indexes[i]
+            val_self = _fill_value if self_idx == -1 else self._data[self_idx]
+            val_other = _fill_value if other_idx == -1 else other._data[other_idx]
+            result[i] = func(val_self, val_other)
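For readers without the SDC sources at hand, the flat loop can be sketched as a standalone function. This is an illustrative reference version only: it uses a plain range instead of prange so that it runs without Numba, and the function name, parameters and sample data are hypothetical.

```python
import numpy as np

def combine_with_indexers(data_a, data_b, idx_a, idx_b, func, fill_value, res_dtype):
    """Apply func pairwise; an indexer value of -1 means the label is missing on that side."""
    result = np.empty(len(idx_a), dtype=res_dtype)
    for i in range(len(result)):  # the reviewed code would use prange here
        a_pos, b_pos = idx_a[i], idx_b[i]
        val_a = fill_value if a_pos == -1 else data_a[a_pos]
        val_b = fill_value if b_pos == -1 else data_b[b_pos]
        result[i] = func(val_a, val_b)
    return result

data_a = np.array([1, 2, 3])
data_b = np.array([10, 20])
idx_a = np.array([0, 1, 2])
idx_b = np.array([0, 1, -1])   # third label exists only in the first series

out = combine_with_indexers(data_a, data_b, idx_a, idx_b,
                            lambda a, b: 2 * a + b, 0.5, np.float64)
print(out)
```

Each iteration writes to its own slot of result and reads only from the inputs, which is what makes a single parallel loop over the result a natural fit.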

kozlov-alexey · 2020-05-27T12:13:27Z

sdc/tests/test_series.py

        S2 = pd.Series([6., 21., 3., 5.])
-        with self.assertRaises(AssertionError):
-            hpat_func(S1, S2)
+        pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2))


General comment on the tests: not all combinations of input series dtypes and fill_value are tested, e.g. the one I mentioned before, where a float fill_value is applied to otherwise int series. There are no tests for series with non-default indexes (we refer to samelen, but that's not fully correct: series may have the same length but different indexes), and no tests checking the impact of func on the result dtype, so it's hard to see from such tests what's really tested and what is not. So the suggestion is to organize the tests differently:

1. product of diff series dtypes (default int, int64, float64),
   same series indexes (but not same series sizes),
   fill_value is specified and of different dtypes (None, np.nan, 4, 4.2)
   Covers: test_series_combine_value_samelen

2. product of diff series dtypes (default int, int64, float64),
   same series indexes (but not same series sizes),
   fill_value is omitted
   Covers: test_series_combine_float3264, test_series_combine_integer_samelen, test_series_combine_samelen, test_series_combine_different_types

3. product of diff series dtypes (default int, int64, float64),
   series indexes that align with and without -1 in indexers,
   fill_value is specified and of different dtypes (None, np.nan, 4, 4.2)
   Covers: test_series_combine_integer, test_series_combine_value

4. product of diff series dtypes (default int, int64, float64),
   series indexes that align with and without -1 in indexers,
   fill_value is omitted
   Covers: test_series_combine, test_series_combine_assert1, test_series_combine_assert2, test_series_combine_different_types

New test:
5. (for testing that func changes the dtype properly)
   product of diff series dtypes (default int, int64, float64),
   same series indexes (but not same series sizes),
   fill_value = 0,
   with diff functions (changing and not changing the result dtype: preserving the int domain, e.g. ** and +, and not preserving it, e.g. /)
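The effect that the new test is meant to catch can be seen directly in NumPy: some operators keep integer inputs in the integer domain, while true division does not. The arrays below are arbitrary sample values.

```python
import numpy as np

a = np.array([4, 9], dtype=np.int64)
b = np.array([2, 3], dtype=np.int64)

add_dtype = (a + b).dtype     # stays integer
pow_dtype = (a ** b).dtype    # stays integer
div_dtype = (a / b).dtype     # true division always produces floats

print(add_dtype, pow_dtype, div_dtype)
```

A combine implementation that sizes its result buffer from the input dtypes alone would be correct for the first two functions and wrong for the third.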

For example, test 1 can look like this (it can also be split into two: one when we use check_dtype=False and one when we don't):

    def test_series_combine_same_index_fill_value(self):
        # fill_value is passed through so that each sub-test actually exercises it
        def test_impl(S1, S2, fill_value):
            return S1.combine(S2, lambda a, b: 2 * a + b, fill_value)
        hpat_func = self.jit(test_impl)

        n = 11
        np.random.seed(0)
        A = np.random.randint(-100, 100, n)
        B = np.arange(n) * 2 + 1
        series_index = 1 + np.arange(n)
        series_dtypes = [None, np.int64, np.float64]
        fill_values = [None, np.nan, 4, 4.2]
        for dtype1, dtype2, fill_value in product(series_dtypes, series_dtypes, fill_values):
            S1 = pd.Series(A, index=series_index, dtype=dtype1)
            S2 = pd.Series(B, index=series_index, dtype=dtype2)
            with self.subTest(S1_dtype=dtype1, S2_dtype=dtype2, fill_value=fill_value):
                result = hpat_func(S1, S2, fill_value)
                result_ref = test_impl(S1, S2, fill_value)
                # check_dtype=False due to difference to pandas in some cases
                pd.testing.assert_series_equal(result, result_ref, check_dtype=False)
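As a rough sense of the coverage such a test gives, the parameter grid can be enumerated on its own. The dtype names are written as strings here purely so that the sketch needs nothing beyond the standard library.

```python
from itertools import product

series_dtypes = [None, "int64", "float64"]
fill_values = [None, float("nan"), 4, 4.2]

# One tuple per sub-test: (dtype of S1, dtype of S2, fill_value).
cases = list(product(series_dtypes, series_dtypes, fill_values))

print(len(cases))
```

Every pairing of the three series dtypes is crossed with every fill value, so a single test method replaces a set of hand-written variants.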

pep8speaks · 2020-06-01T13:01:44Z

Hello @Rubtsowa! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-01 13:52:44 UTC

kozlov-alexey · 2020-06-16T12:07:23Z

sdc/datatypes/hpat_pandas_series_functions.py

@Rubtsowa What? This sounds like a bug...

@kozlov-alexey This 'bug' is in the function sdc_join_series_indexes.

@Rubtsowa Then it should be fixed (please create a JIRA case with a reproducer) before this is merged. Adapting to a bug is not the way to go. @AlexanderKalistratov correct?

PokhodenkoSA

Very old PR. It is better to close it.

Rubtsowa added 2 commits April 28, 2020 13:30

Impl Series.combine

7ec05f3

merge

b576e03

Rubtsowa added the Ready for Review label Apr 28, 2020

Rubtsowa requested review from AlexanderKalistratov, PokhodenkoSA and densmirn April 28, 2020 10:37

Rubtsowa added 2 commits April 28, 2020 13:40

change comment

95d233e

Merge branch 'master' of https://github.com/IntelPython/hpat into ser…

2b95e9d

…ies_combine

densmirn suggested changes May 13, 2020

change example

2104aae

1e-to suggested changes May 13, 2020

Rubtsowa added 3 commits May 13, 2020 15:36

use sdc_join_series_indexes

aeef026

change

f09b792

Merge branch 'master' of https://github.com/IntelPython/hpat into ser…

fafcaed

…ies_combine

kozlov-alexey reviewed May 21, 2020

kozlov-alexey suggested changes May 21, 2020

kozlov-alexey added Waiting on author and removed Ready for Review labels May 21, 2020

Rubtsowa added 2 commits May 25, 2020 16:54

for from chunks, change dtype for result array

b1ed1b3

Merge branch 'master' of https://github.com/IntelPython/hpat into ser…

4fdb369

…ies_combine

Rubtsowa added Ready for Review and removed Waiting on author labels May 25, 2020

Rubtsowa added 2 commits May 26, 2020 12:22

change 'if' on 'if-else'

6d930d8

change for

c85feff

1e-to approved these changes May 26, 2020

Rubtsowa added 2 commits May 26, 2020 13:36

change for

18771f2

change for

8628e31

densmirn reviewed May 26, 2020

add comment in skip test, some change in impl

6f37f53

densmirn approved these changes May 26, 2020

kozlov-alexey reviewed May 27, 2020

Rubtsowa added 3 commits May 27, 2020 12:40

change if-else in 1 line

f66ace3

Merge branch 'master' of https://github.com/IntelPython/hpat into ser…

9fc55a7

…ies_combine

change test

e7dc1f5

kozlov-alexey reviewed May 27, 2020

change tests

6c78b8e

Rubtsowa added 2 commits June 1, 2020 16:08

fix problem with PEP8

d623b89

Merge branch 'master' of https://github.com/IntelPython/hpat into ser…

ecd3dba

…ies_combine

kozlov-alexey reviewed Jun 16, 2020

PokhodenkoSA reviewed Dec 6, 2021

6 participants
