
[python] avoid data copy where possible #2383

Merged
StrikerRUS merged 10 commits into master from numpy_copy on Sep 26, 2019
Conversation

StrikerRUS (Collaborator)

Closed #2380.

@@ -296,7 +296,9 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical):
             raise ValueError("DataFrame.dtypes for data must be int, float or bool.\n"
                              "Did not expect the data types in the following fields: "
                              + ', '.join(data.columns[bad_indices]))
-        data = data.values.astype('float')
+        data = data.values
+        if data.dtype != np.float32 and data.dtype != np.float64:
StrikerRUS (Collaborator, Author)

Why not simply 'float'?

import numpy as np

arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype=np.float32)
arr.dtype == 'float'  # False
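
In NumPy, 'float' is an alias for the platform double, i.e. float64, so comparing a float32 dtype against it is False, and converting with it would upcast float32 data:

import numpy as np

np.dtype('float') == np.float64  # True: 'float' means float64
np.dtype('float') == np.float32  # False: float32 data would be upcast by astype('float')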

StrikerRUS commented Sep 5, 2019

@guolinke
I'd like to use precise type names to avoid possible problems. Could you please help me identify whether it is int32 or int64?

        if importance_type_int == 0:
            return result.astype(int)

guolinke commented Sep 6, 2019

@StrikerRUS I think it is int32, refer to:

int importance_type,
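
Casting with an explicit np.int32 keeps the width in sync with that C signature on every platform, whereas a plain int is platform-dependent:

import numpy as np

result = np.array([3.0, 1.0, 2.0])
result.astype(np.int32).dtype  # int32 everywhere, matches the C API's int
result.astype(int).dtype       # int64 on most 64-bit Linux/macOS builds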

guolinke commented Sep 6, 2019

@StrikerRUS
seems related to this PR:
#2384

maybe a better solution is to save the numpy type inside lgb.Dataset?

StrikerRUS (Collaborator, Author)

@guolinke

> @StrikerRUS I think it is int32, refer to:

Thanks!

> maybe a better solution is to save the numpy type inside lgb.Dataset?

Sorry, didn't get it.

@@ -80,10 +80,7 @@ def list_to_1d_numpy(data, dtype=np.float32, name='list'):
     elif isinstance(data, Series):
         if _get_bad_pandas_dtypes([data.dtypes]):
             raise ValueError('Series.dtypes must be int, float or bool')
-        if hasattr(data.values, 'values'):  # SparseArray
StrikerRUS (Collaborator, Author)

Fix FutureWarning:

FutureWarning: The SparseArray.values attribute is deprecated and will be removed in a future version. You can use `np.asarray(...)` or the `.to_dense()` method instead.

-            return data.values.values.astype(dtype)
-        else:
-            return data.values.astype(dtype)
+        return np.array(data, dtype=dtype, copy=False)  # SparseArray should be supported as well
StrikerRUS (Collaborator, Author)

Dense:

import numpy as np
import pandas as pd

y = pd.Series([0., 1., 2., 3.])
id(y.values) == id(np.array(y, copy=False))  # True

Sparse:

import numpy as np
import pandas as pd

y = pd.Series(pd.SparseArray([0., 1., 2., 3.]))
id(y.values.values) == id(np.array(y, copy=False))  # True
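
Without copy=False, np.array always allocates a fresh buffer, even for a plain dense Series:

import numpy as np
import pandas as pd

y = pd.Series([0., 1., 2., 3.])
id(y.values) == id(np.array(y))  # False: the default copy=True copies the data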

@@ -311,7 +313,9 @@ def _label_from_pandas(label):
             raise ValueError('DataFrame for label cannot have multiple columns')
         if _get_bad_pandas_dtypes(label.dtypes):
             raise ValueError('DataFrame.dtypes for label must be int, float or bool')
-        label = label.values.astype('float').flatten()
+        label = np.ravel(label.values)
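
For reference, np.ravel returns a view when it can, while the old .flatten() call always copies:

import numpy as np

a = np.arange(6.0).reshape(2, 3)
np.ravel(a).base is a  # True: just a view, no copy
a.flatten().base is a  # False: flatten always allocates a new array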

StrikerRUS mentioned this pull request Sep 10, 2019
StrikerRUS (Collaborator, Author)

@jameslamb Would you like to take a look at this and submit your review as you have already dug into this?

jameslamb (Collaborator) left a comment

Left one comment, otherwise looks good to me!

-        else:
-            return data.values.astype(dtype)
+        return np.array(data, dtype=dtype, copy=False)  # SparseArray should be supported as well
jameslamb (Collaborator)

Whenever I see a comment like this, I think "that should definitely be a unit test". Do we already have a unit test in the Python package that covers that case where the input data is a sparse array? cd py-pkg && git grep -i sparse didn't return any test files.

@StrikerRUS I think either this test should be added as part of accepting this PR or we should create an issue documenting it. Otherwise we may reintroduce a bug in the behavior in the future without knowing it. What do you think?

StrikerRUS (Collaborator, Author)

@jameslamb

> Do we already have a unit test in the Python package that covers that case where the input data is a sparse array?

Yep! Even 2! 😄

    @unittest.skipIf(not lgb.compat.PANDAS_INSTALLED, 'pandas is not installed')
    def test_pandas_sparse(self):
        import pandas as pd
        X = pd.DataFrame({"A": pd.SparseArray(np.random.permutation([0, 1, 2] * 100)),
                          "B": pd.SparseArray(np.random.permutation([0.0, 0.1, 0.2, -0.1, 0.2] * 60)),
                          "C": pd.SparseArray(np.random.permutation([True, False] * 150))})
        y = pd.Series(pd.SparseArray(np.random.permutation([0, 1] * 150)))
        X_test = pd.DataFrame({"A": pd.SparseArray(np.random.permutation([0, 2] * 30)),
                               "B": pd.SparseArray(np.random.permutation([0.0, 0.1, 0.2, -0.1] * 15)),
                               "C": pd.SparseArray(np.random.permutation([True, False] * 30))})
        if pd.__version__ >= '0.24.0':
            for dtype in pd.concat([X.dtypes, X_test.dtypes, pd.Series(y.dtypes)]):
                self.assertTrue(pd.api.types.is_sparse(dtype))
        params = {
            'objective': 'binary',
            'verbose': -1
        }
        lgb_train = lgb.Dataset(X, y)
        gbm = lgb.train(params, lgb_train, num_boost_round=10)
        pred_sparse = gbm.predict(X_test, raw_score=True)
        if hasattr(X_test, 'sparse'):
            pred_dense = gbm.predict(X_test.sparse.to_dense(), raw_score=True)
        else:
            pred_dense = gbm.predict(X_test.to_dense(), raw_score=True)
        np.testing.assert_allclose(pred_sparse, pred_dense)

    @unittest.skipIf(not lgb.compat.PANDAS_INSTALLED, 'pandas is not installed')
    def test_pandas_sparse(self):
        import pandas as pd
        X = pd.DataFrame({"A": pd.SparseArray(np.random.permutation([0, 1, 2] * 100)),
                          "B": pd.SparseArray(np.random.permutation([0.0, 0.1, 0.2, -0.1, 0.2] * 60)),
                          "C": pd.SparseArray(np.random.permutation([True, False] * 150))})
        y = pd.Series(pd.SparseArray(np.random.permutation([0, 1] * 150)))
        X_test = pd.DataFrame({"A": pd.SparseArray(np.random.permutation([0, 2] * 30)),
                               "B": pd.SparseArray(np.random.permutation([0.0, 0.1, 0.2, -0.1] * 15)),
                               "C": pd.SparseArray(np.random.permutation([True, False] * 30))})
        if pd.__version__ >= '0.24.0':
            for dtype in pd.concat([X.dtypes, X_test.dtypes, pd.Series(y.dtypes)]):
                self.assertTrue(pd.api.types.is_sparse(dtype))
        gbm = lgb.sklearn.LGBMClassifier().fit(X, y)
        pred_sparse = gbm.predict(X_test, raw_score=True)
        if hasattr(X_test, 'sparse'):
            pred_dense = gbm.predict(X_test.sparse.to_dense(), raw_score=True)
        else:
            pred_dense = gbm.predict(X_test.to_dense(), raw_score=True)
        np.testing.assert_allclose(pred_sparse, pred_dense)

jameslamb (Collaborator)

Ha oh ok 😊 Tells you how much I have worked on the Python side of the project...I assumed tests would be in python-package (that is how many Python packages are set up), not in a separate folder branched off the repo root. That's what I get for blindly git grep-ing without actually looking at the repo. I am ashamed haha

Thanks!

Collaborator

approved ✅

StrikerRUS (Collaborator, Author)

Tests had already been there when I joined the project, so I can't say much about it... I guess the rule about keeping tests inside the package mainly applies to mono-language repos.

-        if hasattr(data.values, 'values'):  # SparseArray
-            return data.values.values.astype(dtype)
+        if data.dtype == np.float32 or data.dtype == np.float64:
+            return np.array(data, dtype=data.dtype, copy=False)
StrikerRUS (Collaborator, Author)

@guolinke According to this note

* \brief Set vector to a content in info.
* \note
* - \a monotone_constraints only works for ``C_API_DTYPE_INT8``;
* - \a group only works for ``C_API_DTYPE_INT32``;
* - \a label and \a weight only work for ``C_API_DTYPE_FLOAT32``;
* - \a init_score and \a feature_penalty only work for ``C_API_DTYPE_FLOAT64``.

it seems that we cannot allow these memory savings, right?
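
For example, a float64 label would have to be converted to float32 for the C API anyway, so preserving the input dtype saves nothing here (a minimal sketch):

import numpy as np

label = np.array([0.0, 1.0, 1.0], dtype=np.float64)
label32 = label.astype(np.float32)  # unavoidable copy: the C API wants float32
np.array(label32, dtype=np.float32, copy=False) is label32  # True: no extra copy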

guolinke (Collaborator)

Sorry for missing this comment.
Yes, the fields in Dataset don't allow arbitrary data types.

guolinke (Collaborator)

I think we could add a parameter, maybe called new_type, with default value None. When it is not None, convert the data to new_type if it is not already that type.
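
A hypothetical sketch of that idea (the new_type parameter and the helper's shape are assumptions, not the merged code):

import numpy as np

def list_to_1d_numpy(data, name='list', new_type=None):
    # keep the original dtype and avoid a copy when possible
    arr = np.ravel(np.array(data, copy=False))
    if new_type is not None and arr.dtype != new_type:
        arr = arr.astype(new_type)  # convert (copy) only when a type is required
    return arr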

StrikerRUS (Collaborator, Author)

@guolinke It seems we always need to force the type, except for training/test data, which doesn't go through this function. So I can simply remove that if statement.
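
That would leave the conversion always forcing the requested dtype, something like (a sketch, not the exact merged code):

import numpy as np

def _coerce_to_1d(data, dtype=np.float32):
    # np.array still avoids a copy when data already has the requested dtype
    return np.ravel(np.array(data, dtype=dtype, copy=False))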

StrikerRUS (Collaborator, Author)

ping @guolinke

StrikerRUS requested a review from guolinke September 20, 2019 13:53
StrikerRUS requested a review from chivee as a code owner September 25, 2019 12:03
StrikerRUS requested a review from wxchan as a code owner September 25, 2019 12:03
StrikerRUS merged commit d064019 into master Sep 26, 2019
StrikerRUS deleted the numpy_copy branch September 26, 2019 20:38
lock bot locked as resolved and limited conversation to collaborators Mar 10, 2020