[python] avoid data copy where possible #2383
Conversation
```diff
@@ -296,7 +296,9 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical):
             raise ValueError("DataFrame.dtypes for data must be int, float or bool.\n"
                              "Did not expect the data types in the following fields: "
                              + ', '.join(data.columns[bad_indices]))
-        data = data.values.astype('float')
+        data = data.values
+        if data.dtype != np.float32 and data.dtype != np.float64:
```
Why not simply `'float'`?

```python
import numpy as np
arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype=np.float32)
arr.dtype == 'float'  # False
```
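To spell out why the explicit `float32`/`float64` check matters (an illustrative sketch, not code from the PR): NumPy's `'float'` string is an alias for `float64`, so comparing a `float32` array's dtype against `'float'` is `False`, and an unconditional `astype('float')` would upcast `float32` data to `float64`, doubling memory and forcing a copy.

```python
import numpy as np

# 'float' is NumPy's alias for float64.
print(np.dtype('float') == np.float64)  # True

arr = np.array([1.0, 2.0], dtype=np.float32)
print(arr.dtype == 'float')             # False: arr is float32, not float64

# An unconditional astype('float') would upcast float32 to float64,
# allocating a new (twice as large) buffer.
print(arr.astype('float').dtype)        # float64
```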
@guolinke LightGBM/python-package/lightgbm/basic.py, lines 2510 to 2511 in 29525ff
@StrikerRUS I think it is int32; refer to line 1539 in 317b1bf.
@StrikerRUS maybe a better solution is to save the numpy type inside lgb.Dataset?
Thanks!
Sorry, didn't get it.
```diff
@@ -80,10 +80,7 @@ def list_to_1d_numpy(data, dtype=np.float32, name='list'):
     elif isinstance(data, Series):
         if _get_bad_pandas_dtypes([data.dtypes]):
             raise ValueError('Series.dtypes must be int, float or bool')
-        if hasattr(data.values, 'values'):  # SparseArray
```
Fix FutureWarning:

> FutureWarning: The SparseArray.values attribute is deprecated and will be removed in a future version. You can use `np.asarray(...)` or the `.to_dense()` method instead.
```diff
-            return data.values.values.astype(dtype)
-        else:
-            return data.values.astype(dtype)
+        return np.array(data, dtype=dtype, copy=False)  # SparseArray should be supported as well
```
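A sketch of how the deprecated `SparseArray.values` access can be avoided, as the warning suggests (illustrative; the `try`/`except` covers the constructor rename across pandas versions, and note that in NumPy 2.0 `np.array(..., copy=False)` raises when a copy is unavoidable, so `np.asarray` is used here):

```python
import numpy as np
import pandas as pd

# The sparse constructor moved between pandas versions:
# pd.SparseArray in older releases, pd.arrays.SparseArray later.
try:
    sparse = pd.arrays.SparseArray([0.0, 1.0, 0.0, 2.0])
except AttributeError:
    sparse = pd.SparseArray([0.0, 1.0, 0.0, 2.0])
s = pd.Series(sparse)

# np.asarray densifies the sparse Series without ever touching the
# deprecated SparseArray.values attribute.
dense = np.asarray(s, dtype=np.float32)
print(dense.dtype)    # float32
print(dense.tolist()) # [0.0, 1.0, 0.0, 2.0]
```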
Dense:

```python
import numpy as np
import pandas as pd
y = pd.Series([0., 1., 2., 3.])
id(y.values) == id(np.array(y, copy=False))  # True
```

Sparse:

```python
import numpy as np
import pandas as pd
y = pd.Series(pd.SparseArray([0., 1., 2., 3.]))
id(y.values.values) == id(np.array(y, copy=False))  # True
```
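The `id()` checks can also be phrased with `np.shares_memory`, which makes the copy-avoidance explicit. An illustrative sketch (not from the PR; `np.asarray` is used because NumPy 2.0 changed `np.array(..., copy=False)` to raise when a copy is unavoidable): the copy is skipped only when the requested dtype already matches the data.

```python
import numpy as np

x = np.arange(4, dtype=np.float64)

# Matching dtype: np.asarray returns the same underlying buffer, no copy.
print(np.shares_memory(np.asarray(x, dtype=np.float64), x))  # True

# Mismatched dtype: the cast forces a fresh allocation.
print(np.shares_memory(np.asarray(x, dtype=np.float32), x))  # False
```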
python-package/lightgbm/basic.py
Outdated
```diff
@@ -311,7 +313,9 @@ def _label_from_pandas(label):
             raise ValueError('DataFrame for label cannot have multiple columns')
         if _get_bad_pandas_dtypes(label.dtypes):
             raise ValueError('DataFrame.dtypes for label must be int, float or bool')
-        label = label.values.astype('float').flatten()
+        label = np.ravel(label.values)
```
`ravel` vs `flatten`: https://stackoverflow.com/a/28930580
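A minimal illustration of the difference the link explains (standard NumPy semantics, not code from the PR): `ravel` returns a view of the original data whenever it can, while `flatten` always allocates a fresh copy.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
r = a.ravel()    # view when possible: no data copy for a contiguous array
f = a.flatten()  # always returns a fresh copy

a[0, 0] = 99.0
print(r[0])  # 99.0 -- the view sees the change
print(f[0])  # 1.0  -- the copy does not
print(np.shares_memory(a, r), np.shares_memory(a, f))  # True False
```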
@jameslamb Would you like to take a look at this and submit your review, as you have already dug into this?
Left one comment, otherwise looks good to me!
python-package/lightgbm/basic.py
Outdated
```diff
-        else:
-            return data.values.astype(dtype)
+        return np.array(data, dtype=dtype, copy=False)  # SparseArray should be supported as well
```
Whenever I see a comment like this, I think "that should definitely be a unit test". Do we already have a unit test in the Python package that covers the case where the input data is a sparse array? `cd py-pkg && git grep -i sparse` didn't return any test files.
@StrikerRUS I think either this test should be added as part of accepting this PR or we should create an issue documenting it. Otherwise we may reintroduce a bug in the behavior in the future without knowing it. What do you think?
Do we already have a unit test in the Python package that covers that case where the input data is a sparse array?
Yep! Even 2! 😄
LightGBM/tests/python_package_test/test_engine.py, lines 716 to 740 in 1556642:

```python
@unittest.skipIf(not lgb.compat.PANDAS_INSTALLED, 'pandas is not installed')
def test_pandas_sparse(self):
    import pandas as pd
    X = pd.DataFrame({"A": pd.SparseArray(np.random.permutation([0, 1, 2] * 100)),
                      "B": pd.SparseArray(np.random.permutation([0.0, 0.1, 0.2, -0.1, 0.2] * 60)),
                      "C": pd.SparseArray(np.random.permutation([True, False] * 150))})
    y = pd.Series(pd.SparseArray(np.random.permutation([0, 1] * 150)))
    X_test = pd.DataFrame({"A": pd.SparseArray(np.random.permutation([0, 2] * 30)),
                           "B": pd.SparseArray(np.random.permutation([0.0, 0.1, 0.2, -0.1] * 15)),
                           "C": pd.SparseArray(np.random.permutation([True, False] * 30))})
    if pd.__version__ >= '0.24.0':
        for dtype in pd.concat([X.dtypes, X_test.dtypes, pd.Series(y.dtypes)]):
            self.assertTrue(pd.api.types.is_sparse(dtype))
    params = {
        'objective': 'binary',
        'verbose': -1
    }
    lgb_train = lgb.Dataset(X, y)
    gbm = lgb.train(params, lgb_train, num_boost_round=10)
    pred_sparse = gbm.predict(X_test, raw_score=True)
    if hasattr(X_test, 'sparse'):
        pred_dense = gbm.predict(X_test.sparse.to_dense(), raw_score=True)
    else:
        pred_dense = gbm.predict(X_test.to_dense(), raw_score=True)
    np.testing.assert_allclose(pred_sparse, pred_dense)
```
LightGBM/tests/python_package_test/test_sklearn.py, lines 280 to 299 in 1556642:

```python
@unittest.skipIf(not lgb.compat.PANDAS_INSTALLED, 'pandas is not installed')
def test_pandas_sparse(self):
    import pandas as pd
    X = pd.DataFrame({"A": pd.SparseArray(np.random.permutation([0, 1, 2] * 100)),
                      "B": pd.SparseArray(np.random.permutation([0.0, 0.1, 0.2, -0.1, 0.2] * 60)),
                      "C": pd.SparseArray(np.random.permutation([True, False] * 150))})
    y = pd.Series(pd.SparseArray(np.random.permutation([0, 1] * 150)))
    X_test = pd.DataFrame({"A": pd.SparseArray(np.random.permutation([0, 2] * 30)),
                           "B": pd.SparseArray(np.random.permutation([0.0, 0.1, 0.2, -0.1] * 15)),
                           "C": pd.SparseArray(np.random.permutation([True, False] * 30))})
    if pd.__version__ >= '0.24.0':
        for dtype in pd.concat([X.dtypes, X_test.dtypes, pd.Series(y.dtypes)]):
            self.assertTrue(pd.api.types.is_sparse(dtype))
    gbm = lgb.sklearn.LGBMClassifier().fit(X, y)
    pred_sparse = gbm.predict(X_test, raw_score=True)
    if hasattr(X_test, 'sparse'):
        pred_dense = gbm.predict(X_test.sparse.to_dense(), raw_score=True)
    else:
        pred_dense = gbm.predict(X_test.to_dense(), raw_score=True)
    np.testing.assert_allclose(pred_sparse, pred_dense)
```
Ha oh ok 😊 Tells you how much I have worked on the Python side of the project... I assumed tests would be in python-package (that is how many Python packages are set up), not in a separate folder branched off the repo root. That's what I get for blindly `git grep`-ing without actually looking at the repo. I am ashamed haha
Thanks!
approved ✅
Tests had already been there when I joined the project. So, I can't say much about it... I guess, the rule about tests inside the package is crucial for mono-language repos.
python-package/lightgbm/basic.py
Outdated
```diff
-        if hasattr(data.values, 'values'):  # SparseArray
-            return data.values.values.astype(dtype)
+    if data.dtype == np.float32 or data.dtype == np.float64:
+        return np.array(data, dtype=data.dtype, copy=False)
```
@guolinke According to this note
LightGBM/include/LightGBM/c_api.h, lines 310 to 315 in 1556642:

```c
 * \brief Set vector to a content in info.
 * \note
 * - \a monotone_constraints only works for ``C_API_DTYPE_INT8``;
 * - \a group only works for ``C_API_DTYPE_INT32``;
 * - \a label and \a weight only work for ``C_API_DTYPE_FLOAT32``;
 * - \a init_score and \a feature_penalty only work for ``C_API_DTYPE_FLOAT64``.
```
it seems that we cannot allow these memory savings, right?
Sorry for missing this comment. Yes, the fields in the Dataset don't allow arbitrary data types.
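A sketch of the constraint under discussion (the field-to-dtype mapping follows the c_api.h note quoted above; the helper and its name are illustrative, not LightGBM's actual code): since the C API fixes one dtype per field, a conversion helper must force that dtype, and it can skip the copy only when the input already matches.

```python
import numpy as np

# Expected dtype per Dataset field, per the c_api.h note quoted earlier.
# (This mapping is written out for illustration only.)
FIELD_DTYPES = {
    'label': np.float32,
    'weight': np.float32,
    'group': np.int32,
    'init_score': np.float64,
}

def to_field_array(data, field):
    """Coerce data to the dtype the C API expects for this field.
    np.asarray only copies when the dtype actually differs."""
    return np.asarray(data, dtype=FIELD_DTYPES[field])

labels = np.array([0.0, 1.0, 1.0], dtype=np.float32)
out = to_field_array(labels, 'label')
print(np.shares_memory(out, labels))  # True: already float32, no copy made

scores = np.array([0.1, 0.2], dtype=np.float32)
out2 = to_field_array(scores, 'init_score')
print(out2.dtype)  # float64: the C API requires a cast (and a copy) here
```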
I think I could add a parameter, maybe called `new_type`, with a default value of `None`. When it is not `None`, convert the data to the new type if it is not already that type.
@guolinke Seems we always need to force the type, except for training/test data, but that doesn't go through this function. So, I can simply remove that `if` statement.
ping @guolinke
Closed #2380.