[python-package] Create Dataset from multiple data files #4089
Conversation
Other potential improvements: with fancy indexing such as seq[[1, 2, 5, 7, 9, ...]], the data provider could implement more efficient reading than the current one-row-at-a-time fashion under the hood.
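A minimal sketch of what such fancy indexing on a user-defined sequence could look like (all names here are illustrative, not LightGBM's actual API):

```python
import abc

class Sequence(abc.ABC):
    """Illustrative interface: data providers expose row access and length."""

    @abc.abstractmethod
    def __getitem__(self, idx):
        ...

    @abc.abstractmethod
    def __len__(self):
        ...

class ListSequence(Sequence):
    """Toy provider backed by an in-memory list of rows."""

    def __init__(self, rows):
        self._rows = rows

    def __getitem__(self, idx):
        # Fancy indexing: a list of row indices is served in one call,
        # letting the provider batch its reads under the hood.
        if isinstance(idx, list):
            return [self._rows[i] for i in idx]
        return self._rows[idx]

    def __len__(self):
        return len(self._rows)

seq = ListSequence([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(seq[[0, 2]])  # two rows fetched in a single access
```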
revert is_class implementation or use pytypes
python-package/lightgbm/basic.py
Outdated
@@ -1383,8 +1439,15 @@ def _lazy_init(self, data, label=None, reference=None,
            self.__init_from_csc(data, params_str, ref_dataset)
        elif isinstance(data, np.ndarray):
            self.__init_from_np2d(data, params_str, ref_dataset)
        elif isinstance(data, list) and len(data) > 0 and all(isinstance(x, np.ndarray) for x in data):
            self.__init_from_list_np2d(data, params_str, ref_dataset)
        elif isinstance(data, Sequence):
TODO: revert is_class implementation
This would force any 3rd party lib (presumably Data Access Layer libs) to import LightGBM.
Generally we need a duck-typing implementation. typechecked and is_of_type would do the job, yet they require the new dependency pytypes.
TODO: fix unwanted memory reference
LGTM
@shiyu1994 this PR is now ready. Are you still available for reviewing?
python-package/lightgbm/basic.py
Outdated
    return indices

def init_from_sample(self, sample_data, sample_indices, sample_cnt, total_nrow):
    """Get the used parameters in the Dataset.
This description is mismatched.
Do you mean that "Get the used parameters in the Dataset." wrongly describes what this method does?
Yes, I confirm we lost track of where this summary came from. Nevertheless, we will fix it.
python-package/lightgbm/basic.py
Outdated
    return filtered, filtered_idx

def __init_from_seqs(self, seqs, params_str, ref_dataset):
Unused parameters params_str and ref_dataset. It seems that ref_dataset is ignored when constructing from seqs. Does that mean this method only supports constructing a training dataset? Because for a validation dataset, a reference training dataset is required in the Python API.
Much appreciated for making clear what ref_dataset does.

Does that mean this method only supports constructing a training dataset?

Currently, yes. However, after your explanation, it seems a validation dataset can also be supported via trivial changes.
OK. I think now that we've supported creating from seqs for training data, we should also support that for validation data. Otherwise, users can get unexpected behavior if they try to create a validation dataset from seqs, because the only difference between validation and training datasets in the Python API is whether the reference parameter in lightgbm.Dataset is provided.
I'm willing to provide any help.
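A toy sketch of why the reference matters: bin boundaries computed from the training data must be reused when binning validation rows. The binning scheme below is oversimplified and purely illustrative, not LightGBM's actual algorithm:

```python
import bisect

def compute_bin_upper_bounds(values, max_bin):
    # Simplistic equal-frequency binning sketch: pick every k-th distinct value
    # as a bin upper bound, with +inf catching everything above.
    s = sorted(set(values))
    step = max(1, len(s) // max_bin)
    return s[step - 1::step] + [float("inf")]

def bin_value(upper_bounds, v):
    # Index of the first upper bound not below v.
    return bisect.bisect_left(upper_bounds, v)

train = [0.1, 0.5, 0.9, 1.3, 2.0, 2.2]
bounds = compute_bin_upper_bounds(train, max_bin=3)

# Validation values are binned with the *training* bounds (the reference),
# not with bounds recomputed from the validation data itself.
valid = [0.2, 1.0, 5.0]
print([bin_value(bounds, v) for v in valid])
```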
python-package/lightgbm/basic.py
Outdated
    Supports random access and access by batch if properly defined by user
    """
    total_nrow = sum(len(seq) for seq in seqs)
    ncol = len(seqs[0][0])
Unused variable ncol.
This was left in for debugging purposes and really should be further extended into a data sanity check before loading.
However, a data check could impact performance, since the Sequence interface keeps the data access pattern unknown to LightGBM.
I suggest adding a new parameter for toggling the checking behavior.
Do you mean we need to check that each seq has exactly the same number of columns? Doing that here is perfectly OK and won't incur efficiency degradation.
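The check being discussed could be sketched as follows; check_ncol_consistent is a hypothetical helper that touches only the first row of each sequence, one access per provider rather than a full scan:

```python
def check_ncol_consistent(seqs):
    """Verify every sequence has the same number of columns.

    Only seqs[i][0] is accessed, so the cost is a single row read
    per provider (still not free if a provider's access is expensive).
    """
    ncol = len(seqs[0][0])
    for i, seq in enumerate(seqs[1:], start=1):
        if len(seq[0]) != ncol:
            raise ValueError(f"seq {i} has {len(seq[0])} columns, expected {ncol}")
    return ncol

# Two toy sequences (lists of rows) with a consistent column count of 3.
print(check_ncol_consistent([[[1, 2, 3]], [[4, 5, 6], [7, 8, 9]]]))  # → 3
```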
Do you mean we need to check that each seq has exactly the same number of columns?

Exactly.

Doing that here is perfectly OK and won't incur efficiency degradation.

Actually, we wouldn't know that for sure.
To exemplify, say a user decided to implement some seqs with a very heavy data-provider backend, and somehow implemented data prefetching and parallelism as their optimization. (In extreme cases, imagine data generated on the fly from server instances spawned upon our touching the data, or a literal cluster of Blu-ray discs moved from a storage shelf to disk readers for preparation @facebook @aws.)
They might trigger the generation/read of their whole seq the moment we touch the seq, since they would expect data to be accessed in an ordered fashion.
To generalize, we really cannot make assumptions about how much a single data access costs. We hope providers give the best performance for both random access and batch access; however, in reality these two naturally conflict with each other in some ways. Ensuring that batch access is indeed in order, or at least giving an option to do so, is the best we can do.
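A toy illustration of this concern, assuming a hypothetical provider whose first access triggers an expensive whole-chunk materialization (standing in for a remote fetch or on-the-fly generation):

```python
class LazyChunkSequence:
    """Hypothetical provider: any row access materializes the whole chunk,
    so even a single sanity-check probe has real cost."""

    def __init__(self, nrow, ncol):
        self._nrow, self._ncol = nrow, ncol
        self._data = None
        self.materializations = 0  # counts the "expensive fetches"

    def _materialize(self):
        if self._data is None:
            self.materializations += 1
            self._data = [[float(r * self._ncol + c) for c in range(self._ncol)]
                          for r in range(self._nrow)]

    def __getitem__(self, idx):
        self._materialize()
        return self._data[idx]

    def __len__(self):
        return self._nrow  # cheap: metadata only, no fetch

seq = LazyChunkSequence(nrow=4, ncol=2)
ncol_probe = len(seq[0])            # a column-count check alone costs a fetch
rows = [seq[i] for i in range(4)]   # in-order batch read reuses the cache
print(seq.materializations)         # → 1, but only because the chunk is cached
```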
I suggest adding a new parameter for toggling the checking behavior.

Actually, some help with this would be very welcome.
Is it possible to create a param for Python only; in particular, one that is not passed down to C++ and does not cause an "unrecognized parameter" warning there?
Yes, that is possible. We would need to pick out the parameter in Python before passing the rest to C++, which is not an elegant solution.
Also, we'd better keep the set of parameters consistent across all APIs.
removed via bbd18b9
src/io/dataset_loader.cpp
Outdated
/*
int num_col = static_cast<int>(sample_values.size());
std::vector<double*> sample_values_ptr(num_col);
Useless code should be deleted.
This is also for debugging. If you like the idea of keeping debugging code, I'd move that code into a separate function that would only be called when debugging is enabled.
src/io/dataset_loader.cpp
Outdated
@@ -615,10 +615,38 @@ Dataset* DatasetLoader::LoadFromBinFile(const char* data_filename, const char* b
  return dataset.release();
}

// To help verify whether sample data is changed when using different language bindings.
static void DumpSampleData(const std::string& sample_filename, double** sample_values,
I'm not sure whether this method needs to be included in the master branch. Is it for debugging? It seems that all calls to ConstructFromSampleData in the project pass an empty dump_filename, so this function is never actually called in the source code. And no API is provided for users to access this method.
Yes, this is for debugging. I suggest we find some acceptable way to keep this debugging code. It helped a lot in finding problems while we developed this feature.
BTW, the C API exposes the function LGBM_DatasetDumpText for debugging. How about adding an option (say dump_sample) when creating a Dataset in the Python API?
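For illustration, a sample-dump helper of the kind being discussed could look like this in Python; dump_sample_data below is a hypothetical sketch, unrelated to the actual C++ DumpSampleData:

```python
import os
import tempfile

def dump_sample_data(path, sample_values):
    # One column per line, values space-separated, so dumps produced by
    # different language bindings can be compared with a plain text diff.
    with open(path, "w") as f:
        for col in sample_values:
            f.write(" ".join(repr(v) for v in col) + "\n")

tmp = os.path.join(tempfile.mkdtemp(), "sample.txt")
dump_sample_data(tmp, [[0.1, 0.5], [1.0]])
with open(tmp) as f:
    content = f.read()
print(content)
```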
I think we'd better remove this debug code, which would increase maintenance effort. Unlike dumping a whole Dataset, dumping the sampled part used for construction is seldom needed. And I believe we can manually do so whenever necessary during development.
Thanks again for your contribution.
OK. I'll remove that debugging code this week.
@shiyu1994 Debugging code removed now. Please let me know if there's anything that should be improved.
@pytest.mark.parametrize('include_nan', [False, True])
@pytest.mark.parametrize('num_seq', [1, 3])
def test_sequence(tmpdir, sample_count, batch_size, include_0, include_nan, num_seq):
    rm_files(["seq.truth.bin", "seq.seq.bin"])
It seems that rm_files removes seq.truth.bin and seq.seq.bin, but subsequently tmpdir/seq.truth.bin and tmpdir/seq.seq.bin will be created. That's a mismatch.
And why do we need to manually remove the files created under tmpdir here, which is not done in other test cases?
why do we need to manually remove the files created under tmpdir here, which is not done in other testing cases?

We don't. tmpdir is unique per parameter.
This is indeed a leftover from previous code where tmpdir was not used.
Will fix.
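A small sketch of why per-case temporary directories remove the need for manual cleanup; tempfile stands in here for pytest's tmpdir fixture, which gives each parametrized run its own fresh directory:

```python
import os
import tempfile

def run_case(base, case_name):
    # Each case writes into its own subdirectory, so file names never
    # collide across parametrized runs and no rm_files-style cleanup is needed.
    case_dir = os.path.join(base, case_name)
    os.makedirs(case_dir, exist_ok=True)
    path = os.path.join(case_dir, "seq.seq.bin")
    with open(path, "wb") as f:
        f.write(b"\x00")
    return path

with tempfile.TemporaryDirectory() as base:
    p1 = run_case(base, "case1")
    p2 = run_case(base, "case2")
    print(p1 != p2)  # → True: distinct paths, no collision
# Everything under `base` is deleted when the context manager exits.
```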
@cyfdecyf Thanks for your great contribution! I've commented on my concerns. Overall the implementation is great, but there seem to be some unnecessary changes.
@cyfdecyf Thanks for removing the debugging code. Would you like to continue to complete this PR for the validation datasets? If not, I can help to do that.
It's planned; I will do it in 10 days.
Thanks for this, @cyfdecyf ! I left some comments related to consistency with LightGBM's existing codebase and removing unrelated changes from this PR.
I haven't reviewed the actual functional changes in the PR yet, since it seems like there is still some ongoing discussion about it (e.g. https://github.com/microsoft/LightGBM/pull/4089/files#r598454555, https://github.com/microsoft/LightGBM/pull/4089/files#r612925422, https://github.com/microsoft/LightGBM/pull/4089/files#r612950521)
.ci/test.sh
Outdated
echo "..pycodestyle"
pycodestyle --ignore=E501,W503 --exclude=./.nuget,./external_libs . || exit -1
pydocstyle --convention=numpy --add-ignore=D105 --match-dir="^(?!^external_libs|test|example).*" --match="(?!^test_|setup).*\.py" . || exit -1
echo "..pydocstyle"
pydocstyle --convention=numpy --add-ignore=D105 --match-dir='^(?!^external_libs|test|example).*' --match='(?!^test_|setup).*\.py' . || exit -1
echo "..isort"
isort . --check-only || exit -1
echo "..mypy"
Can you please remove these changes adding new echo statements and moving the order of CI checks around? They don't seem related to this PR.
fixed via 941dfce
python-package/lightgbm/basic.py
Outdated
    reduce memory usage.
    """

    __metaclass__ = abc.ABCMeta
Can you please explain what benefit this implementation gets from using abc that it couldn't achieve with just using NotImplementedError for classes that are intended to be abstract? I'd like to understand that before lightgbm takes on this new dependency.
In Dataset._lazy_init there are many isinstance calls deciding which init function to use. Using abc allows us to keep using isinstance and keep consistency there.
The initial PR did not contain the is_class method, and thus this was necessary. For now, using abc is not necessary and we can just remove it.
Just to add some more context:

Generally we need a duck-typing implementation. typechecked and is_of_type would do the job, yet they require the new dependency pytypes.

Originally posted by @Willian-Zhang in #4089 (comment)

pytypes was originally being used for the duck-typing implementation. It was removed to minimize the new dependencies added to LightGBM. is_class, along with some reordering of the isinstance checks, was added as a trade-off.
@cyfdecyf had some good reasons to adopt the ABC route, as stated in this thread, and here we are.
Currently, LightGBM tries to call is_class from this commit:
a8bc7d9#diff-5e0fb2a7b1ec4988ccde5e74a6c4b609841c1143f281da2afc85e8c464d37b3bR625-R630
to determine whether a class instance duck-types Sequence, because the ABC implementation would result in LightGBM misinterpreting a non-subclassing seq object as another type (convertible to ndarray or csr_matrix, cannot recall which).
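For context, a duck-typing check of the kind discussed above can also be expressed with abc's __subclasshook__, so third-party providers need not subclass (or even import) anything. This is only a sketch of the idea, not LightGBM's actual implementation:

```python
import abc

class Sequence(abc.ABC):
    @classmethod
    def __subclasshook__(cls, C):
        # Duck-typing: any class exposing __getitem__ and __len__ counts
        # as a Sequence, without subclassing or importing this module.
        if cls is Sequence:
            return all(hasattr(C, m) for m in ("__getitem__", "__len__"))
        return NotImplemented

class ThirdPartySeq:  # deliberately does NOT subclass Sequence
    def __getitem__(self, idx):
        return [idx]

    def __len__(self):
        return 10

print(isinstance(ThirdPartySeq(), Sequence))  # → True
```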
fixed via 591fc5e
src/io/dataset.cpp
Outdated
@@ -927,7 +927,7 @@ bool Dataset::GetIntField(const char* field_name, data_size_t* out_len,

 void Dataset::SaveBinaryFile(const char* bin_filename) {
   if (bin_filename != nullptr && std::string(bin_filename) == data_filename_) {
-    Log::Warning("Bianry file %s already exists", bin_filename);
+    Log::Warning("Binary file %s already exists", bin_filename);
Thanks for fixing this! But please remove this unrelated change and make a separate pull request for it.
In general, we have a preference for many small pull requests over a few large ones. A pull request that just changes this typo would be easy to review and merged very quickly, and then the size of the diff for this PR would be reduced.
OK. I'll remove this first.
python-package/lightgbm/basic.py
Outdated
    raise NotImplementedError("remove this line if subclassing")

    @abc.abstractmethod
    def __len__(self):  # type: () -> int
-    def __len__(self):  # type: () -> int
+    def __len__(self) -> int:
Is there a reason that you prefer to use type hint comments instead of type hints in code? As of #3581 this project no longer supports Python 2.x, so you can use type hints in code directly.
In fact, based on #3756 we have a preference for it.
Unless there is a specific reason to use comments, can you please remove them in this PR and replace them with type hints in code? For consistency with the rest of the project.
It's for Python 2.x support. It's a pleasure to drop Python 2.x support.
fixed via a2b0ec9
seqs.append(seq)
ds = lgb.Dataset(seqs, label=Y, params=params)
ds.save_binary(str(tmpdir / "seq.seq.bin"))
assert filecmp.cmp(tmpdir / "seq.truth.bin", tmpdir / "seq.seq.bin")
Can you please use os.path.join() instead of pathlib-style paths in these tests? For consistency with the rest of LightGBM's tests, for example
LightGBM/tests/python_package_test/test_sklearn.py, lines 116 to 123 in 5014f19:
X_train, y_train = load_svmlight_file(os.path.join(os.path.dirname(os.path.realpath(__file__)),
                                                   '../../examples/lambdarank/rank.train'))
X_test, y_test = load_svmlight_file(os.path.join(os.path.dirname(os.path.realpath(__file__)),
                                                 '../../examples/lambdarank/rank.test'))
q_train = np.loadtxt(os.path.join(os.path.dirname(os.path.realpath(__file__)),
                                  '../../examples/lambdarank/rank.train.query'))
q_test = np.loadtxt(os.path.join(os.path.dirname(os.path.realpath(__file__)),
                                 '../../examples/lambdarank/rank.test.query'))
If you have a preference for pathlib-style paths over constructing them with os.path, you are welcome to open a new issue proposing that and explaining why, but if it's not absolutely necessary for this pull request I'd prefer to keep new code consistent with the existing codebase. That consistency helps a lot for the small team of volunteers maintaining this project.
fixed via 60fe403
Validation dataset support is completed via 92d0b39, though it seems saved binary files for validation datasets built from the same data are not guaranteed to stay identical.
src/c_api.cpp
Outdated
auto sample_indices = CreateSampleIndices(config, total_nrow);

static_assert(sizeof(int) == 4, "int size is not 4");
What's the purpose of this assert statement?
This is useless now and has been removed.
CreateSampleIndices returned std::vector<int> at first. It now returns std::vector<int32_t> by implicit type conversion from Random.Sample's return value.
    files = [files]
for file in files:
    if os.path.exists(file):
        os.remove(file)
As mentioned before, this should be removed.
removed via 2d32b87
ds.save_binary(os.path.join(tmpdir, "seq.seq.bin"))

if create_valid:
    # TODO: verify validation dataset somehow
Add tests for the validation dataset.
Are these attributes from a validation dataset loaded from a bin file? Lines 726 to 728 in f831808
BTW, thank you all for your hard work and detailed review of this PR. Really appreciate your contribution.
Just some minor cleaning-up comments below.
Thanks a lot for taking a look! Do you think we need a feature request for this? Also, please update your review for this PR, because there have been a lot of code changes on the cpp side since your latest approval (now there are 2 new C API functions, for example).
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@StrikerRUS Thanks for the cleanup. I should have done the renaming with find and replace.
@StrikerRUS Write to #2302. Sure, I'll update my review now.
I've checked again; only a few typos to be fixed and DEFAULT_BIN_CONSTRUCT_SAMPLE_CNT to be removed.
Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>
@shiyu1994 Thanks for the review. Just addressed all the comments and pushed my changes.
@cyfdecyf @StrikerRUS Sorry, after a careful investigation, I found that the
Thank you so so much for this awesome contribution!
LGTM! I don't have any other comments. Thanks for your patience!
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Commit 8a0f07a introduced a change in line 87 of the following code: LightGBM/include/LightGBM/utils/random.h, lines 82 to 91 in 8a0f07a.
I see. But it seems that the latest version of LightGBM/include/LightGBM/utils/random.h (lines 88 to 95 in 6a195a1) differs. In the loop above, each iteration adds a new element into the set. But as #4371 pointed out, there's a bug in line 91: NextInt(0, r) should be NextInt(0, r + 1). I think we can fix this, and remove the recalculation of sample_cnt, in another PR.
@StrikerRUS @jameslamb Thank you both for your careful review. Can we merge this PR now?
Yeah, I believe we can continue (#4403 and #4089 (comment)) in follow-up PRs. @cyfdecyf Great contribution, many thanks!
Thanks for all the reviews. I'll start working on #4403 next week.
@cyfdecyf Also, #4450 has been merged, so please check #4089 (comment)
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
This PR implements creating Dataset from multiple data files.
We add a Sequence interface which data providers must implement. With this interface, we can create Dataset from List[Sequence]. Also refer this to #2789, as I think this interface allows us to use other types of efficient binary data formats. I closed #3900 because it's an ugly hack and data-loading features are much easier to implement in Python.
An example of creating Dataset from multiple HDF5 files is also given. As I mentioned in #2788, I'm more familiar with HDF5 files, so that was chosen as the example.
#2031 also mentions this feature request; I guess it is a common one.