[python-package] Create Dataset from multiple data files #4089
Conversation
Other potential improvements: with fancy indexing such as seq[[1, 2, 5, 7, 9, ...]], the data provider could implement more efficient reading than the current one-row-at-a-time fashion under the hood.
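A minimal sketch of what such fancy indexing on a user-defined sequence could look like (all names here are illustrative, not LightGBM's actual API):

```python
import abc

class Sequence(abc.ABC):
    """Illustrative interface: data providers expose row access and length."""

    @abc.abstractmethod
    def __getitem__(self, idx):
        ...

    @abc.abstractmethod
    def __len__(self):
        ...

class ListSequence(Sequence):
    """Toy provider backed by an in-memory list of rows."""

    def __init__(self, rows):
        self._rows = rows

    def __getitem__(self, idx):
        # Fancy indexing: a list of row indices is served in one call,
        # letting the provider batch its reads under the hood.
        if isinstance(idx, list):
            return [self._rows[i] for i in idx]
        return self._rows[idx]

    def __len__(self):
        return len(self._rows)

seq = ListSequence([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(seq[[0, 2]])  # two rows fetched in a single access
```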
revert is_class implementation or use pytypes
python-package/lightgbm/basic.py
Outdated
@@ -1383,8 +1439,15 @@ def _lazy_init(self, data, label=None, reference=None,
            self.__init_from_csc(data, params_str, ref_dataset)
        elif isinstance(data, np.ndarray):
            self.__init_from_np2d(data, params_str, ref_dataset)
        elif isinstance(data, list) and len(data) > 0 and all(isinstance(x, np.ndarray) for x in data):
            self.__init_from_list_np2d(data, params_str, ref_dataset)
        elif isinstance(data, Sequence):
TODO: revert is_class implementation
This would force any 3rd party lib (presumably Data Access Layer libs) to import LightGBM.
Generally we need a duck-typing implementation. typechecked and is_of_type would do the job, yet they require the new dependency pytypes.
TODO: fix unwanted memory reference
LGTM
@shiyu1994 this PR is now ready. Are you still available for reviewing?
python-package/lightgbm/basic.py
Outdated
    return indices

def init_from_sample(self, sample_data, sample_indices, sample_cnt, total_nrow):
    """Get the used parameters in the Dataset.
This description is mismatched.
Do you mean that "Get the used parameters in the Dataset." wrongly describes what this method does?
Yes, I confirm we lost track of where this summary came from. Nevertheless, we will fix it.
python-package/lightgbm/basic.py
Outdated
    return filtered, filtered_idx

def __init_from_seqs(self, seqs, params_str, ref_dataset):
Unused parameters params_str and ref_dataset. It seems that ref_dataset is ignored when constructing from seqs. Does that mean this method only supports constructing a training dataset? Because for a validation dataset, a reference training dataset is required in the Python API.
Much appreciated for making clear what ref_dataset does.

Does that mean this method only supports constructing a training dataset?

Currently, yes. However, after your explanation, it seems a validation dataset can also be supported via trivial changes.
OK. I think now that we've supported creating from seqs for training data, we should also support that for validation data. Otherwise, users can get unexpected behavior if they try to create a validation dataset from seqs, because the only difference between validation and training datasets in the Python API is whether the reference parameter in lightgbm.Dataset is provided.
I'm willing to provide any help.
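A toy sketch of why the reference matters: bin boundaries computed from the training data must be reused when binning validation rows. The binning scheme below is oversimplified and purely illustrative, not LightGBM's actual algorithm:

```python
import bisect

def compute_bin_upper_bounds(values, max_bin):
    # Simplistic equal-frequency binning sketch: pick every k-th distinct value
    # as a bin upper bound, with +inf catching everything above.
    s = sorted(set(values))
    step = max(1, len(s) // max_bin)
    return s[step - 1::step] + [float("inf")]

def bin_value(upper_bounds, v):
    # Index of the first upper bound not below v.
    return bisect.bisect_left(upper_bounds, v)

train = [0.1, 0.5, 0.9, 1.3, 2.0, 2.2]
bounds = compute_bin_upper_bounds(train, max_bin=3)

# Validation values are binned with the *training* bounds (the reference),
# not with bounds recomputed from the validation data itself.
valid = [0.2, 1.0, 5.0]
print([bin_value(bounds, v) for v in valid])
```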
python-package/lightgbm/basic.py
Outdated
    Supports random access and access by batch if properly defined by user
    """
    total_nrow = sum(len(seq) for seq in seqs)
    ncol = len(seqs[0][0])
Unused variable ncol.
This was left in for debugging purposes and really should be further extended into a data sanity check before loading.
However, a data check could impact performance, since the Sequence interface keeps the data access pattern unknown to LightGBM.
I suggest adding a new parameter for toggling the checking behavior.
Do you mean we need to check that each seq has exactly the same number of columns? Doing that here is perfectly OK and won't incur efficiency degradation.
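The check being discussed could be sketched as follows; check_ncol_consistent is a hypothetical helper that touches only the first row of each sequence, one access per provider rather than a full scan:

```python
def check_ncol_consistent(seqs):
    """Verify every sequence has the same number of columns.

    Only seqs[i][0] is accessed, so the cost is a single row read
    per provider (still not free if a provider's access is expensive).
    """
    ncol = len(seqs[0][0])
    for i, seq in enumerate(seqs[1:], start=1):
        if len(seq[0]) != ncol:
            raise ValueError(f"seq {i} has {len(seq[0])} columns, expected {ncol}")
    return ncol

# Two toy sequences (lists of rows) with a consistent column count of 3.
print(check_ncol_consistent([[[1, 2, 3]], [[4, 5, 6], [7, 8, 9]]]))  # → 3
```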
Do you mean we need to check that each seq has exactly the same number of columns?

Exactly.

Doing that here is perfectly OK and won't incur efficiency degradation.

Actually, we wouldn't know that for sure.
To exemplify, say a user decided to implement some seqs with a very heavy data-provider backend, and somehow implemented data prefetching and parallelism as their optimization. (In extreme cases, imagine data generated on the fly from server instances spawned upon our touching the data, or a literal cluster of Blu-ray discs moved from a storage shelf to disk readers for preparation @facebook @aws.)
They might trigger the generation/read of their whole seq the moment we touch the seq, since they would expect data to be accessed in an ordered fashion.
To generalize, we really cannot make assumptions about how much a single data access costs. We hope providers give the best performance for both random access and batch access; however, in reality these two naturally conflict with each other in some ways. Ensuring that batch access is indeed in order, or at least giving an option to do so, is the best we can do.
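A toy illustration of this concern, assuming a hypothetical provider whose first access triggers an expensive whole-chunk materialization (standing in for a remote fetch or on-the-fly generation):

```python
class LazyChunkSequence:
    """Hypothetical provider: any row access materializes the whole chunk,
    so even a single sanity-check probe has real cost."""

    def __init__(self, nrow, ncol):
        self._nrow, self._ncol = nrow, ncol
        self._data = None
        self.materializations = 0  # counts the "expensive fetches"

    def _materialize(self):
        if self._data is None:
            self.materializations += 1
            self._data = [[float(r * self._ncol + c) for c in range(self._ncol)]
                          for r in range(self._nrow)]

    def __getitem__(self, idx):
        self._materialize()
        return self._data[idx]

    def __len__(self):
        return self._nrow  # cheap: metadata only, no fetch

seq = LazyChunkSequence(nrow=4, ncol=2)
ncol_probe = len(seq[0])            # a column-count check alone costs a fetch
rows = [seq[i] for i in range(4)]   # in-order batch read reuses the cache
print(seq.materializations)         # → 1, but only because the chunk is cached
```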
I suggest adding a new parameter for toggling the checking behavior.

Actually, some help with this would be very welcome.
Is it possible to create a param for Python only; in particular, one that is not passed down to C++ and does not cause an "unrecognized parameter" warning there?
Yes, that is possible. We would need to pick out the parameter in Python before passing the rest to C++, which is not an elegant solution.
Also, we'd better keep the set of parameters consistent across all APIs.
removed via bbd18b9
src/io/dataset_loader.cpp
Outdated
/*
int num_col = static_cast<int>(sample_values.size());
std::vector<double*> sample_values_ptr(num_col);
Useless code should be deleted.
This is also for debugging. If you like the idea of keeping debugging code, I'd move that code into a separate function that would only be called when debugging is enabled.
src/io/dataset_loader.cpp
Outdated
@@ -615,10 +615,38 @@ Dataset* DatasetLoader::LoadFromBinFile(const char* data_filename, const char* b
  return dataset.release();
}

// To help verify whether sample data is changed when using different language bindings.
static void DumpSampleData(const std::string& sample_filename, double** sample_values,
I'm not sure whether this method needs to be included in the master branch. Is it for debugging? It seems that all calls to ConstructFromSampleData in the project pass an empty dump_filename, so this function is never actually called in the source code. And no API is provided for users to access this method.
Yes, this is for debugging. I suggest we find some acceptable way to keep this debugging code. It helped a lot in finding problems while we developed this feature.
BTW, the C API exposes the function LGBM_DatasetDumpText for debugging. How about adding an option (say dump_sample) when creating a Dataset in the Python API?
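For illustration, a sample-dump helper of the kind being discussed could look like this in Python; dump_sample_data below is a hypothetical sketch, unrelated to the actual C++ DumpSampleData:

```python
import os
import tempfile

def dump_sample_data(path, sample_values):
    # One column per line, values space-separated, so dumps produced by
    # different language bindings can be compared with a plain text diff.
    with open(path, "w") as f:
        for col in sample_values:
            f.write(" ".join(repr(v) for v in col) + "\n")

tmp = os.path.join(tempfile.mkdtemp(), "sample.txt")
dump_sample_data(tmp, [[0.1, 0.5], [1.0]])
with open(tmp) as f:
    content = f.read()
print(content)
```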
I think we'd better remove this debug code, which would increase maintenance effort. Unlike dumping a whole Dataset, dumping the sampled part used for construction is seldom needed. And I believe we can manually do so whenever necessary during development.
Thanks again for your contribution.
OK. I'll remove that debugging code this week.
@shiyu1994 Debugging code removed now. Please let me know if there's anything that should be improved.
@pytest.mark.parametrize('include_nan', [False, True])
@pytest.mark.parametrize('num_seq', [1, 3])
def test_sequence(tmpdir, sample_count, batch_size, include_0, include_nan, num_seq):
    rm_files(["seq.truth.bin", "seq.seq.bin"])
It seems that rm_files removes seq.truth.bin and seq.seq.bin, but subsequently tmpdir/seq.truth.bin and tmpdir/seq.seq.bin will be created. That's a mismatch.
And why do we need to manually remove the files created under tmpdir here, which is not done in other test cases?
why do we need to manually remove the files created under tmpdir here, which is not done in other testing cases?

We don't. tmpdir is unique per parameter.
This is indeed a leftover from previous code where tmpdir was not used.
Will fix.
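A small sketch of why per-case temporary directories remove the need for manual cleanup; tempfile stands in here for pytest's tmpdir fixture, which gives each parametrized run its own fresh directory:

```python
import os
import tempfile

def run_case(base, case_name):
    # Each case writes into its own subdirectory, so file names never
    # collide across parametrized runs and no rm_files-style cleanup is needed.
    case_dir = os.path.join(base, case_name)
    os.makedirs(case_dir, exist_ok=True)
    path = os.path.join(case_dir, "seq.seq.bin")
    with open(path, "wb") as f:
        f.write(b"\x00")
    return path

with tempfile.TemporaryDirectory() as base:
    p1 = run_case(base, "case1")
    p2 = run_case(base, "case2")
    print(p1 != p2)  # → True: distinct paths, no collision
# Everything under `base` is deleted when the context manager exits.
```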
@cyfdecyf Thanks for your great contribution! I've commented on my concerns. Overall the implementation is great, but there seem to be some unnecessary changes.
@cyfdecyf Thanks for removing the debugging code. Would you like to continue to complete this PR for the validation datasets? If not, I can help to do that.
It's planned; I will do it in 10 days.
Thanks for this, @cyfdecyf ! I left some comments related to consistency with LightGBM's existing codebase and removing unrelated changes from this PR.
I haven't reviewed the actual functional changes in the PR yet, since it seems like there is still some ongoing discussion about it (e.g. https://github.com/microsoft/LightGBM/pull/4089/files#r598454555, https://github.com/microsoft/LightGBM/pull/4089/files#r612925422, https://github.com/microsoft/LightGBM/pull/4089/files#r612950521)
.ci/test.sh
Outdated
echo "..pycodestyle"
pycodestyle --ignore=E501,W503 --exclude=./.nuget,./external_libs . || exit -1
pydocstyle --convention=numpy --add-ignore=D105 --match-dir="^(?!^external_libs|test|example).*" --match="(?!^test_|setup).*\.py" . || exit -1
echo "..pydocstyle"
pydocstyle --convention=numpy --add-ignore=D105 --match-dir='^(?!^external_libs|test|example).*' --match='(?!^test_|setup).*\.py' . || exit -1
echo "..isort"
isort . --check-only || exit -1
echo "..mypy"
Can you please remove these changes adding new echo statements and moving the order of CI checks around? They don't seem related to this PR.
fixed via 941dfce
python-package/lightgbm/basic.py
Outdated
    reduce memory usage.
    """

    __metaclass__ = abc.ABCMeta
Can you please explain what benefit this implementation gets from using abc that it couldn't achieve with just using NotImplementedError for classes that are intended to be abstract? I'd like to understand that before lightgbm takes on this new dependency.
In Dataset._lazy_init there are many isinstance calls deciding which init function to use. Using abc allows us to keep using isinstance and keep consistency there.
The initial PR did not contain the is_class method, and thus this was necessary. For now, using abc is not necessary and we can just remove it.
Just to add some more context:

Generally we need a duck-typing implementation. typechecked and is_of_type would do the job, yet they require the new dependency pytypes.

Originally posted by @Willian-Zhang in #4089 (comment)

pytypes was originally being used for the duck-typing implementation. It was removed to minimize the new dependencies added to LightGBM. is_class, along with some reordering of the isinstance checks, was added as a trade-off.
@cyfdecyf had some good reasons to adopt the ABC route, as stated in this thread, and here we are.
Currently, LightGBM tries to call is_class from this commit:
a8bc7d9#diff-5e0fb2a7b1ec4988ccde5e74a6c4b609841c1143f281da2afc85e8c464d37b3bR625-R630
to determine whether a class instance duck-types Sequence, because the ABC implementation would result in LightGBM misinterpreting a non-subclassing seq object as another type (convertible to ndarray or csr_matrix, cannot recall which).
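For context, a duck-typing check of the kind discussed above can also be expressed with abc's __subclasshook__, so third-party providers need not subclass (or even import) anything. This is only a sketch of the idea, not LightGBM's actual implementation:

```python
import abc

class Sequence(abc.ABC):
    @classmethod
    def __subclasshook__(cls, C):
        # Duck-typing: any class exposing __getitem__ and __len__ counts
        # as a Sequence, without subclassing or importing this module.
        if cls is Sequence:
            return all(hasattr(C, m) for m in ("__getitem__", "__len__"))
        return NotImplemented

class ThirdPartySeq:  # deliberately does NOT subclass Sequence
    def __getitem__(self, idx):
        return [idx]

    def __len__(self):
        return 10

print(isinstance(ThirdPartySeq(), Sequence))  # → True
```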
fixed via 591fc5e
src/io/dataset.cpp
Outdated
@@ -927,7 +927,7 @@ bool Dataset::GetIntField(const char* field_name, data_size_t* out_len,

 void Dataset::SaveBinaryFile(const char* bin_filename) {
   if (bin_filename != nullptr && std::string(bin_filename) == data_filename_) {
-    Log::Warning("Bianry file %s already exists", bin_filename);
+    Log::Warning("Binary file %s already exists", bin_filename);
Thanks for fixing this! But please remove this unrelated change and make a separate pull request for it.
In general, we have a preference for many small pull requests over a few large ones. A pull request that just changes this typo would be easy to review and merged very quickly, and then the size of the diff for this PR would be reduced.
OK. I'll remove this first.
python-package/lightgbm/basic.py
Outdated
    raise NotImplementedError("remove this line if subclassing")

    @abc.abstractmethod
    def __len__(self):  # type: () -> int
-    def __len__(self):  # type: () -> int
+    def __len__(self) -> int:
Is there a reason that you prefer to use type hint comments instead of type hints in code? As of #3581 this project no longer supports Python 2.x, so you can use type hints in code directly.
In fact, based on #3756 we have a preference for it.
Unless there is a specific reason to use comments, can you please remove them in this PR and replace them with type hints in code? For consistency with the rest of the project.
It's for Python 2.x support. It's a pleasure to drop Python 2.x support.
fixed via a2b0ec9
seqs.append(seq)
ds = lgb.Dataset(seqs, label=Y, params=params)
ds.save_binary(str(tmpdir / "seq.seq.bin"))
assert filecmp.cmp(tmpdir / "seq.truth.bin", tmpdir / "seq.seq.bin")
Can you please use os.path.join() instead of pathlib-style paths in these tests? For consistency with the rest of LightGBM's tests, for example
LightGBM/tests/python_package_test/test_sklearn.py, lines 116 to 123 in 5014f19:
X_train, y_train = load_svmlight_file(os.path.join(os.path.dirname(os.path.realpath(__file__)),
                                                   '../../examples/lambdarank/rank.train'))
X_test, y_test = load_svmlight_file(os.path.join(os.path.dirname(os.path.realpath(__file__)),
                                                 '../../examples/lambdarank/rank.test'))
q_train = np.loadtxt(os.path.join(os.path.dirname(os.path.realpath(__file__)),
                                  '../../examples/lambdarank/rank.train.query'))
q_test = np.loadtxt(os.path.join(os.path.dirname(os.path.realpath(__file__)),
                                 '../../examples/lambdarank/rank.test.query'))
If you have a preference for pathlib-style paths over constructing them with os.path, you are welcome to open a new issue proposing that and explaining why, but if it's not absolutely necessary for this pull request I'd prefer to keep new code consistent with the existing codebase. That consistency helps a lot for the small team of volunteers maintaining this project.
fixed via 60fe403
Validation dataset support is completed via 92d0b39, though it seems saved binary files for validation datasets built from the same data are not guaranteed to stay identical.
src/c_api.cpp
Outdated
auto sample_indices = CreateSampleIndices(config, total_nrow);

static_assert(sizeof(int) == 4, "int size is not 4");
What's the purpose of this assert statement?
This is useless now and has been removed.
CreateSampleIndices returned std::vector<int> at first. It now returns std::vector<int32_t> by implicit type conversion from Random.Sample's return value.
    files = [files]
for file in files:
    if os.path.exists(file):
        os.remove(file)
As mentioned before, this should be removed.
removed via 2d32b87
ds.save_binary(os.path.join(tmpdir, "seq.seq.bin"))

if create_valid:
    # TODO: verify validation dataset somehow
Add tests for the validation dataset.
Are these attributes from a validation dataset loaded from a bin file? Lines 726 to 728 in f831808
BTW, thank you all for your hard work and detailed review of this PR. Really appreciate your contribution.
Just some minor cleaning-up comments below.
Thanks a lot for taking a look! Do you think we need a feature request for this? Also, please update your review for this PR, because there have been a lot of code changes on the cpp side since your latest approval (now there are 2 new C API functions, for example).
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@StrikerRUS Thanks for the cleanup. I should have done the renaming with find and replace.
@StrikerRUS Write to #2302. Sure, I'll update my review now.
I've checked again; only a few typos to be fixed and DEFAULT_BIN_CONSTRUCT_SAMPLE_CNT to be removed.
Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>
@shiyu1994 Thanks for the review. Just addressed all the comments and pushed my changes.
@cyfdecyf @StrikerRUS Sorry, after a careful investigation, I found that the
Thank you so so much for this awesome contribution!
LGTM! I don't have any other comments. Thanks for your patience!
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Commit 8a0f07a introduced a change in line 87 of the following code: LightGBM/include/LightGBM/utils/random.h, lines 82 to 91 in 8a0f07a.
I see. But it seems that the latest version of LightGBM/include/LightGBM/utils/random.h (lines 88 to 95 in 6a195a1) differs. In the loop above, each iteration adds a new element into the set. But as #4371 pointed out, there's a bug in line 91: NextInt(0, r) should be NextInt(0, r + 1). I think we can fix this, and remove the recalculation of sample_cnt, in another PR.
@StrikerRUS @jameslamb Thank you both for your careful review. Can we merge this PR now?
Yeah, I believe we can continue (#4403 and #4089 (comment)) in follow-up PRs. @cyfdecyf Great contribution, many thanks!
Thanks for all the reviews. I'll start working on #4403 next week.
@cyfdecyf Also, #4450 has been merged, so please check #4089 (comment)
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
This PR implements creating Dataset from multiple data files.
We add a Sequence interface which data providers must implement. With this interface, we can create Dataset from List[Sequence]. Also refer this to #2789, as I think this interface allows us to use other types of efficient binary data formats. I closed #3900 because it's an ugly hack and data-loading features are much easier to implement in Python.
An example of creating Dataset from multiple HDF5 files is also given. As I mentioned in #2788, I'm more familiar with HDF5 files, so that was chosen as the example.
#2031 also mentions this feature request; I guess it is a common one.