[python-package] Allow to pass Arrow table and array as init scores #6167

Merged
merged 82 commits into from
Dec 4, 2023
Commits
ab2d5e2
Add Arrow support to Python API
borchero Jul 31, 2023
570ca64
Merge branch 'master' into arrow-support
borchero Aug 5, 2023
c21fab4
Fix lint
borchero Aug 5, 2023
2cd4302
Fix isort
borchero Aug 5, 2023
71957f6
[python-package] Allow to pass Arrow table as training data
borchero Aug 12, 2023
175fb13
Merge branch 'master' into arrow-support-training-data
borchero Aug 12, 2023
32dfb11
Remove change
borchero Aug 12, 2023
b5f0676
Implement JL comments
borchero Aug 12, 2023
cca3b37
Fix isort
borchero Aug 12, 2023
001139a
Remove testcase
borchero Aug 12, 2023
5861ca6
Adjust pyarrow version
borchero Aug 12, 2023
54d171c
Revert gitignore
borchero Aug 21, 2023
a87a15b
Fix lint
borchero Sep 5, 2023
8cda7cd
Merge branch 'master' into arrow-support-training-data
borchero Sep 5, 2023
6b4245a
Increase timeout for bdist_wheel build
borchero Sep 6, 2023
14a9326
Fix layout
borchero Sep 6, 2023
854f306
Add newline
borchero Sep 6, 2023
269582c
Fix typo
borchero Sep 11, 2023
9164040
Merge branch 'master' into arrow-support-training-data
borchero Sep 11, 2023
9a0a18d
Merge branch 'master' into arrow-support-training-data
borchero Sep 15, 2023
e5540cd
Remove arrow.py
borchero Sep 15, 2023
98997bf
Merge branch 'master' into arrow-support-training-data
jameslamb Sep 26, 2023
4a66cba
Merge branch 'master' into arrow-support-training-data
borchero Oct 12, 2023
f44421e
Fix cpp tests
borchero Oct 12, 2023
80b0aa3
Fix tests
borchero Oct 12, 2023
1869cfb
Fix omp parallel
borchero Oct 12, 2023
ba62bcc
Add missing <cmath> header
borchero Oct 12, 2023
db449e1
Fix cpplint
borchero Oct 12, 2023
3dab653
Disable arrow tests
borchero Oct 12, 2023
840cba9
Try fixing memory issue in tests
borchero Oct 13, 2023
19b210b
Try chunking in test
borchero Oct 13, 2023
059419d
Fix lint
borchero Oct 13, 2023
36e7bf4
Merge branch 'master' into arrow-support-training-data
borchero Oct 25, 2023
143a247
Implement review comments
borchero Oct 25, 2023
bb97817
Merge branch 'master' into arrow-support-training-data
jameslamb Oct 30, 2023
62431f2
Uninstall optional dependencies correctly
borchero Oct 30, 2023
34ee108
[python-package] Allow to pass Arrow array as labels
borchero Oct 30, 2023
90a2c1f
Fix lint
borchero Oct 30, 2023
6b65bcf
Fix lint
borchero Oct 30, 2023
ec33f75
WIP: [python-package] Allow to pass Arrow array as weights
borchero Oct 30, 2023
20a23b8
Fix lint
borchero Oct 30, 2023
ccdb0ba
Push
borchero Oct 30, 2023
7dbce53
Remove test
borchero Oct 30, 2023
ce69120
Merge branch 'arrow-support-weights' into arrow-support-groups
borchero Oct 30, 2023
e1593c2
Groups
borchero Oct 30, 2023
0af7a7c
[python-package] Allow to pass Arrow table as training data
borchero Oct 30, 2023
45a67a6
Merge branch 'arrow-support-training-data' into arrow-support-labels
borchero Oct 30, 2023
80c12c0
Merge branch 'arrow-support-labels' into arrow-support-weights
borchero Oct 30, 2023
221cba4
Merge branch 'arrow-support-weights' into arrow-support-groups
borchero Oct 30, 2023
15c8637
Fix isort
borchero Oct 30, 2023
b1d2071
WIP: [python-package] Allow to pass Arrow table and array as init scores
borchero Oct 30, 2023
06bdce2
Merge branch 'master' into arrow-support-labels
borchero Nov 2, 2023
75a980e
Merge branch 'arrow-support-labels' into arrow-support-weights
borchero Nov 2, 2023
a53e8bb
Merge branch 'arrow-support-weights' into arrow-support-init-scores
borchero Nov 2, 2023
3d3ffb1
Merge branch 'arrow-support-weights' into arrow-support-groups
borchero Nov 2, 2023
591fb71
Merge branch 'arrow-support-groups' into arrow-support-init-scores
borchero Nov 2, 2023
f7c67e7
Implement guolinke's review
borchero Nov 7, 2023
91fade9
Merge branch 'master' into arrow-support-labels
jameslamb Nov 7, 2023
09ad33b
Merge branch 'arrow-support-labels' into arrow-support-weights
borchero Nov 7, 2023
33f3e44
Merge branch 'master' into arrow-support-labels
borchero Nov 7, 2023
cd556da
Merge branch 'arrow-support-labels' into arrow-support-weights
borchero Nov 7, 2023
678ae7d
Use np_assert_array_equal
borchero Nov 7, 2023
5331202
Implement jameslamb's review comments
borchero Nov 8, 2023
74910d4
Merge branch 'master' into arrow-support-weights
jameslamb Nov 8, 2023
5041282
Merge branch 'master' into arrow-support-weights
jameslamb Nov 13, 2023
04f0f21
Merge branch 'arrow-support-weights' into arrow-support-groups
borchero Nov 14, 2023
5e2baa1
Fix
borchero Nov 14, 2023
ff5c9f8
Merge branch 'master' into arrow-support-groups
borchero Nov 14, 2023
0f56ea0
Fix and implement review comments
borchero Nov 15, 2023
797cc3a
Fix
borchero Nov 15, 2023
8714625
Fix test
borchero Nov 16, 2023
acd916e
Fix
borchero Nov 16, 2023
c00b841
Merge branch 'master' into arrow-support-groups
borchero Nov 22, 2023
9b07160
Add tests for empty chunks
borchero Nov 22, 2023
79d050b
Fix lint
borchero Nov 22, 2023
caf66ee
Merge branch 'arrow-support-groups' into arrow-support-init-scores
borchero Nov 22, 2023
64d082f
Merge branch 'master' into arrow-support-init-scores
borchero Nov 22, 2023
a662a80
Fix
borchero Nov 22, 2023
e965af2
Fix
borchero Nov 22, 2023
c5acae7
Stricter test
borchero Nov 22, 2023
2e00f84
Merge branch 'master' into arrow-support-init-scores
jameslamb Nov 30, 2023
997705e
Merge branch 'master' into arrow-support-init-scores
jameslamb Dec 1, 2023
Disable arrow tests
borchero committed Oct 12, 2023
commit 3dab653d4ffe3448e427a9ce9e760882570bde0e
198 changes: 99 additions & 99 deletions tests/python_package_test/test_arrow.py
@@ -1,99 +1,99 @@
# coding: utf-8
import filecmp
import tempfile
from pathlib import Path
from typing import Any, Dict

import numpy as np
import pyarrow as pa
import pytest

import lightgbm as lgb

# ----------------------------------------------------------------------------------------------- #
# UTILITIES #
# ----------------------------------------------------------------------------------------------- #


def generate_simple_arrow_table() -> pa.Table:
    columns = [
        pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.uint8()),
        pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.int8()),
        pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.uint16()),
        pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.int16()),
        pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.uint32()),
        pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.int32()),
        pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.uint64()),
        pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.int64()),
        pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.float32()),
        pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.float64()),
    ]
    return pa.Table.from_arrays(columns, names=[f"col_{i}" for i in range(len(columns))])


def generate_dummy_arrow_table() -> pa.Table:
    col1 = pa.chunked_array([[1, 2, 3], [4, 5]], type=pa.uint8())
    col2 = pa.chunked_array([[0.5, 0.6], [0.1, 0.8, 1.5]], type=pa.float32())
    return pa.Table.from_arrays([col1, col2], names=["a", "b"])


def generate_random_arrow_table(num_columns: int, num_datapoints: int, seed: int) -> pa.Table:
    columns = [generate_random_arrow_array(num_datapoints, seed + i) for i in range(num_columns)]
    names = [f"col_{i}" for i in range(num_columns)]
    return pa.Table.from_arrays(columns, names=names)


def generate_random_arrow_array(num_datapoints: int, seed: int) -> pa.ChunkedArray:
    generator = np.random.default_rng(seed)
    data = generator.standard_normal(num_datapoints)

    # Set random nulls
    indices = generator.choice(len(data), size=num_datapoints // 10)
    data[indices] = None

    # Split data into random chunks
    n_chunks = generator.integers(1, num_datapoints // 3)
    split_points = np.sort(generator.choice(np.arange(1, num_datapoints), n_chunks, replace=False))
    split_points = np.concatenate([[0], split_points, [num_datapoints]])
    chunks = [data[split_points[i] : split_points[i + 1]] for i in range(len(split_points) - 1)]
    chunks = [chunk for chunk in chunks if len(chunk) > 0]

    # Turn chunks into array
    return pa.chunked_array(chunks, type=pa.float32())


def dummy_dataset_params() -> Dict[str, Any]:
    return {
        "min_data_in_bin": 1,
        "min_data_in_leaf": 1,
    }


# ----------------------------------------------------------------------------------------------- #
# UNIT TESTS #
# ----------------------------------------------------------------------------------------------- #

# ------------------------------------------- DATASET ------------------------------------------- #


@pytest.mark.parametrize(
    ("arrow_table", "dataset_params"),
    [
        (generate_simple_arrow_table(), dummy_dataset_params()),
        (generate_dummy_arrow_table(), dummy_dataset_params()),
        (generate_random_arrow_table(3, 1000, 42), {}),
        (generate_random_arrow_table(100, 10000, 43), {}),
    ],
)
def test_dataset_construct_fuzzy(arrow_table: pa.Table, dataset_params: Dict[str, Any]):
    arrow_dataset = lgb.Dataset(arrow_table, params=dataset_params)
    arrow_dataset.construct()

    pandas_dataset = lgb.Dataset(arrow_table.to_pandas(), params=dataset_params)
    pandas_dataset.construct()

    with tempfile.TemporaryDirectory() as t:
        tmpdir = Path(t)
        arrow_dataset._dump_text(tmpdir / "arrow.txt")
        pandas_dataset._dump_text(tmpdir / "pandas.txt")
        assert filecmp.cmp(tmpdir / "arrow.txt", tmpdir / "pandas.txt")
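For readers following the chunking step in generate_random_arrow_array, here is a minimal standard-library sketch of the same idea, not the code from this PR. The name split_into_random_chunks and its parameters are hypothetical, and it uses random.Random in place of the NumPy generator. It draws distinct interior split points, adds both ends, and slices between consecutive boundaries.

```python
import random
from typing import List, Sequence


def split_into_random_chunks(data: Sequence[float], n_splits: int, seed: int) -> List[List[float]]:
    """Split `data` at `n_splits` distinct random interior positions (hypothetical helper)."""
    rng = random.Random(seed)
    # Distinct split points drawn from 1 .. len(data) - 1, sorted ascending.
    split_points = sorted(rng.sample(range(1, len(data)), n_splits))
    # Add both ends so that consecutive pairs delimit every chunk.
    boundaries = [0] + split_points + [len(data)]
    return [list(data[start:stop]) for start, stop in zip(boundaries, boundaries[1:])]


values = [float(i) for i in range(20)]
chunks = split_into_random_chunks(values, n_splits=4, seed=42)
```

Because the split points are distinct and strictly inside the data, every chunk in this sketch is non-empty and the chunks concatenate back to the input. Under that reading, the `len(chunk) > 0` filter in the helper above acts as a safety net.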
# # coding: utf-8
# import filecmp
# import tempfile
# from pathlib import Path
# from typing import Any, Dict

# import numpy as np
# import pyarrow as pa
# import pytest

# import lightgbm as lgb

# # ----------------------------------------------------------------------------------------------- #
# # UTILITIES #
# # ----------------------------------------------------------------------------------------------- #


# def generate_simple_arrow_table() -> pa.Table:
#     columns = [
#         pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.uint8()),
#         pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.int8()),
#         pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.uint16()),
#         pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.int16()),
#         pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.uint32()),
#         pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.int32()),
#         pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.uint64()),
#         pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.int64()),
#         pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.float32()),
#         pa.chunked_array([[1, 2, 3, 4, 5]], type=pa.float64()),
#     ]
#     return pa.Table.from_arrays(columns, names=[f"col_{i}" for i in range(len(columns))])


# def generate_dummy_arrow_table() -> pa.Table:
#     col1 = pa.chunked_array([[1, 2, 3], [4, 5]], type=pa.uint8())
#     col2 = pa.chunked_array([[0.5, 0.6], [0.1, 0.8, 1.5]], type=pa.float32())
#     return pa.Table.from_arrays([col1, col2], names=["a", "b"])


# def generate_random_arrow_table(num_columns: int, num_datapoints: int, seed: int) -> pa.Table:
#     columns = [generate_random_arrow_array(num_datapoints, seed + i) for i in range(num_columns)]
#     names = [f"col_{i}" for i in range(num_columns)]
#     return pa.Table.from_arrays(columns, names=names)


# def generate_random_arrow_array(num_datapoints: int, seed: int) -> pa.ChunkedArray:
#     generator = np.random.default_rng(seed)
#     data = generator.standard_normal(num_datapoints)

#     # Set random nulls
#     indices = generator.choice(len(data), size=num_datapoints // 10)
#     data[indices] = None

#     # Split data into random chunks
#     n_chunks = generator.integers(1, num_datapoints // 3)
#     split_points = np.sort(generator.choice(np.arange(1, num_datapoints), n_chunks, replace=False))
#     split_points = np.concatenate([[0], split_points, [num_datapoints]])
#     chunks = [data[split_points[i] : split_points[i + 1]] for i in range(len(split_points) - 1)]
#     chunks = [chunk for chunk in chunks if len(chunk) > 0]

#     # Turn chunks into array
#     return pa.chunked_array(chunks, type=pa.float32())


# def dummy_dataset_params() -> Dict[str, Any]:
#     return {
#         "min_data_in_bin": 1,
#         "min_data_in_leaf": 1,
#     }


# # ----------------------------------------------------------------------------------------------- #
# # UNIT TESTS #
# # ----------------------------------------------------------------------------------------------- #

# # ------------------------------------------- DATASET ------------------------------------------- #


# @pytest.mark.parametrize(
#     ("arrow_table", "dataset_params"),
#     [
#         (generate_simple_arrow_table(), dummy_dataset_params()),
#         (generate_dummy_arrow_table(), dummy_dataset_params()),
#         (generate_random_arrow_table(3, 1000, 42), {}),
#         (generate_random_arrow_table(100, 10000, 43), {}),
#     ],
# )
# def test_dataset_construct_fuzzy(arrow_table: pa.Table, dataset_params: Dict[str, Any]):
#     arrow_dataset = lgb.Dataset(arrow_table, params=dataset_params)
#     arrow_dataset.construct()

#     pandas_dataset = lgb.Dataset(arrow_table.to_pandas(), params=dataset_params)
#     pandas_dataset.construct()

#     with tempfile.TemporaryDirectory() as t:
#         tmpdir = Path(t)
#         arrow_dataset._dump_text(tmpdir / "arrow.txt")
#         pandas_dataset._dump_text(tmpdir / "pandas.txt")
#         assert filecmp.cmp(tmpdir / "arrow.txt", tmpdir / "pandas.txt")
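A note on the comparison step in test_dataset_construct_fuzzy: the test writes both datasets to text files and compares the files. The sketch below reproduces only that pattern with the standard library. It is an illustration, not code from this PR, and the helper name files_have_same_content is hypothetical. By default, filecmp.cmp uses shallow=True and treats files with identical os.stat() signatures as equal without reading them. Passing shallow=False, as assumed here, always compares contents.

```python
import filecmp
import tempfile
from pathlib import Path


def files_have_same_content(first: str, second: str) -> bool:
    """Write two strings to temporary files and compare the files by content (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as t:
        tmpdir = Path(t)
        (tmpdir / "first.txt").write_text(first)
        (tmpdir / "second.txt").write_text(second)
        # shallow=False forces a content comparison instead of trusting os.stat() signatures.
        return filecmp.cmp(tmpdir / "first.txt", tmpdir / "second.txt", shallow=False)
```

Called as files_have_same_content("a\nb\n", "a\nb\n"), the helper reports a match. Two strings of equal length but different content are reported as different, because the contents are actually read.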