Features/340 gaussian nb #474

Merged
merged 232 commits into master from features/340-GaussianNB on Mar 3, 2020
Commits (232) · changes from all commits
3a70b0c
added workflow file
Dec 4, 2019
d849594
changed branches name
Dec 4, 2019
6cb1dc8
changed branches name
Dec 4, 2019
3f75147
build always on push
Dec 4, 2019
96a7bb4
changed to ubuntu and dependencies are now installed properly
Dec 4, 2019
85686f6
changed workflow to run on docker image
Dec 4, 2019
f56e8f3
using docker image properly
Dec 4, 2019
b364a26
changed to always use bash
Dec 4, 2019
68f2cf9
added debug code
Dec 4, 2019
9ec0f10
added earlier debug code
Dec 4, 2019
14f39e1
changed all to use /bin/bash
Dec 4, 2019
4c58b35
changed all to use sudo
Dec 4, 2019
cbe4375
running in a container now
Dec 4, 2019
1990539
still using /bin/bash
Dec 4, 2019
a27842f
removed sudo
Dec 4, 2019
ae82b27
added debug information for shell
Dec 5, 2019
3c43e35
added more shell debugging
Dec 5, 2019
81d44b0
fixed typo
Dec 5, 2019
8d2b2f4
added debug for standard shell
Dec 5, 2019
a0dae39
fixed more typos
Dec 5, 2019
b60b745
removed hostnamectl
Dec 5, 2019
54d5fc8
changed way os is fetched
Dec 5, 2019
29032a5
added module debug info
Dec 5, 2019
4dce1f9
moved debug code into bin/bash
Dec 5, 2019
b076661
removed loading of bashrc
Dec 5, 2019
2098a73
added installation to workflow
Dec 5, 2019
6ff3816
removed sudo
Dec 5, 2019
bb2d2ac
added more capabilities to docker
Dec 5, 2019
3370de7
changed options syntax
Dec 5, 2019
a573c3a
using different image
Dec 5, 2019
2b52001
using new image
Dec 5, 2019
ea3e932
removed typo
Dec 5, 2019
9e0ff24
changed debug information
Dec 5, 2019
014d2fd
added new no docker workflow
Dec 6, 2019
1cc73ba
fixed type
Dec 6, 2019
dc91953
added version info to checkout action
Dec 6, 2019
e66aaaa
added sudo
Dec 6, 2019
3c88f01
changed libopenmpi version
Dec 6, 2019
dd0e5bc
added python setup and testing stages
Dec 6, 2019
4c7e878
moved to python 3
Dec 6, 2019
b618eb1
added venv dependency
Dec 6, 2019
0c6e3e1
added y flag to work in automated shell
Dec 6, 2019
088b484
fixed typo and added correct location for test run
Dec 6, 2019
e068b65
added some debug code to find current location
Dec 6, 2019
961f881
removed folder switch
Dec 6, 2019
3e81fe1
fixed typo
Dec 6, 2019
32bf0ef
added pytest dependency
Dec 6, 2019
a2f43aa
changed run stage to activate virutalenv
Dec 6, 2019
62b097b
added more test stages
Dec 6, 2019
61c7242
added virtalenv activation for coverage creation
Dec 6, 2019
3f42f0f
added dev dependency
Dec 6, 2019
bbf313e
create new workflow using a ubuntu image
Dec 6, 2019
06f663c
fixed workflow syntax
Dec 6, 2019
f59ce5b
fixed type
Dec 6, 2019
8332eb8
removed version info
Dec 6, 2019
8eeae46
fixed spelling error
Dec 6, 2019
0f552a3
moved container usage
Dec 6, 2019
77f956b
added virtualenv steps
Dec 6, 2019
96d2d3e
checking home path
Dec 6, 2019
aecae36
adapted activate path
Dec 6, 2019
e526fe9
adding pip check
Dec 6, 2019
836b744
added shell checks
Dec 6, 2019
456027c
added code to install python packages and run tests
Dec 6, 2019
dec13af
fixed typo
Dec 6, 2019
7edde39
cleaned up code and added coverage combine steps
Dec 6, 2019
f645257
added bash in docker file
Dec 10, 2019
ba4403f
removed bash from docker due to not working properly
Dec 10, 2019
793eea9
removed bash from docker due to not working properly
Dec 10, 2019
02949a5
added fedora step to workflow
Dec 10, 2019
841a2dc
fixed type
Dec 10, 2019
f6db040
fixed another typo
Dec 10, 2019
2587222
changing ubuntu version
Dec 10, 2019
c7d7029
added volume
Dec 10, 2019
4d3c53e
sourcing module now
Dec 10, 2019
6595c3d
added options to docker
Dec 10, 2019
ea9e6fa
activating modules
Dec 10, 2019
597d0ec
moved module activation to docker file
Dec 10, 2019
e3a46af
removed volumes and added number of cpus
Dec 10, 2019
5914121
added oversubscribe to allow more than 2 parallel jobs
Dec 10, 2019
5da6f3c
fixed typo
Dec 10, 2019
e53dc5e
created build matrix
Dec 10, 2019
ca63124
fixed build matrix
Dec 10, 2019
2e1e7dc
added number of processes to build matrix
Dec 10, 2019
e1b9370
added codecv action
Dec 10, 2019
9212f0f
added pre commit execution
Dec 10, 2019
7a8d228
fixed action import
Dec 10, 2019
eab67e9
fixed missing dollar sign
Dec 10, 2019
f26e4ba
removed wrong dollar sign
Dec 10, 2019
17cec3d
added git installation
Dec 10, 2019
58cf0ca
Merge branch 'master' into features/424-github-actions
TheSlimvReal Dec 10, 2019
139dca8
Merge branch 'master' into features/424-github-actions
coquelin77 Dec 12, 2019
64f3147
Merge branch 'master' into features/424-github-actions
coquelin77 Dec 17, 2019
aba2b80
Merge branch 'master' into features/424-github-actions
coquelin77 Dec 18, 2019
696885e
codecov fails if an error occurs in this step
Dec 19, 2019
a27c24b
updated docker image
Dec 19, 2019
ad1f01a
added deploy workflow
Dec 19, 2019
b38d0a5
Merge branch 'features/424-github-actions' of https://github.com/helm…
Dec 19, 2019
4b8875d
updated docker images with wheel
Dec 19, 2019
725806f
added stage to build python package
Dec 19, 2019
8897d2b
added docker image to publish workflow
Dec 19, 2019
ee20c60
undone wheel dependency
Dec 19, 2019
0ebb316
changed workflow to run without docker image
Dec 19, 2019
1de5eb9
changed name
Dec 19, 2019
9a8fe32
fixed workflow file
Dec 19, 2019
c939a7c
added correct versions to actions
Dec 19, 2019
b3b76a6
fixed action paths
Dec 19, 2019
dec2fa1
fixed the token for pypi
Dec 19, 2019
fbdcd8f
Merge branch 'master' into features/424-github-actions
TheSlimvReal Dec 19, 2019
d9e6648
Merge remote-tracking branch 'origin/master' into features/424-github…
Dec 19, 2019
83462ad
removed travis file
Dec 19, 2019
183334f
Introducing test failure to check GitHub Workflows behaviour
ClaudiaComito Jan 9, 2020
72bfd89
Formatting
ClaudiaComito Jan 9, 2020
778ff73
More formatting and blake pickiness
ClaudiaComito Jan 9, 2020
5d5efb3
Next attempt at failing
ClaudiaComito Jan 9, 2020
d1db24e
Undoing all failed failure attempts and bringing code to original state
ClaudiaComito Jan 9, 2020
b4fab82
Adding new submodule
ClaudiaComito Jan 9, 2020
2f3fb66
Adding new class
ClaudiaComito Jan 9, 2020
3f90f02
GaussianNB first pass, replaced numpy with heat calls, added input sa…
ClaudiaComito Jan 16, 2020
ca33cd3
Removed unused call to sklearn _check_X
ClaudiaComito Jan 16, 2020
c7896b9
Replaced call to sklearn.utils.validation.check_X_y with basic shape …
ClaudiaComito Jan 16, 2020
4ca18d8
Merge branch 'master' into features/340-GaussianNB
ClaudiaComito Jan 16, 2020
cd55d73
Merge branch 'master' into features/340-GaussianNB
ClaudiaComito Jan 16, 2020
17551db
Added basic checks for sample_weight
ClaudiaComito Jan 16, 2020
57c96df
Added _check_partial_fit_first_call as a staticmethod for now
ClaudiaComito Jan 16, 2020
50f36a2
Removed obsolete comment
ClaudiaComito Jan 16, 2020
ae84b5b
Added _BaseNB class from scikit_learn
ClaudiaComito Jan 16, 2020
6cb88e8
Merge branch 'master' into features/340-GaussianNB
ClaudiaComito Jan 16, 2020
3dc93b6
Merge branch 'master' into features/340-GaussianNB
ClaudiaComito Jan 23, 2020
5459a19
Moved relevant _BaseNB methods to GaussianNB class, removed _BaseNB f…
ClaudiaComito Jan 23, 2020
0f66a66
Integrate heat.core.naive_bayes in set up.
ClaudiaComito Jan 23, 2020
e1e8542
Formatting
ClaudiaComito Jan 23, 2020
882ba56
Removing sklearn-specific validation calls for now.
ClaudiaComito Jan 23, 2020
04d41c2
Replacing np.unique/np.sort calls with ht.unique(sorted=True) calls
ClaudiaComito Jan 23, 2020
a770dec
Merge branch 'master' into features/340-GaussianNB
ClaudiaComito Jan 24, 2020
1a7e70d
Replacing np.in1d call with ht.eq, equivalent in this context.
ClaudiaComito Jan 25, 2020
4affd6d
_partial_fit(), temporary (hacky) replacement for np.searchsorted.
ClaudiaComito Jan 25, 2020
293a54d
Improved temporary searchsorted, debugging
ClaudiaComito Jan 26, 2020
b4cab25
Adapted joint_log_likelihood to absence of append() for torch/heat te…
ClaudiaComito Jan 26, 2020
1198f05
Adapted predict() to ht.argmax returning a heat tensor
ClaudiaComito Jan 26, 2020
14ffcfc
Implemented GaussianNB.logsumexp() (hacky early version)
ClaudiaComito Jan 26, 2020
8ea2914
Merge branch 'master' into features/340-GaussianNB
ClaudiaComito Jan 26, 2020
3456ae0
Merge branch 'master' into features/340-GaussianNB
ClaudiaComito Jan 29, 2020
c33a2a3
Modified _joint_log_likelihood to not rely on append(). No need to tr…
ClaudiaComito Feb 1, 2020
bf20bfe
logsumexp fixes
ClaudiaComito Feb 2, 2020
2bb84c5
Removed print/debugging statements
ClaudiaComito Feb 3, 2020
2638a78
formatting
ClaudiaComito Feb 3, 2020
297423a
Fixed mistake in shape of joint_log_likelihood tensor
ClaudiaComito Feb 3, 2020
80c22fb
Implementing test_gaussiannb(). First pass, test locally.
ClaudiaComito Feb 3, 2020
9d61505
Tidying up comments and #TODOs
ClaudiaComito Feb 3, 2020
7ac033a
Implemented testing of distributed GaussianNB (fails).
ClaudiaComito Feb 3, 2020
3c2f4e1
Merge branch 'master' into features/340-GaussianNB
ClaudiaComito Feb 3, 2020
a7a80db
Distributed __getitem__ now returns tensor of values at LIST of indic…
ClaudiaComito Feb 5, 2020
3911a12
Test for distributed case where data and labels are split along axis 0.
ClaudiaComito Feb 5, 2020
374d460
Enforce split=None for list of unique labels for now.
ClaudiaComito Feb 5, 2020
b21204f
Updated changelog
ClaudiaComito Feb 5, 2020
7706348
Adding back .travis.yml
ClaudiaComito Feb 6, 2020
731d99f
Removed outdated comment line
ClaudiaComito Feb 6, 2020
a818ec2
Refined conditional statements
ClaudiaComito Feb 6, 2020
bfd3244
formatting
ClaudiaComito Feb 6, 2020
daed150
Renamed internal functions according to HeAT convention (starting wit…
ClaudiaComito Feb 6, 2020
e3f9dbd
Testing gnb predictions vs. test labels
ClaudiaComito Feb 6, 2020
1d5a30d
Resolved flake8 conflicts
ClaudiaComito Feb 6, 2020
73e5bb9
More flake-iness
ClaudiaComito Feb 6, 2020
58bf7cd
Line breaks after """
ClaudiaComito Feb 7, 2020
ac4436a
Docs rewording
ClaudiaComito Feb 7, 2020
5bc63ff
In-place resplitting
ClaudiaComito Feb 7, 2020
1a0ee4d
Removed confusing reference to scikit-learn version 0.17
ClaudiaComito Feb 7, 2020
552119d
__check_partial_fit_first_call(clf, ...) --> __check_partial_fit_firs…
ClaudiaComito Feb 7, 2020
53e4c67
Reference to #351, added pointer to topic
ClaudiaComito Feb 7, 2020
2580f43
Formatting error messages
ClaudiaComito Feb 7, 2020
6da348d
Added dtype, device calls where missing
ClaudiaComito Feb 7, 2020
045cec6
Added references to Issue #468, sanitation TODOs
ClaudiaComito Feb 7, 2020
36ab57c
Added missing line breaks
ClaudiaComito Feb 7, 2020
058ceec
Added missing line breaks
ClaudiaComito Feb 7, 2020
ba138ef
Added docs for __joint_log_likelihood
ClaudiaComito Feb 7, 2020
e0a2587
Merge branch 'master' into features/340-GaussianNB
ClaudiaComito Feb 7, 2020
f84beef
Move naive_bayes one level up
ClaudiaComito Feb 8, 2020
f523587
Removed heat/core/naive_bayes
ClaudiaComito Feb 8, 2020
7d1412d
Updated import heat.core.naive_bayes --> heat.naive_bayes
ClaudiaComito Feb 8, 2020
21f192e
Fixing flake8 "E402 module level import not at top of file" complaint
ClaudiaComito Feb 8, 2020
00d6123
Removed -quiet option from pip install.
ClaudiaComito Feb 10, 2020
b356fea
Import version after sys.path.append("./heat/core") in spite of flake…
ClaudiaComito Feb 10, 2020
535e397
Building problems. Setting --progress-bar off for pip install
ClaudiaComito Feb 10, 2020
20fe2d5
Uploading standard sklearn train/test iris data.
ClaudiaComito Feb 10, 2020
4388207
Rewrote test_gaussiannb to compare to sklearn gnb() without importing…
ClaudiaComito Feb 10, 2020
5e56fc1
Bypassing test with 7 procs for now as test dataset is too small
ClaudiaComito Feb 10, 2020
0ba4a73
Setting pip install back to -quiet
ClaudiaComito Feb 10, 2020
2fccd8f
Increasing test coverage. Testing exceptions
ClaudiaComito Feb 10, 2020
266c091
Switching pip install out of quiet mode again
ClaudiaComito Feb 10, 2020
d423142
Added checks for sample_weight = 0 locally
ClaudiaComito Feb 10, 2020
8a80683
Added test case for sample_weight not None (split=None)
ClaudiaComito Feb 10, 2020
56713c8
__getitem__ call on sample_weight
ClaudiaComito Feb 11, 2020
78532f7
testing GaussianNB when sample_weight is not None, sample_weight not …
ClaudiaComito Feb 11, 2020
0a37705
nonzero now returns 1-D tensor if input is 1-D
ClaudiaComito Feb 17, 2020
f2c0404
Modified __getitem__ for case in which the only input (key) is a list…
ClaudiaComito Feb 19, 2020
310a7c2
ht.average(), improved weighted average when weights are 1-D and inpu…
ClaudiaComito Feb 19, 2020
e16e118
Adapted test_average() to changes in statistics.average
ClaudiaComito Feb 19, 2020
1b16fce
Small change to __getitem__ call
ClaudiaComito Feb 19, 2020
a956f25
Added test for GaussianNB with weights, both local and distributed
ClaudiaComito Feb 19, 2020
7399eb5
Updated test_average()
ClaudiaComito Feb 19, 2020
061fc27
Skipping test on 7 nodes for now. Cf. #490
ClaudiaComito Feb 19, 2020
f8e5d78
Extending test coverage: weighted average with 3d weights.
ClaudiaComito Feb 20, 2020
95de460
ht.average(), NotImplementedError if weights.split != x.split until #…
ClaudiaComito Feb 20, 2020
e14e8b0
test_average(), test NotImplementedError when weights.split != x.split
ClaudiaComito Feb 20, 2020
9fa985d
Extending test coverage: testing exceptions.
ClaudiaComito Feb 20, 2020
005ccf6
Increasing test coverage.
ClaudiaComito Feb 20, 2020
c6ae05c
Tests fail on 7 nodes, #490
ClaudiaComito Feb 20, 2020
92d4a48
Added tests for gnb.predict_proba()
ClaudiaComito Feb 28, 2020
2c3e4d9
Shape of log_prob_x must match __joint_log_likelihood(X) output
ClaudiaComito Feb 28, 2020
6e6c1f2
scikit-learn predict_proba output for testing/comparison
ClaudiaComito Feb 28, 2020
c6a85be
Resolving conflicts with master
ClaudiaComito Feb 28, 2020
bbb5610
Updated documentation and example of GaussianNB
ClaudiaComito Mar 1, 2020
165cfbb
Adapted documentation from scikit-learn to HeAT.
ClaudiaComito Mar 2, 2020
887f55c
ht.var() now returns same dtype as input tensor in distributed mode a…
ClaudiaComito Mar 2, 2020
50c113a
Extending test coverage to gnb attributes
ClaudiaComito Mar 2, 2020
1596496
turn gnb.priors into ht.DNDarray only if it isn't already
ClaudiaComito Mar 2, 2020
0f4544a
Test coverage gnb.partial_fit(), gnb.priors
ClaudiaComito Mar 2, 2020
2bac14e
Removed dtype mismatch embuggerance between priors.sum() and 1.0
ClaudiaComito Mar 2, 2020
d890e60
More tests
ClaudiaComito Mar 2, 2020
2a14bf5
Merge branch 'master' into features/340-GaussianNB
coquelin77 Mar 2, 2020
c15b0c8
Improving tests
ClaudiaComito Mar 3, 2020
b4cdd20
Merge branch 'features/340-GaussianNB' of https://github.com/helmholt…
ClaudiaComito Mar 3, 2020
ad8804d
7-node test
ClaudiaComito Mar 3, 2020
0ca9244
Removing 7-node testing again
ClaudiaComito Mar 3, 2020
b769ad2
Extending test coverage
ClaudiaComito Mar 3, 2020
74b2d5b
Improved conditional statement as per review
ClaudiaComito Mar 3, 2020
915ee4a
Added references section to documentation
ClaudiaComito Mar 3, 2020
2083434
Added References section to __update_mean_variance documentation
ClaudiaComito Mar 3, 2020
907e38f
Test coverage
ClaudiaComito Mar 3, 2020
4dcfa97
More test coverage
ClaudiaComito Mar 3, 2020
721157b
Removed dead code
ClaudiaComito Mar 3, 2020
a829607
added extra dtype calls
coquelin77 Mar 3, 2020
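Commits 14ffcfc and bf20bfe above introduce and then fix a GaussianNB.logsumexp() helper used when computing class probabilities. The PR page does not reproduce its final body, so the following is only a minimal sketch of the standard max-shift stabilization such a routine is typically built on, written here against plain PyTorch rather than HeAT's DNDarray:

import torch


def logsumexp(a: torch.Tensor, axis: int = 1) -> torch.Tensor:
    # Numerically stable log(sum(exp(a))): subtract the per-row maximum
    # before exponentiating so exp() cannot overflow, then add it back.
    a_max = torch.max(a, dim=axis, keepdim=True).values
    out = torch.log(torch.sum(torch.exp(a - a_max), dim=axis, keepdim=True))
    return (out + a_max).squeeze(axis)


# Joint log-likelihoods this small would underflow a naive exp/sum/log
# (exp(-1200) is 0.0 in double precision, so log() would return -inf).
jll = torch.tensor([[-1200.0, -1210.0], [-3.0, -4.0]])
print(logsumexp(jll, axis=1))  # tensor([-1200.0000, -2.6867])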
2 changes: 1 addition & 1 deletion .travis.yml
@@ -13,7 +13,7 @@ before_install:

install:
- pre-commit run --all-files
- docker exec -t unittest /bin/bash -c '. /root/.bashrc && pip install -q -e .[hdf5,netcdf] && pip list'
- docker exec -t unittest /bin/bash -c '. /root/.bashrc && pip install --progress-bar off -e .[hdf5,netcdf] && pip list'

script:
# Running multiple mpi process count, generate a unique coverage report for each one and merge into one report
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -6,6 +6,7 @@
# v0.3.0

- [#454](https://github.com/helmholtz-analytics/heat/issues/454) Update lasso example
- [#474](https://github.com/helmholtz-analytics/heat/pull/474) New feature: distributed Gaussian Naive Bayes classifier
- [#473](https://github.com/helmholtz-analytics/heat/issues/473) Matmul now will not split any of the input matrices if both have `split=None`. To toggle splitting of one input for increased speed use the allow_resplit flag.
- [#473](https://github.com/helmholtz-analytics/heat/issues/473) `dot` handles 2 split None vectors correctly now
- [#470](https://github.com/helmholtz-analytics/heat/pull/470) Enhancement: Accelerate distance calculations in kmeans clustering by introduction of new module spatial.distance
1 change: 1 addition & 0 deletions heat/__init__.py
@@ -1,5 +1,6 @@
from . import core
from . import cluster
from . import naive_bayes
from . import regression
from . import spatial
from .core import *
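With naive_bayes now imported at package level, the new classifier is reachable as ht.naive_bayes.GaussianNB. A minimal usage sketch, assuming the scikit-learn-style fit/predict API that the commits and tests in this PR describe (the toy data below is illustrative, not from the PR):

import heat as ht

# Six samples, two features, two classes; split=0 distributes rows
# across MPI processes.
X = ht.array(
    [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [8.0, 9.0], [8.1, 8.9], [7.9, 9.1]],
    split=0,
)
y = ht.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0], split=0)

gnb = ht.naive_bayes.GaussianNB()
gnb.fit(X, y)
print(gnb.predict(X))        # predicted class labels, a DNDarray
print(gnb.predict_proba(X))  # per-class probabilities, rows summing to 1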
66 changes: 42 additions & 24 deletions heat/core/dndarray.py
@@ -1290,11 +1290,13 @@ def __getitem__(self, key):
(1/2) >>> tensor([0.])
(2/2) >>> tensor([0., 0.])
"""

l_dtype = self.dtype.torch_type()
if isinstance(key, DNDarray) and key.gshape[-1] != len(self.gshape):
key = tuple(x.item() for x in key)

if not self.is_distributed():

if not self.comm.size == 1:
if isinstance(key, DNDarray) and key.gshape[-1] == len(self.gshape):
# this will return a 1D array as the shape cannot be determined automatically
@@ -1329,6 +1331,7 @@
)

else:

_, _, chunk_slice = self.comm.chunk(self.shape, self.split)
chunk_start = chunk_slice[self.split].start
chunk_end = chunk_slice[self.split].stop
@@ -1373,7 +1376,6 @@
# handle the dimensional reduction for integers
ints = sum([isinstance(it, int) for it in key])
gout = gout[: len(gout) - ints]

if self.split >= len(gout):
new_split = len(gout) - 1 if len(gout) - 1 > 0 else 0
else:
@@ -1400,30 +1402,46 @@
key[self.split] = slice(min(hold), max(hold) + 1, key[self.split].step)
arr = self.__array[tuple(key)]
gout = list(arr.shape)

# if the given axes are not splits (must be ints for python)
# this means the whole slice is on one node
elif key[self.split] in range(chunk_start, chunk_end):
key = list(key)
key[self.split] = key[self.split] - chunk_start
arr = self.__array[tuple(key)]
gout = list(arr.shape)
elif key[self.split] < 0 and self.gshape[self.split] + key[self.split] in range(
chunk_start, chunk_end
):
key = list(key)
key[self.split] = key[self.split] + chunk_end - chunk_start
arr = self.__array[tuple(key)]
gout = list(arr.shape)
else:
warnings.warn(
"This process (rank: {}) is without data after slicing, running the .balance_() function is recommended".format(
self.comm.rank
),
ResourceWarning,
)
# arr is empty
# gout is all 0s and is the proper shape
# if the given axes are not splits (must be ints OR LISTS for python)
# this means the whole slice is on one node
if isinstance(key, list):
indices = key
else:
indices = key[self.split]
key = list(key)
if isinstance(indices, list):
indices = [
index + self.gshape[self.split] if index < 0 else index
for index in indices
]
sorted_key_along_split = sorted(indices)
if sorted_key_along_split[0] in range(
chunk_start, chunk_end
) and sorted_key_along_split[-1] in range(chunk_start, chunk_end):
indices = [index - chunk_start for index in indices]
arr = self.__array[indices]
gout = list(arr.shape)

elif isinstance(key[self.split], int):
key[self.split] = (
key[self.split] + self.gshape[self.split]
if key[self.split] < 0
else key[self.split]
)
if key[self.split] in range(chunk_start, chunk_end):
key[self.split] = key[self.split] - chunk_start
arr = self.__array[tuple(key)]
gout = list(arr.shape)
if 0 in arr.shape:
# arr is empty
# gout is all 0s and is the proper shape
warnings.warn(
"This process (rank: {}) is without data after slicing, running the .balance_() function is recommended".format(
self.comm.rank
),
ResourceWarning,
)

# if the given axes are only a slice
elif isinstance(key, slice) and self.split == 0:
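The __getitem__ changes above let a split DNDarray be indexed with a plain list of global indices, wrapping negative entries by the global shape first. A short sketch of the intended behavior (per the warning in the hunk, ranks left without data after such an access are expected to call .balance_()):

import heat as ht

x = ht.arange(10, split=0)  # rows distributed across processes
print(x[[1, 3, 4]])         # values at the given global indices
print(x[[-3, -2]])          # negative indices wrap: same as x[[7, 8]]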
12 changes: 7 additions & 5 deletions heat/core/indexing.py
@@ -58,20 +58,22 @@ def nonzero(a):
[0/1] tensor([[4, 5, 6]])
[1/1] tensor([[7, 8, 9]])
"""

if a.split is None:
# if there is no split then just return the values from torch
return factories.array(
torch.nonzero(a._DNDarray__array), is_split=a.split, device=a.device, comm=a.comm
)
lcl_nonzero = torch.nonzero(a._DNDarray__array)
is_split = None
else:
# a is split
lcl_nonzero = torch.nonzero(a._DNDarray__array)
_, _, slices = a.comm.chunk(a.shape, a.split)
lcl_nonzero[..., a.split] += slices[a.split].start
gout = list(lcl_nonzero.size())
gout[0] = a.comm.allreduce(gout[0], MPI.SUM)
return factories.array(lcl_nonzero, is_split=0, device=a.device, comm=a.comm)
is_split = 0

if a.numdims == 1:
lcl_nonzero = lcl_nonzero.squeeze()
return factories.array(lcl_nonzero, is_split=is_split, device=a.device, comm=a.comm)


def where(cond, x=None, y=None):
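The nonzero() change above squeezes the result for one-dimensional input, so indexing with it behaves like the NumPy equivalent. A short before/after sketch:

import heat as ht

a = ht.array([0, 1, 0, 2, 3])
print(ht.nonzero(a))  # now 1-D: [1, 3, 4] (previously an n-by-1 column)

b = ht.array([[0, 1], [2, 0]])
print(ht.nonzero(b))  # unchanged for 2-D: one coordinate row per nonzero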
55 changes: 31 additions & 24 deletions heat/core/statistics.py
@@ -319,34 +319,36 @@ def average(x, axis=None, weights=None, returned=False):
if x.gshape != weights.gshape:
if axis is None:
raise TypeError("Axis must be specified when shapes of x and weights differ.")
if isinstance(axis, tuple):
elif isinstance(axis, tuple):
raise NotImplementedError("Weighted average over tuple axis not implemented yet.")
if weights.numdims != 1:
raise TypeError("1D weights expected when shapes of x and weights differ.")
if weights.gshape[0] != x.gshape[axis]:
raise ValueError("Length of weights not compatible with specified axis.")

wgt = factories.empty_like(weights, device=x.device)
wgt._DNDarray__array = weights._DNDarray__array
wgt._DNDarray__split = weights.split

# Broadcast weights along specified axis if necessary
if wgt.numdims == 1 and x.numdims != 1:
if wgt.split is not None:
wgt.resplit_(None)
weights_newshape = tuple(1 if i != axis else x.gshape[axis] for i in range(x.numdims))
wgt._DNDarray__array = torch.reshape(wgt._DNDarray__array, weights_newshape)
wgt._DNDarray__gshape = weights_newshape
wgt_lshape = tuple(
weights.lshape[0] if dim == axis else 1 for dim in list(range(x.numdims))
)
wgt_slice = [slice(None) if dim == axis else 0 for dim in list(range(x.numdims))]
wgt_split = None if weights.split is None else axis
wgt = factories.empty(wgt_lshape, dtype=weights.dtype, device=x.device)
wgt._DNDarray__array[wgt_slice] = weights._DNDarray__array
wgt = factories.array(wgt._DNDarray__array, is_split=wgt_split)
else:
if x.comm.is_distributed():
if x.split is not None and weights.split != x.split and weights.numdims != 1:
# fix after Issue #425 is solved
raise NotImplementedError(
"weights.split does not match data.split: not implemented yet."
)
wgt = factories.empty_like(weights, device=x.device)
wgt._DNDarray__array = weights._DNDarray__array

cumwgt = wgt.sum(axis=axis)

if logical.any(cumwgt == 0.0):
raise ZeroDivisionError("Weights sum to zero, can't be normalized")

# Distribution: if x is split, split to weights along same dimension if possible
if x.split is not None and wgt.split != x.split:
if wgt.gshape[x.split] != 1:
wgt.resplit_(x.split)

result = (x * wgt).sum(axis=axis) / cumwgt

if returned:
@@ -1222,12 +1224,12 @@ def reduce_vars_elementwise(output_shape_i):
mu = torch.mean(x._DNDarray__array, dim=axis)
var = torch.var(x._DNDarray__array, dim=axis, unbiased=bessel)
else:
mu = factories.zeros(output_shape_i, device=x.device)
var = factories.zeros(output_shape_i, device=x.device)
mu = factories.zeros(output_shape_i, dtype=x.dtype, device=x.device)
var = factories.zeros(output_shape_i, dtype=x.dtype, device=x.device)

var_shape = list(var.shape) if list(var.shape) else [1]

var_tot = factories.zeros(([x.comm.size, 2] + var_shape), device=x.device)
var_tot = factories.zeros(([x.comm.size, 2] + var_shape), dtype=x.dtype, device=x.device)
n_tot = factories.zeros(x.comm.size, device=x.device)
Review thread on the n_tot line above:

Member: possible dtype problem here.

Contributor Author: is var() supposed to return a float32? Even if x is float64?

Member: it should not. i guess i missed a couple of dtype calls here. can you add them for me? it should be just in this spot.

Contributor Author: That's what I thought, so is this the possible dtype problem you were talking about? I guess I just misunderstood your first comment.

Member: i just also realized that the dtype of n_tot doesn't matter, because it is only used internally.
var_tot[x.comm.rank, 0, :] = var
var_tot[x.comm.rank, 1, :] = mu
@@ -1259,8 +1261,8 @@ def reduce_vars_elementwise(output_shape_i):
mu_in = 0.0

n = x.lnumel
var_tot = factories.zeros((x.comm.size, 3), device=x.device)
var_proc = factories.zeros((x.comm.size, 3), device=x.device)
var_tot = factories.zeros((x.comm.size, 3), dtype=x.dtype, device=x.device)
var_proc = factories.zeros((x.comm.size, 3), dtype=x.dtype, device=x.device)
var_proc[x.comm.rank] = var_in, mu_in, float(n)
x.comm.Allreduce(var_proc, var_tot, MPI.SUM)

@@ -1322,15 +1324,20 @@ def reduce_vars_elementwise(output_shape_i):

if x.split is None: # x is *not* distributed -> no need to distributed
return factories.array(
torch.var(x._DNDarray__array, dim=axis, unbiased=bessel), device=x.device
torch.var(x._DNDarray__array, dim=axis, unbiased=bessel),
dtype=x.dtype,
device=x.device,
)
elif axis == x.split: # x is distributed and axis chosen is == to split
return reduce_vars_elementwise(output_shape)
else:
# singular axis given (axis) not equal to split direction (x.split)
lcl = torch.var(x._DNDarray__array, dim=axis, keepdim=False)
return factories.array(
lcl, is_split=x.split if axis > x.split else x.split - 1, device=x.device
lcl,
is_split=x.split if axis > x.split else x.split - 1,
dtype=x.dtype,
device=x.device,
)
else:
raise TypeError("axis (axis) must be an int, tuple, list, etc.; currently it is {}. ")
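The statistics.py changes above rework how 1-D weights are broadcast inside ht.average() and thread the input dtype through var(). A minimal sketch of the resulting behavior (shapes and values are illustrative only):

import heat as ht

x = ht.random.randn(4, 3, split=0)
w = ht.arange(3, dtype=ht.float32)

# 1-D weights are broadcast along the chosen axis and normalized by
# their sum: result = (x * w).sum(axis=1) / w.sum().
avg = ht.average(x, axis=1, weights=w)
print(avg.shape)  # (4,)

# var() now keeps the input dtype in distributed mode as well.
x64 = ht.zeros((4, 3), dtype=ht.float64, split=0)
print(ht.var(x64).dtype)  # ht.float64 (no silent fall-back to float32)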
44 changes: 35 additions & 9 deletions heat/core/tests/test_statistics.py
@@ -315,7 +315,7 @@ def test_average(self):
)
size = random_volume.comm.size
random_weights = ht.array(
torch.randn((3 * size,), dtype=torch.float64, device=device), device=ht_device
torch.randn((3 * size,), dtype=torch.float64, device=device), split=0, device=ht_device
)
avg_volume = ht.average(random_volume, weights=random_weights, axis=1)
np_avg_volume = np.average(random_volume.numpy(), weights=random_weights.numpy(), axis=1)
@@ -334,6 +334,28 @@ def test_average(self):
self.assertEqual(avg_volume_with_cumwgt[1].gshape, avg_volume_with_cumwgt[0].gshape)
self.assertEqual(avg_volume_with_cumwgt[1].split, avg_volume_with_cumwgt[0].split)

# check weighted average over all float elements of split 3d tensor (3d weights)

random_weights_3d = ht.array(
torch.randn((3, 3, 3), dtype=torch.float64, device=device), is_split=1, device=ht_device
)
avg_volume = ht.average(random_volume, weights=random_weights_3d, axis=1)
np_avg_volume = np.average(random_volume.numpy(), weights=random_weights.numpy(), axis=1)
self.assertIsInstance(avg_volume, ht.DNDarray)
self.assertEqual(avg_volume.shape, (3, 3))
self.assertEqual(avg_volume.lshape, (3, 3))
self.assertEqual(avg_volume.dtype, ht.float64)
self.assertEqual(avg_volume._DNDarray__array.dtype, torch.float64)
self.assertEqual(avg_volume.split, None)
self.assertAlmostEqual(avg_volume.numpy().all(), np_avg_volume.all())
avg_volume_with_cumwgt = ht.average(
random_volume, weights=random_weights, axis=1, returned=True
)
self.assertIsInstance(avg_volume_with_cumwgt, tuple)
self.assertIsInstance(avg_volume_with_cumwgt[1], ht.DNDarray)
self.assertEqual(avg_volume_with_cumwgt[1].gshape, avg_volume_with_cumwgt[0].gshape)
self.assertEqual(avg_volume_with_cumwgt[1].split, avg_volume_with_cumwgt[0].split)

# check average over all float elements of split 3d tensor, tuple axis
random_volume = ht.random.randn(3, 3, 3, split=0, device=ht_device)
avg_volume = ht.average(random_volume, axis=(1, 2))
@@ -347,16 +369,16 @@ def test_average(self):

# check weighted average over all float elements of split 5d tensor, along split axis
random_5d = ht.random.randn(random_volume.comm.size, 2, 3, 4, 5, split=0, device=ht_device)
axis = 1
random_weights = ht.random.randn(random_5d.gshape[axis], device=ht_device)
axis = random_5d.split
random_weights = ht.random.randn(random_5d.gshape[axis], split=0, device=ht_device)
avg_5d = random_5d.average(weights=random_weights, axis=axis)

self.assertIsInstance(avg_5d, ht.DNDarray)
self.assertEqual(avg_5d.gshape, (size, 3, 4, 5))
self.assertEqual(avg_5d.gshape, (2, 3, 4, 5))
self.assertLessEqual(avg_5d.lshape[1], 3)
self.assertEqual(avg_5d.dtype, ht.float64)
self.assertEqual(avg_5d._DNDarray__array.dtype, torch.float64)
self.assertEqual(avg_5d.split, 0)
self.assertEqual(avg_5d.split, None)

# check exceptions
with self.assertRaises(TypeError):
@@ -372,12 +394,16 @@ def test_average(self):
)
with self.assertRaises(TypeError):
ht.average(random_5d, weights=random_weights, axis=axis)
random_weights = ht.random.randn(random_5d.gshape[axis] + 1, device=ht_device)
random_shape_weights = ht.random.randn(random_5d.gshape[axis] + 1, device=ht_device)
with self.assertRaises(ValueError):
ht.average(random_5d, weights=random_weights, axis=axis)
random_weights = ht.zeros((random_5d.gshape[axis]), device=ht_device)
ht.average(random_5d, weights=random_shape_weights, axis=axis)
zero_weights = ht.zeros((random_5d.gshape[axis]), split=0, device=ht_device)
with self.assertRaises(ZeroDivisionError):
ht.average(random_5d, weights=random_weights, axis=axis)
ht.average(random_5d, weights=zero_weights, axis=axis)
weights_5d_split_mismatch = ht.ones(random_5d.gshape, split=-1, device=ht_device)
with self.assertRaises(NotImplementedError):
ht.average(random_5d, weights=weights_5d_split_mismatch, axis=axis)

with self.assertRaises(TypeError):
ht_array.average(axis=1.1)
with self.assertRaises(TypeError):