add test for dataclusters #54

stevenhua0320 · 2024-08-07T10:34:13Z

No description provided.

sbillinge

This is great, nice job!

It might be cleaner if we define an eq method in the class itself and then just assert that expected == actual.

Also, possibly passing in a dict to the test and using this to instantiate the class may make it more readable.

We are only testing one thing at the moment, which is fine, but we can decide if there are more edge cases we want to test. It can get time consuming so there is some balance between testing everything (better) and nothing (quicker in the short term).

To think about this, remember we are testing "behavior" not code per se. So putting text comments that capture what behavior each test is testing can be helpful.

stevenhua0320 · 2024-08-07T16:47:00Z

This is great, nice job!

It might be cleaner if we define an eq method in the class itself and then just assert that expected == actual.

Also, possibly passing in a dict to the test and using this to instantiate the class may make it more readable.

We are only testing one thing at the moment, which is fine, but we can decide if there are more edge cases we want to test. It can get time consuming so there is some balance between testing everything (better) and nothing (quicker in the short term).

To think about this, remember we are testing "behavior" not code per se. So putting text comments that capture what behavior each test is testing can be helpful.

Prof I'm not sure passing in a dict to the test means? Is that mean I need to store all the attribute info into a dict and then instantiate the class by directly referencing the dict?

codecov · 2024-08-07T17:42:10Z

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

sbillinge

nice job. Please see comment.s

sbillinge · 2024-08-07T19:40:03Z

src/diffpy/srmise/dataclusters.py

@@ -68,6 +68,21 @@ def __init__(self, x, y, res):
    def __iter__(self):
        return self

+    def __eq__(self, other):
+        # this function makes sure two DataClusters object is equal. Namely equal here means


please remove comment

sbillinge · 2024-08-07T19:57:07Z

src/diffpy/srmise/tests/test_dataclusters.py

+    actual.clear()
+
+    # Assert each expected attribute against its actual value after clearing
+    for attr, expected_value in expected.items():


this is no longer needed. Just instantiate an expected with the outputs and then assert equal. For readability, somthing like

input_cluster = DataClusters(x=inputs["x"]....) expected = DataClusters(x=outputs["x"].....) actual = input_cluster.clear() assert expected == actual

If the clear operates in place and you can't assign to a new instance, then make the last line assert expected == actual.clear

Initially I tried this approach but then I realized that we can't instantiate the expected object given those expected parameters. The reason why it cannot is because there is a function in the class called setdata that does not allow instantiating a object with 0 resolution as what clear does. here is part of the code that setdatadefines:
if res <= 0: raise ValueError("Resolution res must be greater than 0.")

The only way that makes your structure works is to refactor the source code here to change from <= to <, but I'm not sure if this would change the behavior of the function as Luke originally plans...

Why is it possible to set a zero resolution in clear but not when instantiating? This is the question. This is probably an issue with the code itself. Actually, this is one reason we love tests, because we discover things like this....

Remember, tests test "behavior", not code. What behavior do we want for the clear method?

Is clear used anywhere in the code? Try and figure out what it is used for, then make a suggestion for your preferred fix.

It is used in the initilization (init)of the object. Basically, it guarantees that it could create a clean object in the instantiating step. The code reference is here
self.clear()
self.setdata(x, y, res)

Did not expect to see a flurry of activity on SrMise and thought I should chime in where it would be helpful. A res of 0 should produce the trivial clustering -- each point is its own cluster. I most likely required res > 0 because the point of the class is to find non-trivial clusterings and because in an early stage of development this class was more tightly coupled to the physical resolution of the PDF, which certainly has a positive value. Setting setting res = 0 when clearing was a transient stand in for an uninitialized state.

Considered purely as a clustering method the trivial clustering is reasonable default behavior and the only change necessary is in setdata(), along the lines of

if res < 0: raise ValueError("Resolution res must be non-negative.")

If you make that change, the docstring for setvars() in pdfpeakextraction.py should probably also be updated to state
"cres: The clustering resolution, must be >= 0".

Thanks Luke (@Ainamacar ) it is really great to have your input. I hope all is well with you!

We are trying to refactor all our diffpy packages to a standard format this summer to make maintenance easier moving forward. It is also great training for new people like Steven (@stevenhua0320 ). Any input you can provide is really appreciated!

Thank you, Simon! I will email you a brief personal update.

In any case, I'd be happy to consult on this effort as I'm able. I'm glad to see the whole diffpy effort is going strong!

sbillinge

please read all the inline comments.

sbillinge · 2024-08-08T11:20:36Z

src/diffpy/srmise/dataclusters.py

+            and self.lastpoint_idx == other.lastpoint_idx
+            and self.status == other.status
+        )
+
    def clear(self):
        """Clear all members, including user data."""


we need a better docstring. When we touch code, we like to improve it.

src/diffpy/srmise/tests/test_dataclusters.py

sbillinge · 2024-08-08T11:23:10Z

src/diffpy/srmise/tests/test_dataclusters.py

+            {
+                "x": np.array([]),
+                "y": np.array([]),
+                "data_order": np.array([], dtype=np.int32),


why are we setting these to int32?

Because in the original code it set the attribute into int32, I'm not so sure whether modification would influence the behavior, so I retain the format from the original code to test.

yes, I kind of know why you did that, but I am also trying to emphasize a point that tests test behavior, not code, so we don't design tests to make code pass, we design tests to ensure specific behaviors, then we write code to make sure the tests pass. Then we are sure that the code has the right behavior. So understand why that is int32 and ask the question, why and is that the best thing? We may not find the answer and then we can default to mimicking the original code, but we should at least go through that process.

Both data_order and clusters store indices, which should be integers. (It looks like I also did this with check_qualidx in modelcluster.py.) So I believe this functions as a type hint or perhaps a vestige of habits from strongly typed languages.

Why int32? I don't remember, but I can make an educated guess. During original dev I used both 32 bit and 64 bit Python at various times and this might be a (very weak) gesture to compatibility between versions when serializing. Alternately, sometimes I was getting different results on the same input and, IIRC, found that small numerical differences between the 32 bit and 64 bit versions occasional caused different in branching during the recursive steps of peak extraction. So I may have specified int32 rather than int as a very early part of those debugging efforts, but never removed it.

I don't think changing this should have an impact -- if I knew about some subtle behavior I almost certainly would have left a comment. My opinion is that it would be better to complete the conversion to Python3 and test on some full PDFs before seeing what happens if this is changed. How that impacts what test should be written now, I'll leave to Simon.

yes, please remove. Per @Ainamacar this is technical debt and it generally better to remove it when we find it, than "trying to remember to remove it later if we need to".....

src/diffpy/srmise/tests/test_dataclusters.py

sbillinge · 2024-08-08T11:24:48Z

src/diffpy/srmise/tests/test_dataclusters.py

+    ],
+)
+def test_reset_clusters(inputs, expected):
+    actual = DataClusters(x=inputs["input_x"], y=inputs["input_y"], res=inputs["input_res"])


please see comment above about naming these.

sbillinge · 2024-08-08T11:26:22Z

src/diffpy/srmise/tests/test_dataclusters.py

+    assert actual == expected
+
+
+@pytest.mark.parametrize(


please see comments above on this too. Please start by doing a more informative docstring for each method you are testing. This helps to understand what each method is doing. It will also appear in the docs when we build them.

…duplicated case for testing behavior, remove other tests.

sbillinge

please see comments

src/diffpy/srmise/dataclusters.py

…numpy array.

sbillinge

Great job @stevenhua0320 this is looking great now. One final comment.

sbillinge · 2024-08-09T10:17:01Z

src/diffpy/srmise/dataclusters.py

+        """
+        Clear all data and reset the cluster object to a transient initial state.
+
+        This method performs the following steps:


this block is too much information, so we can remove it. It is information that is probably not needed by a user and a developer can look at the code to get the precise definition.

src/diffpy/srmise/dataclusters.py

sbillinge

nearly there. Please see inline comment. We also want to finish up the changes that allow the assert expected == actual to pass as per the conversation.

stevenhua0320 · 2024-08-09T16:24:41Z

nearly there. Please see inline comment. We also want to finish up the changes that allow the assert expected == actual to pass as per the conversation.

Simon, I tried the approach again, but it still seems that it could create error in this approach. The reason is that if we set expected for x and y be empty numpy list, then in the setdata function, it has two lines of code that correlates that.
self.data_order = self.y.argsort() # Defines order of clustering
self.clusters = np.array([[self.data_order[-1], self.data_order[-1]]])
it means it could give an index of list based on y value. In this case, the list would be empty. But in the second line of code, since the cluster needs to take on value of the last index of value, it could cause IndexOutOfBounds error.

sbillinge · 2024-08-09T18:33:32Z

nearly there. Please see inline comment. We also want to finish up the changes that allow the assert expected == actual to pass as per the conversation.

Simon, I tried the approach again, but it still seems that it could create error in this approach. The reason is that if we set expected for x and y be empty numpy list, then in the setdata function, it has two lines of code that correlates that. self.data_order = self.y.argsort() # Defines order of clustering self.clusters = np.array([[self.data_order[-1], self.data_order[-1]]]) it means it could give an index of list based on y value. In this case, the list would be empty. But in the second line of code, since the cluster needs to take on value of the last index of value, it could cause IndexOutOfBounds error.

The test tests behavior and so we want the test to be what we want. Then you need to modify code if the test is not passing. You can't change the test to make things pass.

sbillinge · 2024-08-09T23:07:51Z

@stevenhua0320 to get the hand of this, why don't you write some tests for the eq method you wrote. Since you write the function it will be easier to write tests.

Normally we would write the tests before the function though....

sbillinge · 2024-08-09T23:08:06Z

Btw, please put that on a separate PR

…or of the test passed.

sbillinge

This is good. Please see inline comments.

I also have a comment about the eq method, but we can deal with that on the separate PR with the tests for that.

sbillinge · 2024-08-10T12:42:09Z

src/diffpy/srmise/dataclusters.py

-
-        self.status = self.READY
+        # If x sequence size is empty, set the object into Initialized state.
+        if x.size == 0 and res == 0:


what will happen if x.size == 0 but res != 0?

This is not what I meant. I think that we want the clear method test to pass (which it was) but if we were to write tests for this method, there would be other situations than x.size == 0 and res == 0 and these were not being handled in your code fix.

Let's just revert this commit and I will merge, since it is passing tests, but write tests for this setdata method on a separate PR.

sbillinge · 2024-08-10T12:44:55Z

src/diffpy/srmise/tests/test_dataclusters.py

+from diffpy.srmise.dataclusters import DataClusters
+
+
+@pytest.mark.parametrize(


since we are only doing one test here, we don't need the mark.parameterize, you can just load the values directly into the test below.

stevenhua0320 · 2024-08-10T14:12:54Z

This is good. Please see inline comments.

I also have a comment about the eq method, but we can deal with that on the separate PR with the tests for that.

Yeah I've seen that in the message. Since this is a method that is in the Datacluster class, I would like to push eq PR after this test is merged.

sbillinge

please see comment

sbillinge · 2024-08-10T15:55:45Z

src/diffpy/srmise/dataclusters.py

-
-        self.status = self.READY
+        # If x sequence size is empty, set the object into Initialized state.
+        if x.size == 0 and res == 0:


This is not what I meant. I think that we want the clear method test to pass (which it was) but if we were to write tests for this method, there would be other situations than x.size == 0 and res == 0 and these were not being handled in your code fix.

Let's just revert this commit and I will merge, since it is passing tests, but write tests for this setdata method on a separate PR.

sbillinge · 2024-08-10T16:25:57Z

excellent @stevenhua0320 !

* Lint check & fix to python3 format (#18) * lint check and change files from python2 to python3 * pre-commit check for these files * lint check & change to python3 & pre-commit check (#19) * lint check & change to python3 & pre-commit check * [pre-commit.ci] auto fixes from pre-commit hooks --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * lint check and fix print and exception python2 issue (#20) * lint check and fix python2 print and exception issues (#21) * lint check and fix python2 print and exception issues * [pre-commit.ci] auto fixes from pre-commit hooks --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * finish parenthesizing print statements (#24) * finish parenthesizing print statements * [pre-commit.ci] auto fixes from pre-commit hooks --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix too many leading #, import modules, and unused var (#29) * requirements (#30) * fix import module not used & string check (#25) * fix too many leading "#" in string block (#26) * lint check, remove unused import modules & remove too many "#". (#27) * remove unused modules, ambiguous variable name (#28) * cleaning (#31) * requirements * clean out __init__ * replace ### * ins not none in modelevaluators base * Copyright (#32) * requirements * basefunction * all the copyright statements * lint check, fix break import modules, remove unused import modules, remove some # (#33) * fix break import modules, remove unused import modules, fix docstring length (#34) * fix formatting issue and typo in copyright (#35) * clean out inits (#38) * clean out inits * [pre-commit.ci] auto fixes from pre-commit hooks * dataclusters.py, modelevaluators/aicc and modelparts.py --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * peakextraction.py and init (#40) * move untrack doc and requirement files (#41) * move untrack doc and requirement files * add requirement in run.txt * [pre-commit.ci] auto fixes from pre-commit hooks --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add pyproject.toml (#42) * add pyproject.toml * [pre-commit.ci] auto fixes from pre-commit hooks * update classifiers pyproject.toml * Delete setup.py --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * move diffpy files to src dir (#44) * move diffpy files to src dir * [pre-commit.ci] auto fixes from pre-commit hooks * add Luke to authors --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * LICENSE (#45) * add two LICENSE.rst files into cookiecutter * fix LICENSE.rst and LICENSE_PDFgui.rst with correct references and year * resolve pdfdataset.py conflict --------- Co-authored-by: Simon Billinge <sbillinge@users.noreply.github.com> * add untrack files and add cookiecut.rst news (#46) * add untrack files and add cookiecut.rst news * delete README.txt * fix py2 -> py3, fix broken import, remove deprecation warning (#47) * fix py2 -> py3, move deprecation warning * fix search & split in binary files * fix broken import, remove deprecated pkg_resource (#50) * change import path to make it work. (#48) * fix import modules, py2->py3 (#49) * fix broken import in doc, change README to rst file. (#51) * fix broken import in doc, change README to rst file. * fix os getcwd method * fix p2 to p3 (#52) * add test for dataclusters (#54) * add test for dataclusters * define eq method in dataclusters.py * change parametrization form * add one more case and change reference name to actual * delete comment * add two more tests for DataClusters class function. * change in docstring for clearer explanation for clear method, remove duplicated case for testing behavior, remove other tests. * change clear method docstring into numpydoc format. Delete dtype for numpy array. * remove block * Make edition to condition on res, refactor for setdata to make behavior of the test passed. * change condition on res * add condition on x and res are incompatible, update test. * revert change in setdata method. * Eq tests (#59) * remove diffpy/srmise tree * test for eq * add attributes in eq method (#60) * Add set data test cases (#61) * add test cases to test files and make edition to make sure the behavior of the test pass. * [pre-commit.ci] auto fixes from pre-commit hooks * change case in test__eq__ to be compatible with the behavior of setdata * delete text and redundant tests * tweaking error message in DataClusters * [pre-commit.ci] auto fixes from pre-commit hooks * update test for checking implicit attributes for setdata function * [pre-commit.ci] auto fixes from pre-commit hooks * update test for setdata function * update setdata test to right format. * update to constructor test & make setdata clear function private * final tweaks to tests by Simon * fix actual_attribute typo * final refactor of actual_attr --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Simon Billinge <sbillinge@users.noreply.github.com> * fix arbitrary.py to numpydoc format (#68) * fix arbitrary.py to numpydoc format * pre-commit fix * change start sentence to 'The' * print things correctly (#71) * print things correctly * change to f string * reduce print to one line * change createpeak to actualize function (#72) * fix import and counting to make it work (#74) * refactor makeclusters to make it work (#73) * deprecation remove (#78) * deprecation remove * fix to right behavior * Revert "refactor makeclusters to make it work (#73)" (#79) This reverts commit 3773bcf. * try out py2 before py3 refactor to make sure correct workflow (#75) * fix false counting and numpy to int (#80) * fix false counting and numpy to int * [pre-commit.ci] auto fixes from pre-commit hooks --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * numpydoc edition (#81) * change peakextraction function to numpydoc * pre-commit run * remove unused import * numpydoc build (#82) * numpydoc build on peakstability (#83) * numpydoc build for ModelCluster (#85) * numpydoc build for multimodelselection.py (#87) * numpydoc documentation build for ModelCluster class (#86) * numpydoc build for pdfdataset (#88) * numpydoc build for pdfpeakextraction.py (#89) * numpydoc build for gaussianoverr.py (#91) * numpydoc build for gaussianoverr.py * [pre-commit.ci] auto fixes from pre-commit hooks * fix pre-commit --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * terminationripples.py numpydoc build (#92) * numpydoc build for gaussian.py (#90) * numpydoc build for gaussian.py * [pre-commit.ci] auto fixes from pre-commit hooks * pre-commit fix * update for FWHM and maxwidth * update for starting sentence --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * numpydoc build for base.py (#95) * numpydoc build for polynomial.py (#97) * numpydoc build for fromsequence.py (#99) * numpydoc build for nanospherical.py (#98) * numpydoc build for base.py in Baseline class (#96) * numpydoc build for base.py * [pre-commit.ci] auto fixes from pre-commit hooks --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * numpydoc build for aic.py (#93) * numpydoc build for aicc.py (#94) * numpydoc build for ModelCovariance (#84) * numpydoc build for ModelCovariance * update format type and fix indentation issue * numpydoc build for modelparts.py (#100) * numpydoc build for modelparts.py * [pre-commit.ci] auto fixes from pre-commit hooks --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * numpydoc build for basefunction.py (#101) * api workflow build for diffpy.srmise (#102) * api workflow build for diffpy.srmise * [pre-commit.ci] auto fixes from pre-commit hooks --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add changed news (#103) --------- Co-authored-by: Rundong Hua <157993340+stevenhua0320@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

add test for dataclusters

5bf4556

sbillinge reviewed Aug 7, 2024

View reviewed changes

define eq method in dataclusters.py

e3fe16f

change parametrization form

fdab59c

add one more case and change reference name to actual

82888bb

sbillinge reviewed Aug 7, 2024

View reviewed changes

stevenhua0320 added 2 commits August 8, 2024 11:06

delete comment

f63abd5

add two more tests for DataClusters class function.

b959a09

sbillinge reviewed Aug 8, 2024

View reviewed changes

change in docstring for clearer explanation for clear method, remove …

de2e964

…duplicated case for testing behavior, remove other tests.

sbillinge reviewed Aug 8, 2024

View reviewed changes

src/diffpy/srmise/dataclusters.py Outdated Show resolved Hide resolved

change clear method docstring into numpydoc format. Delete dtype for …

b25e9a4

…numpy array.

sbillinge reviewed Aug 9, 2024

View reviewed changes

remove block

17582f0

sbillinge reviewed Aug 9, 2024

View reviewed changes

stevenhua0320 added 2 commits August 10, 2024 11:30

Make edition to condition on res, refactor for setdata to make behavi…

3638253

…or of the test passed.

change condition on res

ca8a169

sbillinge reviewed Aug 10, 2024

View reviewed changes

add condition on x and res are incompatible, update test.

2105474

sbillinge reviewed Aug 10, 2024

View reviewed changes

revert change in setdata method.

cdaa3f2

sbillinge merged commit 89c8c0b into diffpy:Cookie Aug 10, 2024
3 checks passed

		from diffpy.srmise.dataclusters import DataClusters


		@pytest.mark.parametrize(

add test for dataclusters #54

add test for dataclusters #54

Uh oh!

Conversation

stevenhua0320 commented Aug 7, 2024

Uh oh!

sbillinge left a comment

Choose a reason for hiding this comment

Uh oh!

stevenhua0320 commented Aug 7, 2024

Uh oh!

codecov bot commented Aug 7, 2024

Welcome to Codecov 🎉

Uh oh!

sbillinge left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sbillinge left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sbillinge left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sbillinge left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sbillinge left a comment

Choose a reason for hiding this comment

Uh oh!

stevenhua0320 commented Aug 9, 2024

Uh oh!

sbillinge commented Aug 9, 2024

Uh oh!

sbillinge commented Aug 9, 2024

Uh oh!

sbillinge commented Aug 9, 2024

Uh oh!

sbillinge left a comment

Choose a reason for hiding this comment