
Tune python duplication remediation points #78


Closed
wants to merge 1 commit

Conversation

ABaldwinHunter (Contributor):

  • Reduce AST threshold from 40 to 32 (classic is 28)
  • update point formula to match classic computation

Change from
`remediation_points = n * score`
to
`remediation_points = x + (score - threshold) * y`

This change increases parity with classic and overall increases the number of duplication issues reported.

link to grade comparisons

Note: @gordondiggs and I are exploring exactly why there are parser differences between Classic and Platform - there are a number of relevant factors, including likely differences between versions of Python.

Note on mass difference:

The mass of a node corresponds to its size. Specifying a minimum threshold tells Code Climate to ignore duplication in nodes below a certain size (e.g. one-liners).

The issue's Flay score is the result of its mass * number of occurrences (or mass * occurrences ^ 2, if the code is identical).
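That scoring rule can be sketched as follows. This is my reading of the description above, not the engine's actual code, and the exact identical-code multiplier is an assumption:

```ruby
# Sketch of the Flay scoring rule described above: similar code scores
# mass * occurrences; identical code is penalized by a further factor of
# occurrences. Names and the identical-case multiplier are assumptions.
def flay_score(mass, occurrences, identical: false)
  multiplier = identical ? occurrences**2 : occurrences
  mass * multiplier
end

flay_score(32, 2)                  # => 64  (similar code)
flay_score(32, 2, identical: true) # => 128 (identical code scores higher)
```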

Comparing issue mass between the parsers on Platform and Classic:

| Platform | Classic | Platform / Classic |
| -------- | ------- | ------------------ |
| 42       | 39      | 1.07               |
| 45       | 40      | 1.125              |
| 66       | 57      | 1.15789            |
| 123      | 109     | 1.1284             |
| 126      | 93      | 1.3548             |
| 246      | 218     | 1.1284             |

I've estimated the factor of mass difference to be ~1.15.
Since the default Python duplication mass threshold on Classic was 28, and 28 * 1.15 = 32.2, I've lowered our current default threshold for Python on Platform from 40 to 32.
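The arithmetic behind that choice can be checked directly against the ratios in the table above (the ~1.15 factor is the estimate used in this PR):

```ruby
# Mass values from the Platform/Classic comparison table above.
platform = [42, 45, 66, 123, 126, 246]
classic  = [39, 40, 57, 109, 93, 218]

# Per-issue ratios and their mean; the mean comes out around 1.16,
# close to the ~1.15 estimate used for tuning.
ratios  = platform.zip(classic).map { |p, c| p.to_f / c }
average = ratios.sum / ratios.size

# Scaling Classic's default threshold of 28 by the estimated factor:
scaled = 28 * 1.15        # ~32.2 (floating point)
scaled.round              # => 32, the new Platform default
```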

On Classic, Python duplication issues were penalized in terms of remediation points as follows:

```
remediation_points = 1_500_000 + overage * 50_000
overage = score - threshold
score   = f(mass)
```

I've kept the base points but lowered the per_cost to 30_000 to account for the difference in mass parsing (which gets amplified in the points calculation).

@codeclimate/review


```ruby
def calculate_points(issue)
  BASE_POINTS + (overage(issue) * POINTS_PER_OVERAGE)
end
```
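Filling in the values under discussion (base kept at 1_500_000, per-overage cost lowered to 30_000, threshold 32), the formula works out like this. The constant names and the zero-floor on overage are assumptions for illustration, not the engine's actual definitions:

```ruby
# Hypothetical constants reflecting the values discussed in this PR;
# the real engine defines its own names and may differ.
BASE_POINTS        = 1_500_000
POINTS_PER_OVERAGE = 30_000
THRESHOLD          = 32

# overage = score - threshold, floored at zero for scores at the threshold
def calculate_points(score)
  overage = [score - THRESHOLD, 0].max
  BASE_POINTS + overage * POINTS_PER_OVERAGE
end

calculate_points(32) # => 1_500_000 (no overage: base points only)
calculate_points(82) # => 3_000_000 (overage of 50)
```

At the threshold an issue earns exactly the base points; each point of Flay score over the threshold adds another 30_000.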

ABaldwinHunter (Contributor Author):

Note: I thought grades would get worse with this change, but overall they seem to be about the same.

https://gist.github.com/ABaldwinHunter/e280310b4987766efd8c

ABaldwinHunter (Contributor Author):

Updating these stats

wfleming (Contributor):

Re the comparisons of grades between classic & platform: it looks like these are overall grades across all categories? If we're trying to tune the engines, I think we should be trying to compare this engine specifically to its classic equivalent (duplication category smells). Otherwise I think we're comparing at too high a level to have confidence in a given engine's tuning. This is related to what was discussed in IPM today. I'd love to see the sum of remediation points for duplication issues between classic & platform.

ABaldwinHunter (Contributor Author):

@wfleming I think that's a good point, and so the table actually includes a column comparing only duplication and complexity between Platform and Classic, and omitting style.

The complexity issues have already been tuned, so I expect them to have roughly the same impact. It could be interesting to see precisely how many points per category each repo gets, but I'm not sure it's that much more helpful than comparing issues and overall GPA?

If we think that'd be useful to see though, I can work on some reverse engineering of remediation points and add a table of those.

After updating the stats, it looks like the grades are significantly worse but no incredibly dissimilar from Classic.

wfleming (Contributor):

> it looks like the grades are significantly worse but no incredibly dissimilar from Classic.

…What? I'm not sure what you mean here. "Significantly worse" & "not incredibly dissimilar" sound contradictory.

> If we think that'd be useful to see though, I can work on some reverse engineering of remediation points and add a table of those.

I don't think you need to reverse engineer anything: you can query for duplication issues in the smells mongo table for the respective snapshots you're looking at, and compare the remediation points for the smell records that seem to correlate (same file/location) between classic & platform. The number of duplication issues I saw in that gist seemed low enough that correlating by hand doesn't seem arduous.

There is a direct relationship between remediation points & repo grade. So if we want to feel confident that duplication is going to result in similar grades on platform as it did on classic, I think the best way to know that is by comparing the solid numbers we get from duplication issues between the two architectures, without all the other categories potentially muddying the waters.

ABaldwinHunter (Contributor Author):

@wfleming

> it looks like the grades are significantly worse but no incredibly dissimilar from Classic.

it looks like the grades are significantly worse than currently on Platform, but not dissimilar from Classic.

> There is a direct relationship between remediation points & repo grade.

True. It's important to note here though that we're not only comparing total remediation points, but also the variety of duplication issues found.

I can certainly query the database and make some stats if we're interested in seeing how many duplication-based remediation points are reported for a given repo or given issue on Classic and Platform.

My gut from the work so far is that the complexity-with-duplication grades are reflective of the duplication point differences.

wfleming (Contributor):

> it looks like the grades are significantly worse than currently on Platform, but not dissimilar from Classic.

Ah, thanks, that makes sense.

```diff
@@ -27,7 +27,7 @@
   "path" => "foo.py",
   "lines" => { "begin" => 1, "end" => 1 },
 })
-expect(json["remediation_points"]).to eq(54000)
+expect(json["remediation_points"]).to eq(3000000)
```
Contributor:

💄 3_000_000

ABaldwinHunter (Contributor Author):

👍

ABaldwinHunter (Contributor Author):

@codeclimate/review @wfleming

I made a script to collect stats that compare python duplication remediation points on classic, current platform, and platform with proposed updates: https://gist.github.com/ABaldwinHunter/da407f6657ef30a05069

The proposed new points formula brings our analysis closer to Classic, but in some cases assigns significantly steeper penalties than on Classic for the same issue. I think classic may add additional remediation cost per occurrence when multiple instances of code are found (they show up as individual smells, but share a fingerprint) - but am currently double-checking.

To reduce the impact of threshold differences, I might consider reducing the per_cost variable further (already reduced from 50_000 to 30_000).

I've also only run these stats on Flask. I could run them on a few additional repositories.

wfleming (Contributor):

Nice to see that the new points values have the same number of significant digits as classic! The difference between new platform/classic looks to be a ratio of ~2.17 on average. The remediation points -> GPA algorithm is an exponential step-off (i.e. twice as many points means one letter grade drop), so in terms of resultant grades at least, it seems like more per_cost tweaking might be valuable.
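A toy model of that step-off, illustrative only (the real points -> GPA algorithm is not shown in this thread): if each doubling of points costs one letter grade, the number of grades dropped relative to a baseline grows logarithmically in the points ratio.

```ruby
# Toy model: each doubling of remediation points costs one letter grade.
# Not the real mapping; it only illustrates the "exponential step-off".
def letter_grades_dropped(points, baseline)
  return 0.0 if points <= baseline
  Math.log2(points.to_f / baseline)
end

letter_grades_dropped(2_000_000, 1_000_000) # => 1.0 (double the points, one grade)
letter_grades_dropped(2_170_000, 1_000_000) # ~1.12 (the ~2.17x ratio above costs about one grade)
```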

@ABaldwinHunter ABaldwinHunter force-pushed the abh-python-tuning branch 5 times, most recently from ef7d7e1 to d78e903 Compare January 22, 2016 23:50
ABaldwinHunter (Contributor Author):

@codeclimate/review

Ready for re-review

Summary:

  • Update remediation points calculation (matches Classic, with one difference: don't penalize extra for identical code)
  • Bug fix: Report issue for each file with duplication, instead of only first location
  • Use mass as node size, not Flay score

ABaldwinHunter (Contributor Author):

cc @brynary

ABaldwinHunter (Contributor Author):

Note: this PR presents a few changes. I like keeping them together because they're finished and feel related, but can see the argument for splitting the PR if that'd be cleaner.

```diff
@@ -9,7 +9,7 @@ When you violate DRY, bugs and maintenance problems are sure to follow. Duplicat
 ## Issue Mass
 
 Duplicated code has a calculated mass, which can be thought of as a measure of how much logic has been duplicated.
-This issue has a mass of `<%= issue.mass %>`: if you would like to change the minimum mass that will be reported as an issue, please see the details in [`codeclimate-duplication`'s documentation](https://github.com/codeclimate/codeclimate-duplication).
+This issue has a mass of `<%= mass %>`: if you would like to change the minimum mass that will be reported as an issue, please see the details in [`codeclimate-duplication`'s documentation](https://github.com/codeclimate/codeclimate-duplication).
```
Contributor:

If we're putting this in the description, I think we can drop it from the read up. I thought that got done?

Contributor:

Oops, I realized #77 hasn't shipped yet.

brynary (Member) commented Jan 25, 2016:

I have not reviewed this PR or read the context. I just saw one comment @pbrisbin made and it set off a flag...

The UI does not and should not attempt to "collapse" multiple Issues of duplication within the same file caused by multiple occurrences of the same code structure within a single file.

If a duplication is within a single file, and there are e.g. 2 occurrences, we should produce and render 2 Issues on the page. The count should show "2" in the sidebar also (not 1).

This is an important change we made over a year ago. :) /c @pbrisbin @ABaldwinHunter

pbrisbin (Contributor):

@brynary Perfect, thanks for weighing in. It seems I'm remembering when such a feature was discussed or implemented, but wasn't around (or don't remember) when it was reverted or rejected. It's certainly easier to not have that UI logic, so that's good.

Can you speak generally to duplication fingerprinting? Does each instance get a unique fingerprint? (It sounds like yes.) Would you consider https://github.com/codeclimate/app/blob/master/app/models/smells_counter.rb#L50 a bug / hold-over from when some sort of duplication collapsing was present in the UI? It also sounds like this is a yes, but the answer could depend on whether there are any other cases where duplications might be, well, duplicated and need uniqing.

ABaldwinHunter (Contributor Author):

@pbrisbin I'm also leaning toward us saying that each duplication instance gets a unique fingerprint, and then any new instances or removals appear as new and fixed issues, rather than one improved situation.

But curious what B thinks.

pbrisbin (Contributor):

Seems pretty well answered to me:

> If a duplication is within a single file, and there are e.g. 2 occurrences, we should produce and render 2 Issues on the page. The count should show "2" in the sidebar also (not 1).

This discussion is also not meant to decide on a new Right Thing -- we're trying to get to the bottom of how Classic works so we can preserve it.

brynary (Member) commented Jan 25, 2016:

@pbrisbin I believe that on Classic each occurrence does get the same fingerprint, but that was mostly incidental relative to the context. I suspect either way would be acceptable for now, but probably safest to use what Classic does to avoid any surprises.

@ABaldwinHunter ABaldwinHunter force-pushed the abh-python-tuning branch 5 times, most recently from 3648fb6 to 6ff56f3 Compare January 25, 2016 19:30
ABaldwinHunter (Contributor Author):

@wfleming Ready for re-review!

I think I addressed the code concerns you mentioned -

  1. reorganize violation handling of current_ and other_ sexps using an Issue class
  2. use sexp.mass instead of flay score reverse engineering
  3. don't use other_sexps.count in fingerprint to avoid false positives
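To illustrate point 3 — why folding the occurrence count into the fingerprint invites false positives — here is a hypothetical fingerprint function. It is not the engine's actual implementation; the names, inputs, and hashing scheme are all assumptions:

```ruby
require "digest"

# Hypothetical fingerprint: a hash of the issue's location plus an
# optional occurrence count. Not the engine's real implementation.
def fingerprint(path, node_id, occurrence_count = nil)
  parts = [path, node_id, occurrence_count].compact
  Digest::MD5.hexdigest(parts.join("|"))
end

# With the count baked in, adding a third copy of the same duplication
# changes the fingerprint, so the existing issue looks "fixed" and a
# spurious "new" one appears:
fingerprint("foo.py", "def_block", 2) == fingerprint("foo.py", "def_block", 3) # => false

# Without the count, the fingerprint stays stable as copies come and go:
fingerprint("foo.py", "def_block") == fingerprint("foo.py", "def_block") # => true
```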

Thing I didn't do and am trying to squeeze by: break into separate PRs and commits.

Included a clear bulleted summary in the commit. I don't think reverse engineering the commit history is worthwhile at this point, especially because the changes are intertwined, but I can go ahead and take the time to do it if we prefer.

cc @codeclimate/review @brynary

"fingerprint": fingerprint,
}
end # rubocop:enable Metrics/MethodLength

ABaldwinHunter (Contributor Author):

This code is largely a clean transplant from Violation. Added comments to disable the method length check here because it didn't seem useful.

@ABaldwinHunter ABaldwinHunter force-pushed the abh-python-tuning branch 2 times, most recently from 3722c97 to e3b46e6 Compare January 25, 2016 20:04
1. FIX - Use node size as mass, instead of flay score
2. FIX - Report issue for each instance of duplicated code, not just
first sexp.

3. UPDATE: Tune Python Remediation Points

- Reduce AST threshold from 40 to 32 (classic is 28)
- update point formula to match classic computation
- don't penalize extra for identical duplication

Change from
     remediation_points = x * score
to
     remediation_points = x + (score-threshold) * y

Since remediation points are a function of effort required to fix an
issue, we're making a behavioral change to not penalize extra for
identical duplication.
ABaldwinHunter (Contributor Author):

Closing because this branch has been harvested and starfished into others.

@ABaldwinHunter ABaldwinHunter deleted the abh-python-tuning branch January 27, 2016 23:20
4 participants