joaoneto9

Describe your change:

Added an optimized version of the prune function using Counter to improve performance
when checking candidate itemsets for frequent items.

As a benchmark, I ran both versions on itemset lists of gradually increasing size to expose
the inefficiency of the original algorithm, which has a time complexity of O(n * c * i),
where n is the size of itemset, c is the number of candidates, and i is the number of
items in each candidate.

The new solution reduces the complexity to O(n + c * i). Previously, for each item of each
candidate, the algorithm scanned itemset once to check membership (O(n)) and again to count
occurrences (O(n)), so the same costly work was repeated on every check.

To optimize this, I used an auxiliary dictionary (via Counter) where each key is an
item and its value is the number of occurrences in itemset. After a single O(n) pass to
build it, both the membership check and the count lookup take O(1) time on average.

As a result, the performance improvement is significant, at the cost of a small amount of
additional memory, which is a worthwhile trade-off. The improvement can be seen by
comparing the execution times of both algorithms (as shown in the attached graph).

Here is the graph comparing both functions:
pruneOptimized_prune_algoritm_results.pdf

I also ran unit tests locally to confirm that the two methods return the same results, but they are not included in this PR.

  • Add an algorithm?
  • Fix a bug or typo in an existing algorithm?
  • Add or change doctests? -- Note: Please avoid changing both code and tests in a single pull request.
  • Documentation change?

Checklist:

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • This PR only changes one algorithm file. To ease review, please open separate PRs for separate algorithms.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doctests that pass the automated testing.
  • All new algorithms include at least one URL that points to Wikipedia or another similar explanation.
  • If this pull request resolves one or more open issues then the description above includes the issue number(s) with a closing keyword: "Fixes #ISSUE-NUMBER".

algorithms-keeper bot added the "tests are failing" label (Do not merge until tests pass) on Sep 24, 2025
algorithms-keeper bot added the "awaiting reviews" label (This PR is ready to be reviewed) on Sep 24, 2025
algorithms-keeper bot removed the "tests are failing" label (Do not merge until tests pass) on Sep 24, 2025
@joaoneto9

I hadn't realized that the items in itemset could themselves be lists. Lists are not hashable, so they cannot be used as keys in the Counter; I now convert each item to a tuple, which is immutable and hashable. This adds a slight overhead, since every item must be converted to a tuple before it is looked up in the Counter. Nonetheless, there is a significant efficiency gain in the worst case, and I expect average-case performance to improve as well, although I have not yet tested those scenarios or generated graphs for them. Below is the graph reflecting the new modification.
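
The tuple conversion described above might look roughly like this. It is a sketch under assumptions: the function name `prune_with_tuple_keys` is hypothetical, each item is assumed to be a flat list of hashable values, and the `length - 1` threshold is assumed rather than taken from the PR diff.

```python
from collections import Counter


def prune_with_tuple_keys(itemset: list, candidates: list, length: int) -> list:
    """Sketch: count list-valued items by converting each one to a tuple."""
    # Lists are unhashable, so each item is converted to a tuple before counting.
    counts = Counter(tuple(item) for item in itemset)
    pruned = []
    for candidate in candidates:
        keep = True
        for item in candidate:
            key = tuple(item)  # per-lookup conversion: the overhead noted above
            if key not in counts or counts[key] < length - 1:
                keep = False
                break
        if keep:
            pruned.append(candidate)
    return pruned


itemset = [["A", "B"], ["A", "C"], ["A", "B"]]
candidates = [
    [["A", "B"]],
    [["A", "B"], ["A", "C"]],
    [["B", "C"]],
]
result = prune_with_tuple_keys(itemset, candidates, 3)
```

Passing the list-valued items to Counter directly would raise `TypeError: unhashable type: 'list'`, which is why the conversion is needed. The conversion costs time proportional to the length of each item, so it only pays off when itemset is large enough for the repeated scans to dominate.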

pruneOptimized_prune_algoritm_results.pdf
