
[ENH] Add Dynamic Alphabet Sizes for SFA #2844

Open
wants to merge 17 commits into
base: main


patrickzib
Contributor

@patrickzib patrickzib commented May 26, 2025

This PR introduces the concept of dynamic alphabet sizes to SFA.

The alphabet size is treated as a budget that is distributed across all coefficients to maximize the tightness of the lower bound. Each coefficient's alphabet size is assigned in proportion to its variance, using one of three strategies:

  • Linear-proportional to variance
  • Sqrt-proportional to variance
  • Log2-proportional to variance
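A minimal sketch of how the three strategies could spread a symbol budget over the coefficients. It follows the `variance / variance.mean()` normalization and the `np.log2(... + 1)` form visible in the reviewed diff; the names `TRANSFORMS` and `budget_shares`, the exact linear and sqrt forms, and the sample variances are assumptions for illustration. Rounding the shares to actual alphabet sizes is not shown.

```python
import numpy as np

# Hypothetical mapping from strategy name to a variance transform.
TRANSFORMS = {
    "linear_scale": lambda v: v,
    "sqrt_scale": np.sqrt,
    "log_scale": lambda v: np.log2(v + 1),
}


def budget_shares(variances, avg_alphabet_size, method):
    """Unrounded symbols per coefficient; the shares sum to the full budget."""
    scaled = TRANSFORMS[method](np.asarray(variances, dtype=float))
    weights = scaled / scaled.mean()  # mean 1, as in the reviewed diff
    return avg_alphabet_size * weights


variances = [8.0, 4.0, 2.0, 1.0]  # made-up variances, largest first
for method in TRANSFORMS:
    print(method, np.round(budget_shares(variances, 4, method), 2))
```

Because the weights have mean 1, the unrounded shares always add up to the average alphabet size times the word length. In this sketch the linear strategy concentrates the budget on the highest-variance coefficients, while sqrt and log flatten the differences.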

Illustration

Example with Alphabet Sizes [4, 4, 2, 2] and variance-based feature selection:
image

Example

For a word length of 4 with 4 symbols per coefficient, the budget is 16 = 4 * 4:

  • Prior to this PR, the alphabet size had to be the same for every coefficient: [a-d, a-d, a-d, a-d] = [4, 4, 4, 4] = 16
  • Now, the number of symbols is assigned based on importance: [a-h, a-d, a-d, a-b] = [8, 4, 4, 2] = 16

CD-Diagram for (average) alphabet-size 64

image

Experiments

Dynamic assignment is most beneficial for small alphabet sizes. The TLB (tightness of lower bound) results below, where larger is better, show large improvements for average alphabet sizes of 2 to 8.

| Average symbols | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|
| SFA | 37.515 | 56.694 | 69.425 | 77.726 | 82.2309 | 85.6476 | 86.8577 | 87.5971 |
| SFA+Linear | 48.474 | 63.373 | 72.769 | 79.669 | 83.8591 | 86.0971 | 87.1459 | 87.6656 |
| SFA+Log | 46.017 | 60.966 | 72.265 | 79.352 | 83.8492 | 85.9773 | 87.075 | 87.628 |
| SFA+Sqrt | 44.958 | 60.841 | 71.268 | 79.280 | 83.6312 | 86.0426 | 87.1275 | 87.6555 |
| iSAX | 28.025 | 43.014 | 54.823 | 62.948 | 69.5433 | 75.366 | 78.3346 | 80.1139 |
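To make the size of the improvement concrete, the following sketch computes the TLB gain of each dynamic strategy over plain SFA, using only the numbers reported above. The helper name `gain_over` is an assumption for illustration.

```python
avg_symbols = [2, 4, 8, 16, 32, 64, 128, 256]

# TLB values copied from the results reported in this PR description.
tlb = {
    "SFA": [37.515, 56.694, 69.425, 77.726, 82.2309, 85.6476, 86.8577, 87.5971],
    "SFA+Linear": [48.474, 63.373, 72.769, 79.669, 83.8591, 86.0971, 87.1459, 87.6656],
    "SFA+Log": [46.017, 60.966, 72.265, 79.352, 83.8492, 85.9773, 87.075, 87.628],
    "SFA+Sqrt": [44.958, 60.841, 71.268, 79.280, 83.6312, 86.0426, 87.1275, 87.6555],
}


def gain_over(baseline, method):
    """Per-column TLB difference between a method and the baseline."""
    return [round(m - b, 4) for b, m in zip(tlb[baseline], tlb[method])]


for name in ("SFA+Linear", "SFA+Log", "SFA+Sqrt"):
    print(name, dict(zip(avg_symbols, gain_over("SFA", name))))
```

For the linear strategy the gain shrinks steadily as the average alphabet size grows, which matches the observation that small alphabets benefit most.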

@patrickzib patrickzib requested a review from baraline May 26, 2025 12:29
@patrickzib patrickzib self-assigned this May 26, 2025
@patrickzib patrickzib added the similarity search Similarity search package label May 26, 2025
@aeon-actions-bot aeon-actions-bot bot added the enhancement New feature, improvement request or other non-bug code enhancement label May 26, 2025
@aeon-actions-bot
Contributor

Thank you for contributing to aeon

I have added the following labels to this PR based on the title: [ enhancement ].
I would have added the following labels to this PR based on the changes made: [ distances, transformations ], however some package labels are already present.

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Slack channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

  • Run pre-commit checks for all files
  • Run mypy typecheck tests
  • Run all pytest tests and configurations
  • Run all notebook example tests
  • Run numba-disabled codecov tests
  • Stop automatic pre-commit fixes (always disabled for drafts)
  • Disable numba cache loading
  • Push an empty commit to re-run CI checks

@patrickzib patrickzib added the distances Distances package label Jun 4, 2025
Member

@baraline baraline left a comment


Minor comments only, I didn't pick up anything major. Otherwise LGTM.

X_test = zscore(X_test.squeeze(), axis=1)
histogram_type = "equi-width"

# print("Testing")

Leftover comment.

Comment on lines +44 to +48
alphabet_allocation_methods = {
    "linear_scale",
    "log_scale",
    "sqrt_scale",
}

Ideally, the tests would import this set so that they automatically cover any methods added in the future.
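A rough sketch of what this could look like. The set is redefined here only to keep the example self-contained; in the real test module it would be imported from wherever the PR defines it (the module path is not shown in this excerpt). `check_allocation_method` is a hypothetical placeholder for the actual per-method test body.

```python
# Stand-in for the set from the PR; a real test would import it instead.
alphabet_allocation_methods = {"linear_scale", "log_scale", "sqrt_scale"}


def check_allocation_method(method):
    """Hypothetical per-method check; a real test would fit SFA with it."""
    assert isinstance(method, str) and method.endswith("_scale")
    return method


def test_all_allocation_methods():
    # Iterating the shared set means new methods are covered automatically.
    tested = {check_allocation_method(m) for m in sorted(alphabet_allocation_methods)}
    assert tested == alphabet_allocation_methods


test_all_allocation_methods()
```

With pytest, the same set could instead be passed to `pytest.mark.parametrize` (sorted first so that test IDs stay in a stable order).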

    normed_scale = variance / variance.mean()
elif self.alphabet_allocation_method == "log_scale":
    variance = np.log2((self.dft_variance[self.support]) + 1)
    normed_scale = variance / variance.mean()

Minor, but you could compute `normed_scale` once after the if/elif branches, since every branch does it.
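A sketch of the suggested structure, written as a standalone function so that it runs on its own. Only the log branch is visible in the diff excerpt; the linear and sqrt branches, the function name `compute_normed_scale`, and the error handling are assumptions.

```python
import numpy as np


def compute_normed_scale(dft_variance, support, method):
    """Stand-in for the reviewed method body (names are assumptions)."""
    selected = dft_variance[support]
    if method == "linear_scale":
        variance = selected
    elif method == "sqrt_scale":
        variance = np.sqrt(selected)
    elif method == "log_scale":
        variance = np.log2(selected + 1)
    else:
        raise ValueError(f"Unknown alphabet_allocation_method: {method!r}")
    # Shared by every branch, so it is computed once here.
    return variance / variance.mean()


dft_variance = np.array([1.0, 9.0, 4.0, 16.0])  # made-up variances
support = np.array([3, 1])  # made-up indices of the selected coefficients
print(compute_normed_scale(dft_variance, support, "log_scale"))
```

Each branch then only chooses the transform, and the normalization lives in one place.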
