Add abnormal spacing detection for spam messages #206

Merged: 2 commits, Dec 29, 2024

umputun (Owner) commented Dec 26, 2024

Introduced a new checker to detect abnormal spacing patterns in messages, a common spam evasion technique. It calculates both the ratio of short words to total words and the ratio of spaces to all characters in the message.

The goal is to detect words broken up with spaces, e.g. "Th is an exa mple of such cr ap". The checker is configured with the following set of options and is disabled by default:

	AbnormalWords struct {
		Enabled                 bool    `long:"enabled" env:"ENABLED" description:"enable abnormal words check"`
		SpaceRatioThreshold     float64 `long:"ratio" env:"RATIO" default:"0.3" description:"the ratio of spaces to all characters in the message"`
		ShortWordThreshold      int     `long:"short-word" env:"SHORT_WORD" default:"3" description:"the length of the word to be considered short"`
		ShortWordRatioThreshold float64 `long:"short-ratio" env:"SHORT_RATIO" default:"0.7" description:"the ratio of short words to all words in the message"`
	} `group:"space" namespace:"space" env-namespace:"SPACE"`
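For illustration, here is a minimal sketch of how the two ratios could be computed and compared against the thresholds above. It is not the code from this PR: the names `spacingConfig`, `spacingStats`, `measureSpacing`, and `isAbnormalSpacing` are hypothetical, and two details are assumptions that the description does not state: that a word counts as short when its length is at most `ShortWordThreshold`, and that exceeding either threshold is enough to flag the message.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// spacingConfig mirrors the thresholds of the AbnormalWords options.
// Hypothetical type, used only for this sketch.
type spacingConfig struct {
	SpaceRatioThreshold     float64 // spaces / all characters
	ShortWordThreshold      int     // max length of a "short" word (assumed inclusive)
	ShortWordRatioThreshold float64 // short words / all words
}

// spacingStats holds the two ratios described in the PR.
type spacingStats struct {
	SpaceRatio     float64
	ShortWordRatio float64
}

// measureSpacing counts whitespace runes against all runes and
// short words against all whitespace-separated words.
func measureSpacing(msg string, shortWordLen int) spacingStats {
	total, spaces := 0, 0
	for _, r := range msg {
		total++
		if unicode.IsSpace(r) {
			spaces++
		}
	}
	words := strings.Fields(msg)
	if total == 0 || len(words) == 0 {
		return spacingStats{} // empty or whitespace-only message
	}
	short := 0
	for _, w := range words {
		if utf8.RuneCountInString(w) <= shortWordLen {
			short++
		}
	}
	return spacingStats{
		SpaceRatio:     float64(spaces) / float64(total),
		ShortWordRatio: float64(short) / float64(len(words)),
	}
}

// isAbnormalSpacing flags a message when either ratio exceeds its threshold.
// Combining the two checks with OR is an assumption of this sketch.
func isAbnormalSpacing(msg string, cfg spacingConfig) bool {
	st := measureSpacing(msg, cfg.ShortWordThreshold)
	return st.SpaceRatio > cfg.SpaceRatioThreshold ||
		st.ShortWordRatio > cfg.ShortWordRatioThreshold
}

func main() {
	// Defaults taken from the struct tags above.
	cfg := spacingConfig{SpaceRatioThreshold: 0.3, ShortWordThreshold: 3, ShortWordRatioThreshold: 0.7}
	msg := "Th is an exa mple of such cr ap"
	fmt.Printf("%+v flagged=%v\n", measureSpacing(msg, cfg.ShortWordThreshold), isAbnormalSpacing(msg, cfg))
}
```

Under these assumptions, the sample message has 7 short words out of 9 (about 0.78, above the 0.7 default) and 8 spaces out of 31 characters (about 0.26, below the 0.3 default), so the short-word ratio alone flags it. If both thresholds had to be exceeded, the same message would pass with the default values.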

Added configuration options and tests for the short-word and space-ratio thresholds. Updated the README and application code to document and support the new functionality.

cloudflare-workers-and-pages bot commented Dec 26, 2024

Deploying tg-spam with Cloudflare Pages

Latest commit: b92d2c8
Status: ✅  Deploy successful!
Preview URL: https://a4642519.tg-spam.pages.dev
Branch Preview URL: https://abnormal-spacing.tg-spam.pages.dev

@umputun umputun requested a review from paskal December 26, 2024 19:00
@@ -137,6 +137,12 @@ This option is disabled by default. If `--meta.forward` set or `env:META_FORWARD

Using words that mix characters from multiple languages is a common spam technique. To detect such messages, the bot can check the message for the presence of such words. This option is disabled by default and can be enabled with the `--multi-lang=, [$MULTI_LANG]` parameter. Setting it to a number above `0` will enable this check, and the bot will mark the message as spam if it contains words with characters from more than one language in more than the specified number of words.

**Abnormal spacing check**
Collaborator

"ExcessiveSpacing" might be a better name here and in the code.

@umputun umputun merged commit 2eaa5d7 into master Dec 29, 2024
5 checks passed
@umputun umputun deleted the abnormal-spacing branch December 29, 2024 02:33

2 participants