-
Notifications
You must be signed in to change notification settings - Fork 118
Fix domain_is_in? for deeply nested subdomain
#270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
domain_is_in? for deeply nested subdomain
spec/address_spec.rb
Outdated
| it "calls the the MX servers lookup" do | ||
| it "calls the the MX servers lookup" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been automatically fixed by my code editor. Removing unnecessary spaces.
| return true if domain_list.include?(address_domain) | ||
|
|
||
| i = address_domain.index('.') | ||
| return false unless i |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmhmmm, i guess it has always been truethy? i mean domains typically contain a dot. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@micke was it on purpose, that in any case, the TLD so com from example.com was checked against the list? i guess we can speed up the lib significantly by getting rid of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me as well, the TLD check seems unnecessary.
Shall I remove the TLD check while we have this opportunity? I think it's possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation might look something like this:
domain_hierarchy_level = 0
while (i = address_domain.index('.')) && domain_hierarchy_level < @domain_hierarchy_max_depth
address_domain = address_domain[(i + 1)..-1]
# Skip checking TLD (if there's no more dot, it's considered as TLD)
break unless address_domain.include?('.')
return true if domain_list.include?(address_domain)
domain_hierarchy_level += 1
endThis implementation includes this discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have not tested this thoroughly, but maybe you get the basic idea here and can finish it off
def domain_is_in?(domain_list, max_depth = 3)
address_domain = address.domain.downcase
return true if domain_list.include?(address_domain)
tokens = address_domain.split('.')
return false if tokens.length < 3
limited_sub_domain_part = tokens.reverse.first(max_depth).reverse.join('.')
domain_list.include?(limited_sub_domain_part)
endthere is no iteration over parts, so this is faster and we get rid of the while.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so there are such cases. If that's the case, the current implementation that allows users to select max_depth optionally doesn't seem like a bad idea (Default is 3 ).
I'll wait for your judgment regarding the current implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've only recently joined as a maintainer, so i'd wait a bit for @micke to add some context information before changing any of the semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll also wait for @micke . Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it was introduced in #137, i honestly don't know why it's there 😅 This sounds good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lib/valid_email2/address.rb
Outdated
|
|
||
| i = address_domain.index('.') | ||
| return false unless i | ||
| while i = address_domain.index('.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've seen so many dead systems because of a broken while that i would try to get rid of it in any library implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it would be good if this was actually a configurable option like max_depth = 3 on the domain so subdomains deeper that x would just be ignored and the rest would need to be configured individually in the disposable list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, I understand your concern. I think it's possible.
How about an interface like this?
- def initialize(address, dns_timeout = 5, dns_nameserver = nil)
+ def initialize(address, dns_timeout = 5, dns_nameserver = nil, domain_hierarchy_max_depth: 3)
@parse_error = false
@raw_address = address
@dns_timeout = dns_timeout
@dns_nameserver = dns_nameserver
+ @domain_hierarchy_max_depth = domain_hierarchy_max_depthThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation might look something like this:
domain_hierarchy_level = 0
while i = address_domain.index('.') && domain_hierarchy_level < @domain_hierarchy_max_depth
address_domain = address_domain[(i + 1)..-1]
return true if domain_list.include?(address_domain)
domain_hierarchy_level += 1
end|
closing this in favor of #278 |
|
@phoet Thank you! |
@micke
Hi,
Thank you for reviewing and merging my previous PR.
I found that there are some email address patterns that cannot be detected when using
domain_is_in?indisposable_domain?.For example, when
disposable_domain.comis included in the blocklist:foo@disposable_domain.comfoo@sub.disposable_domain.comcan be detected, but:
foo@sub2.sub1.disposable_domain.comwith two or more nested subdomains cannot be detected.
Therefore, I modified
domain_is_in?to detect deeply nested subdomains as well.I've also added RSpec test cases for this.
Thanks.