Reliably identify a relevant sitename when configuring new sites #69

Open
jmorgannz opened this issue Nov 28, 2021 · 0 comments
jmorgannz commented Nov 28, 2021

This is a separate feature request, opened after discussion of the subject in PR #68.

Example domain name:

  • subdomaina.subdomainb.website.com.tw

Discussion to date has yielded the following:

  1. Identify the public suffix of the domain (com.tw in the example).
  2. Identify the sitename as the first domain segment left of the public suffix, together with the public suffix itself (website.com.tw).
  3. Ignore the 0-n subdomains left of the sitename (subdomaina and subdomainb).
  4. Where the user wants the sitename to be a subdomain instead, accept a manual override.

Methods discussed:

  1. Use a limited, pre-defined list of public suffixes, augmented by user-configurable overrides (as in Treat as same site #68).
  2. Source and load a comprehensive, fixed list of public suffixes, requiring no user configuration or override.

Option 2 is currently favoured: import a copy of the Public Suffix List together with a system to read and use it (see the sketch below).
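For illustration only, here is a minimal sketch of how option 2 could derive the sitename, assuming the public suffixes have already been loaded into a set. The names `publicSuffixes` and `deriveSitename` are hypothetical, and the wildcard/exception rules of the real Public Suffix List are ignored for brevity:

```typescript
// Hypothetical sketch: derive the sitename from a hostname using a set of
// public suffixes (in practice loaded from the Public Suffix List).
const publicSuffixes = new Set<string>(["com", "tw", "com.tw", "uk", "co.uk"]);

function deriveSitename(hostname: string): string {
  const labels = hostname.toLowerCase().split(".");
  // Scan from the full name down, so the first match is the longest suffix.
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join(".");
    if (publicSuffixes.has(candidate)) {
      // Sitename = one label left of the public suffix, plus the suffix itself.
      if (i === 0) return hostname; // the hostname is itself a public suffix
      return labels.slice(i - 1).join(".");
    }
  }
  // No suffix matched: fall back to the last two labels.
  return labels.slice(-2).join(".");
}

// Subdomains left of the sitename are ignored:
console.log(deriveSitename("subdomaina.subdomainb.website.com.tw")); // "website.com.tw"
console.log(deriveSitename("www.amazon.co.uk"));                     // "amazon.co.uk"
```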

ttyridal added a commit that referenced this issue Dec 16, 2021
When suggesting a sitename we try to find the "significant"
part of the URL. For www.google.com that would be google.com,
but just keeping the last two parts (or removing the first one)
fails too often; amazon.co.uk is one example.

Further, each TLD has its own policy here, so an algorithmic
approach is bound to fail. https://publicsuffix.org/ tries
to gather all possible SLDs. It might not be perfect, but it is
better than what we have (hardcoding a couple of patterns like
(com|edu|co).*).

The list is rather large, but with some clever(?) tricks
we can get it down to an acceptable size:

Going a bit crazy here. Browsers don't support gzip/deflate data yet
(waiting for the Compression Streams API), and other compression
schemes for which reasonable libraries are available simply don't cut
it on compression ratio.

In the meantime, PNG is lossless and deflate-compressed -
exactly what we need :)  So this patch pre-processes the PSL
list for easy lookup (removing a lot of redundant text) and
exports the result as a JSON dictionary.

This is then converted to a PNG by ImageMagick.
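The exact pre-processing in the patch is not reproduced here; the following is a minimal sketch of the general idea, assuming a Node script and a local copy of public_suffix_list.dat. File names and the output shape are assumptions for illustration:

```typescript
// Sketch only (not the actual patch): turn a local copy of
// public_suffix_list.dat into a JSON dictionary keyed by suffix rule.
import { readFileSync, writeFileSync } from "fs";

const raw = readFileSync("public_suffix_list.dat", "utf8");

const dict: Record<string, 1> = {};
for (const line of raw.split("\n")) {
  const rule = line.trim();
  // Skip blank lines and comments; keep wildcard/exception rules as-is.
  if (rule === "" || rule.startsWith("//")) continue;
  dict[rule] = 1;
}

writeFileSync("psl.json", JSON.stringify(dict));
// The patch then converts this JSON into a PNG (e.g. with ImageMagick),
// relying on PNG's deflate compression to shrink it considerably.
```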

The browser loads the image, we read the pixel values, and we end
up with the desired JSON dict.

Issue #69
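For reference, a sketch of what the decode step described in the commit message could look like in the extension. This is an assumption-laden illustration, not the patch itself: it assumes the JSON text was packed into the RGB channels of a fully opaque image and NUL-padded to fill the last pixel, and `loadPslDict` is a hypothetical name:

```typescript
// Sketch only: recover the JSON dictionary from the PNG in the browser.
async function loadPslDict(url: string): Promise<Record<string, 1>> {
  const img = new Image();
  img.src = url;
  await img.decode(); // wait until the PNG is loaded and decoded

  // Draw the image to a canvas so we can read its pixel data.
  const canvas = document.createElement("canvas");
  canvas.width = img.width;
  canvas.height = img.height;
  const ctx = canvas.getContext("2d")!;
  ctx.drawImage(img, 0, 0);

  const { data } = ctx.getImageData(0, 0, img.width, img.height);
  const bytes: number[] = [];
  for (let i = 0; i < data.length; i += 4) {
    bytes.push(data[i], data[i + 1], data[i + 2]); // skip the alpha channel
  }

  // Strip the padding and turn the bytes back into the JSON string.
  const text = new TextDecoder()
    .decode(Uint8Array.from(bytes))
    .replace(/\0+$/, "");
  return JSON.parse(text);
}
```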