get_tld(), get_components(), and other #17

Open
wants to merge 6 commits into
base: master
106 changes: 106 additions & 0 deletions UseCases.md
@@ -0,0 +1,106 @@
This file describes the exact behavior of the methods in different edge cases
and explains the general logic. It covers the behavior of get_tld,
get_tld_unsafe, get_sld, get_sld_unsafe, split_domain and split_domain_unsafe.

Contributor

@vadym-t suggest to add here a why sentence:
Unsafe versions of the methods will significantly save resources on large-scale applications of the library where the data has already been converted to lowercase and missing data has a None value. This can be done in Spark/Dask, for example, and result in a significant reduction in computational resources. For adhoc usage, the original functions are sufficient.

Author

Added

The unsafe versions of the methods save significant resources in large-scale
applications of the library where the data has already been converted to
lowercase and missing data has a None value. This preparation can be done in
Spark or Dask, for example, so the per-call checks do not have to be repeated.
For ad hoc usage, the original functions are sufficient.
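
A minimal sketch of this division of labor, not the library's code: `get_tld_sketch` and `get_tld_unsafe_sketch` are hypothetical names, the body of the unsafe function is only a placeholder (last label, non-strict mode), and returning None for a None input is an assumption.

```python
def get_tld_unsafe_sketch(fqdn):
    """Hypothetical stand-in for an unsafe method.

    Trusts the caller: fqdn must be a lowercase str, never None.
    The body (last label, non-strict mode) is only a placeholder.
    """
    return fqdn.rsplit('.', 1)[-1] or None


def get_tld_sketch(fqdn):
    """Hypothetical stand-in for a safe method: check, normalize, delegate."""
    if fqdn is None:  # assumption: a missing input maps to None
        return None
    return get_tld_unsafe_sketch(fqdn.lower())


# Ad hoc usage: the safe wrapper tolerates raw input.
raw = ['WWW.Example.COM', None, 'example.org']
safe_results = [get_tld_sketch(name) for name in raw]

# Bulk usage: normalize once up front (in Spark or Dask this would be a
# column-level step), then call the unsafe variant on clean data only.
clean = [name.lower() for name in raw if name is not None]
unsafe_results = [get_tld_unsafe_sketch(name) for name in clean]
```

In the bulk path the None check and the lowercasing happen once per dataset rather than once per call, which is where the savings would come from.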

1. General difference between the get_*() and get_*_unsafe() methods:
get_*_unsafe() does not check whether the input string is None and does not
convert it to lowercase.

2. The methods listed above work only with non-canonical FQDN strings -
Contributor

listed above means all or just the unsafe methods?

the trailing dot must be removed before calling the method. This restriction
eliminates fuzzy logic in edge cases.

3. DNS does not support empty labels - if any processed label is found to be
empty, None is returned.

4. Every method processes the provided FQDN in reverse order, from the last
label towards the start of the string. It stops as soon as its specific task is
completed, so no validation occurs outside of this scope. For example,
```
get_tld('......com') -> 'com'
```
as leading dots are not processed.
The split_domain method is based on the get_sld method: it returns everything
in front of the get_sld() result as a prefix.
For the example above:
Contributor

@vadym-t you might also add here a non-edge case example. the split_domain() method offers a new capability to the library-- one that folks might get from other libraries -- but your only example is the edge. suggest something like:
split_domain allows you to recover the host, or prefix, of an SLD, for use in aggregation or analysis based on the labels. e.g., split_domain('www.googl.com')

```
split_domain('......com') -> ('....',None,'com')
```
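
The right-to-left processing in point 4 can be sketched roughly as follows. This is a toy model, not the library's implementation: `labels_from_right`, `tld_sketch` and `split_sketch` are hypothetical names, no suffix list is consulted, and the values returned for names with fewer than two dots are choices of the sketch. It reproduces only the two `'......com'` examples above; what the library returns as the middle element for a regular name is not modeled here.

```python
from itertools import islice


def labels_from_right(fqdn):
    """Lazily yield (start_index, label) pairs, rightmost label first."""
    end = len(fqdn)
    while end >= 0:
        dot = fqdn.rfind('.', 0, end)  # -1 when no dot is left
        yield dot + 1, fqdn[dot + 1:end]
        end = dot  # continue to the left of that dot


def tld_sketch(fqdn):
    # Consumes exactly one label; nothing to its left is ever inspected.
    _, label = next(labels_from_right(fqdn))
    return label or None  # an empty last label gives None


def split_sketch(fqdn):
    # Consumes at most two labels; the rest is returned verbatim as prefix.
    found = list(islice(labels_from_right(fqdn), 2))
    tld = found[0][1] or None
    if len(found) < 2:
        return None, None, tld
    start, second = found[1]
    prefix = fqdn[:start - 1] if start > 0 else None
    return prefix, second or None, tld


print(tld_sketch('......com'))    # 'com'
print(split_sketch('......com'))  # ('....', None, 'com')
```

Because the generator is lazy, the leading dots in `'......com'` are never visited by `tld_sketch`, which is why they do not trigger the empty-label rule.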
Edge cases and expected behavior
The behavior of the library is best illustrated with small examples
(boolean arguments are omitted if they do not affect the behavior):

Contributor

Suggested additional:
The library allows you to create a public suffix list from any file-like object, including a list. In the examples below, we construct the PSL with different lists to demonstrate the functionality of the library under different conditions. In particular, the examples show an empty list, list of two elements, and the use of negation in a list. Combinations of the method parameters are given with notes about the result. These results follow the logic of the Mozilla library as published on the psl.org website.

## get_tld()
### Degenerate case (empty list)

| input | strict | wildcard | result | notes |
|--------|---------|----------|--------|-------|
| '' | | | None | empty labels are not allowed |
| '.' | | | None | empty labels are not allowed |
| '..' | | | None | empty labels are not allowed |
| '....' | | | None | empty labels are not allowed |
| 'abc' | false | | 'abc' | non-strict mode, the last label is TLD |
| 'abc' | true | | None | 'abc' not in the list |
| '.abc' | false | | 'abc' | non-strict mode, the last label is TLD |
| '.abc' | true | | None | 'abc' not in the list |
| 'abc.' | | | None | empty labels are not allowed |
| '....abc' | false | | 'abc' | non-strict mode, string head is not processed |
| '....abc' | true | | None | 'abc' not in the list |
| 'example.abc' | false | | 'abc' | non-strict mode, the last label is TLD |
| 'example.abc' | true | | None | 'abc' not in the list |

### Simple case, no wildcards (['com'])

| input | strict | wildcard | result | notes |
|--------|---------|----------|--------|-------|
| '' | | | None | empty labels are not allowed |
| '.' | | | None | empty labels are not allowed |
| '..' | | | None | empty labels are not allowed |
| '....' | | | None | empty labels are not allowed |
| 'abc' | false | | 'abc' | non-strict mode |
| 'abc' | true | | None | not in the list |
| 'com' | | | 'com' | allowed TLD |
| '.abc' | false | | 'abc' | non-strict mode |
| '.abc' | true | | None | not in the list |
| '.com' | | | 'com' | allowed TLD |
| 'abc.' | | | None | empty labels are not allowed |
| '....abc' | false | | 'abc' | non-strict mode, string head is not processed |
| '....abc' | true | | None | not in the list |
| '....com' | | | 'com' | allowed TLD, string head is not processed |
| 'example.abc' | false | | 'abc' | non-strict mode, the last label is TLD |
| 'example.abc' | true | | None | 'abc' not in the list |
| 'example.com' | | | 'com' | allowed TLD |

### Simple case, negation, no wildcards (['com', '!org'])

| input | strict | wildcard | result | notes |
|--------|---------|----------|--------|-------|
| '' | | | None | empty labels are not allowed |
| '.' | | | None | empty labels are not allowed |
| '..' | | | None | empty labels are not allowed |
| '....' | | | None | empty labels are not allowed |
| 'abc' | false | | 'abc' | non-strict mode |
| 'abc' | true | | None | not in the list |
| 'com' | | | 'com' | allowed TLD |
| 'org' | | | None | not allowed TLD |
| '.abc' | false | | 'abc' | non-strict mode |
| '.abc' | true | | None | not in the list |
| '.com' | | | 'com' | allowed TLD |
| '.org' | | | None | not allowed TLD |
| 'abc.' | | | None | empty labels are not allowed |
| 'com.' | | | None | empty labels are not allowed |
| 'org.' | | | None | empty labels are not allowed |
| '....abc' | false | | 'abc' | non-strict mode, string head is not processed |
| '....abc' | true | | None | not in the list |
| '....com' | | | 'com' | allowed TLD, string head is not processed |
| '....org' | | | None | not allowed TLD |
| 'example.abc' | false | | 'abc' | non-strict mode, the last label is TLD |
| 'example.abc' | true | | None | 'abc' not in the list |
| 'example.com' | | | 'com' | allowed TLD |
| 'example.org' | | | None | not allowed TLD |
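
The decision logic behind the three get_tld() tables can be condensed into a small table-driven check. This is a sketch under stated assumptions, not the library's code: `get_tld_sketch` is a hypothetical function that handles only single-label rules, ignores the wildcard argument, and takes the rule list as a plain Python list.

```python
def get_tld_sketch(fqdn, rules, strict=False):
    """Toy model of the tables above: single-label rules, no wildcards."""
    if fqdn is None:
        return None
    last = fqdn.lower().rsplit('.', 1)[-1]  # only the last label is examined
    if not last:
        return None  # empty labels are not allowed
    if '!' + last in rules:
        return None  # negated entry: not an allowed TLD
    if last in rules:
        return last  # allowed TLD
    return None if strict else last  # unknown label: depends on strict


# (rules, input, strict, expected result), rows taken from the tables above
ROWS = [
    ([], '', False, None),
    ([], 'abc', False, 'abc'),
    ([], 'abc', True, None),
    ([], '....abc', False, 'abc'),
    (['com'], '.com', True, 'com'),
    (['com'], 'abc.', False, None),
    (['com'], 'example.abc', True, None),
    (['com', '!org'], 'org', False, None),
    (['com', '!org'], 'com.', False, None),
    (['com', '!org'], 'example.org', False, None),
    (['com', '!org'], 'example.com', True, 'com'),
]

for rules, name, strict, expected in ROWS:
    assert get_tld_sketch(name, rules, strict) == expected, (rules, name, strict)
```

Rows whose strict column is blank in the tables appear here with an arbitrary strict value, since the result does not depend on it.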