Commit 9cd5417

KnitCode committed
implements an IDNA-encoded version of publicsuffix2.
includes additional functionality for strict checks, ignoring wildcards, and finding eTLD only. maps main function of get_public_suffix() to get_sld() for clarity.
1 parent e78e8a9 commit 9cd5417

6 files changed, +302 -37 lines changed

MANIFEST.in

Lines changed: 1 addition & 0 deletions
@@ -2,5 +2,6 @@ graft src
 
 include CHANGELOG.rst
 include README.rst
+include publicsuffix2.LICENSE
 
 global-exclude *.py[co] __pycache__ *.so *.pyd

README.rst

Lines changed: 88 additions & 6 deletions
@@ -1,21 +1,37 @@
 Public Suffix List module for Python
 ====================================
 
-This module allows you to get the public suffix of a domain name using the
-Public Suffix List from http://publicsuffix.org
+This module allows you to get the public suffix, as well as the registrable domain,
+of a domain name using the Public Suffix List from http://publicsuffix.org
 
 A public suffix is a domain suffix under which you can register domain
-names. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us".
+names. It is sometimes referred to as the extended TLD (eTLD).
+Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us".
 Accurately knowing the public suffix of a domain is useful when handling
 web browser cookies, highlighting the most important part of a domain name
 in a user interface or sorting URLs by web site.
 
+This module builds the public suffix list as a Trie structure, making it more efficient
+than other string-based modules available for the same purpose. It can be used
+effectively in large-scale distributed environments, such as PySpark.
+
 This Python module includes with a copy of the Public Suffix List so that it is
 usable out of the box. Newer versions try to provide reasonably fresh copies of
 this list. It also includes a convenience method to fetch the latest list.
 
-The code is a fork of the publicsuffix package and uses the same base API.
-You just need to import publicsuffix2 instead
+The code is a fork of the publicsuffix2 package and includes the same base API. In
+addition, it contains a few variants useful for certain use cases, such as the option to
+ignore wildcards or return only the extended TLD (eTLD).
+Publicsuffix2 is a an extension of publicsuffix, and uses the same base API.
+You just need to import publicsuffix2 instead.
+
+The public suffix list is now provided in UTF-8 format. To correctly process
+IDNA-encoded domains, either the query or the list must be converted. This module
+contains the option to IDNA-encode the public suffix list upon creating the Trie; this
+is set to happen by default. If your use case includes UTF-8 domains, e.g., '食狮.com.cn',
+you'll need to set the IDNA-encoding flag to False on instantiation (see examples below).
+Failure to use the correct encoding for your use case can lead to incorrect results for
+domains that utilize unicode characters.
 
 The code is MIT-licensed and the publicsuffix data list is MPL-2.0-licensed.
 
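
The README text added in this hunk says that either the query or the list must be converted before IDNA-encoded and UTF-8 names can be compared. A minimal sketch of that conversion, using only Python's built-in `idna` codec, might look like the following. The helper names `to_idna` and `to_unicode` are hypothetical and are not part of publicsuffix2; the built-in codec follows the older IDNA 2003 rules, so it may not match whatever conversion the fork performs internally.

```python
def to_idna(domain):
    """Return the ASCII (punycode) form of a domain name (hypothetical helper)."""
    return domain.encode('idna').decode('ascii')


def to_unicode(domain):
    """Return the Unicode form of an IDNA-encoded domain name (hypothetical helper)."""
    return domain.encode('ascii').decode('idna')


utf8_domain = '食狮.com.cn'
ascii_domain = to_idna(utf8_domain)

# The first label becomes an "xn--" punycode label; ASCII labels are unchanged.
print(ascii_domain)

# Converting back recovers the original UTF-8 form.
print(to_unicode(ascii_domain) == utf8_domain)
```

Converting the query with a helper like `to_idna` before a lookup is one way to keep a single IDNA-encoded list; the alternative described in the README is to build the list without IDNA encoding and query it with UTF-8 names directly.
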
@@ -31,6 +47,10 @@ The code is MIT-licensed and the publicsuffix data list is MPL-2.0-licensed.
 Usage
 -----
 
+To install from source, first build the package file:
+python setup.py build sdist
+and then pip install from the dist directory.
+
 Install with::
 
 pip install publicsuffix2
@@ -103,6 +123,63 @@ You can use it this way::
 Note that the once loaded, the data file is cached and therefore fetched only
 once.
 
+If using this library in large-scale pyspark processing, you should instantiate the class as
+a global variable, not within a user function. The class methods can then be used within user
+functions for distributed processing.
+
+Changes in this Fork
+--------------------
+
+This fork of publicsuffix2 addresses a change in the format to the standard public suffix list,
+which was previously IDNA-encoded and now is in UTF-8 format, as well as some additional
+functionality useful to certain use cases. These additions include the ability to ignore
+wildcards and to require strict adherence to the TLDs included in the list. Lastly, we include
+some convenience functions for obtaining only the extended TLD (eTLD) rather than the
+registrable domain (SLD). These are outlined below.
+
+IDNA-encoding. The public suffix list is now provided in UTF-8 format. For those use cases that
+include IDNA-encoded domains, the module will not return accurate results unless the list is
+converted. In this fork, IDNA encoding is included as a parameter in the class and is on by
+default.::
+
+>>> from publicsuffix2 import PublicSuffixList
+>>> psl = PublicSuffixList(idna=True) # on by default
+>>> psl.get_public_suffix('www.google.com')
+'google.com'
+>>> psl = PublicSuffixList(idna=False) # use UTF-8 encodings
+>>> psl.get_public_suffix('食狮.com.cn')
+'食狮.com.cn'
+
+Ignore wildcards. In some use cases, particularly those related to large-scale domain processing,
+the user might want to ignore wildcards to create more aggregation. This is possible by setting
+the parameter wildcard=False.
+
+Require valid eTLDs (strict). In the publicsuffix2 module, a domain with an invalid TLD will still return
+a public suffix, e.g,::
+
+>>> psl.get_public_suffix('www.mine.local')
+'mine.local'
+
+
+This is useful for many use cases, while in others, we want to ensure that the domain includes a
+valid eTLD. In this case, the boolean parameter strict provides a solution. If this flag is set,
+an invalid TLD will return None.::
+
+>>> psl.get_public_suffix('www.mine.local', strict=True) is None
+True
+
+Return eTLD only. The standard use case for publicsuffix2 is to return the registrable domain
+according to the public suffix list. In some cases, however, we only wish to find the eTLD
+itself. In this fork, this is available via the get_tld() method.::
+
+>>> psl.get_tld('www.google.com')
+'com'
+
+All of the methods and functions include the wildcard and strict parameters.
+
+For convenience, the public method get_sld() is available. This is identical to the method
+get_public_suffix() and is intended to clarify the output for some users.
+
 
 Source
 ------
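
The hunk above describes a Trie-backed lookup with `wildcard` and `strict` options and separate `get_tld()` and `get_sld()` results. The following is a minimal, self-contained sketch of how such a lookup could work, not the fork's actual implementation. It assumes a toy four-rule list rather than the real Public Suffix List, uses module-level functions (`build_trie`, `get_tld`, `get_sld`) that only borrow the README's method names, and omits exception rules (`!`) entirely. How the real module treats a wildcard rule when `wildcard=False` is also an assumption here.

```python
END = object()  # marker key: a suffix rule ends at this node


def build_trie(rules):
    """Store each rule label by label, right to left, in nested dicts."""
    root = {}
    for rule in rules:
        node = root
        for label in reversed(rule.split('.')):
            node = node.setdefault(label, {})
        node[END] = True
    return root


def get_tld(trie, domain, wildcard=True, strict=False):
    """Return the longest matching suffix, or None if strict and nothing matches."""
    labels = domain.lower().strip('.').split('.')
    node, matched = trie, 0
    for depth, label in enumerate(reversed(labels), start=1):
        if label in node:
            node = node[label]
        elif wildcard and '*' in node:
            node = node['*']
        else:
            break
        if END in node:
            matched = depth
    if matched == 0:
        if strict:
            return None
        matched = 1  # non-strict fallback: treat the last label as the suffix
    return '.'.join(labels[-matched:])


def get_sld(trie, domain, wildcard=True, strict=False):
    """Return the suffix plus one more label (the registrable domain)."""
    tld = get_tld(trie, domain, wildcard, strict)
    if tld is None:
        return None
    labels = domain.lower().strip('.').split('.')
    keep = min(len(labels), tld.count('.') + 2)
    return '.'.join(labels[-keep:])


# Toy rule list for illustration only; the real list is far larger.
trie = build_trie(['com', 'uk', 'co.uk', '*.ck'])

print(get_tld(trie, 'www.google.com'))               # com
print(get_sld(trie, 'www.google.com'))               # google.com
print(get_sld(trie, 'www.mine.local'))               # mine.local
print(get_sld(trie, 'www.mine.local', strict=True))  # None
print(get_tld(trie, 'a.b.ck'))                       # b.ck
print(get_tld(trie, 'a.b.ck', wildcard=False))       # ck
```

In this sketch, ignoring wildcards collapses every name under `ck` onto the single suffix `ck`, which is the kind of extra aggregation the README mentions. Each lookup walks at most one trie node per label, so its cost depends on the depth of the name rather than on the number of rules.
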
@@ -116,7 +193,11 @@ branch::
 
 History
 -------
-This code is forked from Tomaž Šolc's fork of David Wilson's code originally at:
+This code is forked from NexB's fork of Tomaž Šolc's fork of David Wilson's code.
+
+The original publicsuffix2 code is Copyright (c) 2015 nexB Inc.
+
+David Wilson's code originally at:
 
 https://www.tablix.org/~avian/git/publicsuffix.git
 
@@ -138,6 +219,7 @@ License
 The code is MIT-licensed.
 The vendored public suffix list data from Mozilla is under the MPL-2.0.
 
+Copyright (c) 2019 Renée Burton
 
 Copyright (c) 2015 nexB Inc.
 
publicsuffix2.LICENSE

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+Copyright (c) 2019 Renée Burton
+This code is based on nexB Inc. code found at:
+https://www.github.com/nexB/python-publicsuffix2
+
 Copyright (c) 2015 nexB Inc.
 This code is based on Tomaž Šolc fork of David Wilson code originally at
 https://www.tablix.org/~avian/git/publicsuffix.git

setup.py

Lines changed: 4 additions & 4 deletions
@@ -81,14 +81,14 @@ def run(self):
 
 setup(
     name='publicsuffix2',
-    version='2.20190205',
+    version='2.20190328',
     license='MIT and MPL-2.0',
     description='Get a public suffix for a domain name using the Public Suffix '
                 'List. Forked from and using the same API as the publicsuffix package.',
     long_description='%s\n%s' % (read('README.rst'), read('CHANGELOG.rst')),
-    author='nexB Inc., Tomaz Solc and David Wilson',
-    author_email='info@nexb.com',
-    url='https://github.com/nexB/python-publicsuffix2',
+    author='Renée Burton, nexB Inc., Tomaz Solc and David Wilson',
+    author_email='',
+    url='https://github.com/KnitCode/python-publicsuffix2',
     packages=find_packages('src'),
     package_dir={'': 'src'},
     py_modules=[splitext(basename(path))[0] for path in glob('src/*.py')],
