11Public Suffix List module for Python
22====================================
33
4- This module allows you to get the public suffix of a domain name using the
5- Public Suffix List from http://publicsuffix.org
4+ This module allows you to get the public suffix, as well as the registrable domain,
5+ of a domain name using the Public Suffix List from http://publicsuffix.org
66
77A public suffix is a domain suffix under which you can register domain
8- names. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us".
8+ names. It is sometimes referred to as the extended TLD (eTLD).
9+ Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us".
910Accurately knowing the public suffix of a domain is useful when handling
1011web browser cookies, highlighting the most important part of a domain name
1112in a user interface or sorting URLs by web site.
1213
14+ This module builds the public suffix list as a Trie structure, making it more efficient
15+ than other string-based modules available for the same purpose. It can be used
16+ effectively in large-scale distributed environments, such as PySpark.
17+
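To illustrate why a trie suits suffix lookup, here is a minimal sketch of a label trie walked from the TLD inward. It is not this module's implementation: the function names, the ``"$"`` end-of-rule marker, and the hand-picked ``RULES`` list are all assumptions made for the example, and wildcard and exception rules are left out.

```python
def build_trie(rules):
    """Build a nested-dict trie keyed by domain labels, from the TLD inward."""
    root = {}
    for rule in rules:
        node = root
        for label in reversed(rule.split(".")):
            node = node.setdefault(label, {})
        node["$"] = True  # hypothetical marker: a complete rule ends here
    return root


def public_suffix(domain, trie):
    """Return the longest rule matching the rightmost labels, or None."""
    labels = domain.lower().strip(".").split(".")
    node = trie
    matched = 0
    for depth, label in enumerate(reversed(labels), start=1):
        node = node.get(label)
        if node is None:
            break
        if "$" in node:
            matched = depth  # remember the deepest complete rule seen so far
    if matched == 0:
        return None
    return ".".join(labels[-matched:])


def registrable_domain(domain, trie):
    """Return the public suffix plus one more label, or None if unavailable."""
    labels = domain.lower().strip(".").split(".")
    suffix = public_suffix(domain, trie)
    if suffix is None:
        return None
    n = len(suffix.split("."))
    if len(labels) <= n:
        return None  # the domain is itself a public suffix
    return ".".join(labels[-(n + 1):])


# Toy rule set for illustration only; the real list is far larger.
RULES = ["com", "uk", "co.uk", "us", "wy.us", "k12.wy.us", "pvt.k12.wy.us"]
TRIE = build_trie(RULES)

print(public_suffix("www.example.co.uk", TRIE))
print(registrable_domain("www.example.co.uk", TRIE))
```

Each lookup touches at most one trie node per label of the query, regardless of how many rules the list contains, which is the property the paragraph above relies on.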
1318This Python module includes a copy of the Public Suffix List so that it is
1419usable out of the box. Newer versions try to provide reasonably fresh copies of
1520this list. It also includes a convenience method to fetch the latest list.
1621
17- The code is a fork of the publicsuffix package and uses the same base API.
18- You just need to import publicsuffix2 instead
22+ The code is a fork of the publicsuffix2 package and includes the same base API. In
23+ addition, it contains a few variants useful for certain use cases, such as the option to
24+ ignore wildcards or return only the extended TLD (eTLD).
25+ Publicsuffix2 is an extension of publicsuffix, and uses the same base API.
26+ You just need to import publicsuffix2 instead.
27+
28+ The public suffix list is now provided in UTF-8 format. To correctly process
29+ IDNA-encoded domains, either the query or the list must be converted. This module
30+ contains the option to IDNA-encode the public suffix list upon creating the Trie; this
31+ is set to happen by default. If your use case includes UTF-8 domains, e.g., '食狮.com.cn',
32+ you'll need to set the IDNA-encoding flag to False on instantiation (see examples below).
33+ Failure to use the correct encoding for your use case can lead to incorrect results for
34+ domains that use Unicode characters.
1935
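The difference between the two encodings can be seen with Python's built-in ``idna`` codec, independently of this module. The sketch below is only an illustration of the conversion; the helper names ``to_idna`` and ``to_unicode`` are hypothetical, and the standard-library codec implements the older IDNA 2003 rules, which may differ from what this module uses internally.

```python
def to_idna(domain):
    """Convert a Unicode domain name to its ASCII-compatible (IDNA) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain):
    """Convert an IDNA-encoded domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


unicode_domain = "食狮.com.cn"
encoded = to_idna(unicode_domain)

# The non-ASCII label becomes an "xn--" label; ASCII labels are unchanged.
print(encoded)
print(to_unicode(encoded) == unicode_domain)
```

A query in one form will not match list entries stored in the other, which is why the list and the queries need to agree on a single encoding.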
2036The code is MIT-licensed and the publicsuffix data list is MPL-2.0-licensed.
2137
@@ -31,6 +47,10 @@ The code is MIT-licensed and the publicsuffix data list is MPL-2.0-licensed.
3147Usage
3248-----
3349
50+ To install from source, first build the package file with
51+ ``python setup.py build sdist``
52+ and then pip install the resulting archive from the ``dist`` directory.
53+
3454Install with::
3555
3656 pip install publicsuffix2
@@ -103,6 +123,63 @@ You can use it this way::
103123Note that once loaded, the data file is cached and therefore fetched only
104124once.
105125
126+ If using this library in large-scale PySpark processing, you should instantiate the class as
127+ a global variable, not within a user function. The class methods can then be used within user
128+ functions for distributed processing.
129+
130+ Changes in this Fork
131+ --------------------
132+
133+ This fork of publicsuffix2 addresses a change in the format of the standard public suffix list,
134+ which was previously IDNA-encoded and now is in UTF-8 format, as well as some additional
135+ functionality useful to certain use cases. These additions include the ability to ignore
136+ wildcards and to require strict adherence to the TLDs included in the list. Lastly, we include
137+ some convenience functions for obtaining only the extended TLD (eTLD) rather than the
138+ registrable domain (SLD). These are outlined below.
139+
140+ IDNA-encoding. The public suffix list is now provided in UTF-8 format. For those use cases that
141+ include IDNA-encoded domains, the module will not return accurate results unless the list is
142+ converted. In this fork, IDNA encoding is included as a parameter in the class and is on by
143+ default::
144+
145+ >>> from publicsuffix2 import PublicSuffixList
146+ >>> psl = PublicSuffixList(idna=True) # on by default
147+ >>> psl.get_public_suffix('www.google.com')
148+ 'google.com'
149+ >>> psl = PublicSuffixList(idna=False) # use UTF-8 encodings
150+ >>> psl.get_public_suffix('食狮.com.cn')
151+ '食狮.com.cn'
152+
153+ Ignore wildcards. In some use cases, particularly those related to large-scale domain processing,
154+ the user might want to ignore wildcards to create more aggregation. This is possible by setting
155+ the parameter wildcard=False.
156+
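To make the aggregation effect concrete, the following self-contained sketch applies a toy rule set with and without its wildcard rule. It does not use this module: the function ``toy_registrable_domain``, the two-entry ``RULES`` set, and the exact semantics of ignoring a wildcard are assumptions made for illustration, and the module's own behavior with ``wildcard=False`` may differ in detail.

```python
def toy_registrable_domain(domain, rules, wildcard=True):
    """Return the longest matching suffix plus one label, or None."""
    labels = domain.lower().split(".")
    best = 0
    for n in range(1, len(labels) + 1):
        candidate = labels[-n:]
        # Exact rule match on the last n labels.
        if ".".join(candidate) in rules:
            best = max(best, n)
        # Wildcard match: "*" stands in for the leftmost of the n labels.
        if wildcard and n > 1 and "*." + ".".join(candidate[1:]) in rules:
            best = max(best, n)
    if best == 0 or len(labels) <= best:
        return None
    return ".".join(labels[-(best + 1):])


# Hypothetical rule set: a plain suffix and a wildcard below it.
RULES = {"ck", "*.ck"}

for name in ("a.b.ck", "c.b.ck"):
    print(name,
          toy_registrable_domain(name, RULES),
          toy_registrable_domain(name, RULES, wildcard=False))
```

With the wildcard honored, the two names stay distinct; with it ignored, both collapse to the same shorter domain, which is the extra aggregation described above.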
157+ Require valid eTLDs (strict). In the publicsuffix2 module, a domain with an invalid TLD will still return
158+ a public suffix, e.g.::
159+
160+ >>> psl.get_public_suffix('www.mine.local')
161+ 'mine.local'
162+
163+
164+ This is useful for many use cases, while in others, we want to ensure that the domain includes a
165+ valid eTLD. In this case, the boolean parameter strict provides a solution. If this flag is set,
166+ an invalid TLD will return None::
167+
168+ >>> psl.get_public_suffix('www.mine.local', strict=True) is None
169+ True
170+
171+ Return eTLD only. The standard use case for publicsuffix2 is to return the registrable domain
172+ according to the public suffix list. In some cases, however, we only wish to find the eTLD
173+ itself. In this fork, this is available via the get_tld() method::
174+
175+ >>> psl.get_tld('www.google.com')
176+ 'com'
177+
178+ All of the methods and functions include the wildcard and strict parameters.
179+
180+ For convenience, the public method get_sld() is available. This is identical to the method
181+ get_public_suffix() and is intended to clarify the output for some users.
182+
106183
107184Source
108185------
@@ -116,7 +193,11 @@ branch::
116193
117194History
118195-------
119- This code is forked from Tomaž Šolc's fork of David Wilson's code originally at:
196+ This code is forked from nexB's fork of Tomaž Šolc's fork of David Wilson's code.
197+
198+ The original publicsuffix2 code is Copyright (c) 2015 nexB Inc.
199+
200+ David Wilson's code originally at:
120201
121202https://www.tablix.org/~avian/git/publicsuffix.git
122203
@@ -138,6 +219,7 @@ License
138219The code is MIT-licensed.
139220The vendored public suffix list data from Mozilla is under the MPL-2.0.
140221
222+ Copyright (c) 2019 Renée Burton
141223
142224Copyright (c) 2015 nexB Inc.
143225