PyDomainExtractor is a Python library designed to parse domain names quickly. In order to achieve the highest performance possible, the library was written in Rust.
Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13rd 2022)
Library | Function | Time |
---|---|---|
PyDomainExtractor | pydomainextractor.extract | 1.50s |
publicsuffix2 | publicsuffix2.get_sld | 9.92s |
tldextract | __call__ | 29.23s |
tld | tld.parse_tld | 34.48s |
The test was conducted on a file containing 1 million random urls (Mar. 13rd 2022)
Library | Function | Time |
---|---|---|
PyDomainExtractor | pydomainextractor.extract_from_url | 2.24s |
publicsuffix2 | publicsuffix2.get_sld | 10.84s |
tldextract | __call__ | 36.04s |
tld | tld.parse_tld | 57.87s |
pip3 install PyDomainExtractor
import pydomainextractor
# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.extract('google.com')
>>> {
>>> 'subdomain': '',
>>> 'domain': 'google',
>>> 'suffix': 'com'
>>> }
# Loads a custom SuffixList data. Should follow PublicSuffixList's format.
domain_extractor = pydomainextractor.DomainExtractor(
'tld\n'
'custom.tld\n'
)
domain_extractor.extract('google.com')
>>> {
>>> 'subdomain': 'google',
>>> 'domain': 'com',
>>> 'suffix': ''
>>> }
domain_extractor.extract('google.custom.tld')
>>> {
>>> 'subdomain': '',
>>> 'domain': 'google',
>>> 'suffix': 'custom.tld'
>>> }
import pydomainextractor
# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.extract_from_url('http://google.com/')
>>> {
>>> 'subdomain': '',
>>> 'domain': 'google',
>>> 'suffix': 'com'
>>> }
import pydomainextractor
# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.is_valid_domain('google.com')
>>> True
domain_extractor.is_valid_domain('domain.اتصالات')
>>> True
domain_extractor.is_valid_domain('xn--mgbaakc7dvf.xn--mgbaakc7dvf')
>>> True
domain_extractor.is_valid_domain('domain-.com')
>>> False
domain_extractor.is_valid_domain('-sub.domain.com')
>>> False
domain_extractor.is_valid_domain('\xF0\x9F\x98\x81nonalphanum.com')
>>> False
import pydomainextractor
# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.get_tld_list()
>>> [
>>> 'bostik',
>>> 'backyards.banzaicloud.io',
>>> 'biz.bb',
>>> ...
>>> ]
Distributed under the MIT License. See LICENSE
for more information.
Gal Ben David - gal@intsights.com
Project Link: https://github.com/Intsights/PyDomainExtractor