go-fasttld is a high-performance top-level domain (TLD) extraction module implemented with compressed tries.
This module is a port of the Python fasttld module, with additional modifications to support extraction of subcomponents from full URLs and IPv4 addresses.
go-fasttld efficiently extracts subcomponents such as top-level domains (TLDs), subdomains and hostnames from URLs by using the regularly updated Mozilla Public Suffix List and the compressed trie data structure.
For example, it extracts the `com` TLD, `maps` subdomain, and `google` domain from `https://maps.google.com:8080/a/long/path/?query=42`.
go-fasttld also supports extraction of private domains listed in the Mozilla Public Suffix List, like `blogspot.co.uk` and `sinaapp.com`, and extraction of IPv4 addresses (e.g. `https://127.0.0.1`).
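An IPv4 host has no TLD, so it has to be recognised before any suffix lookup happens. The following is a minimal sketch of one way to detect that case with the Go standard library. The helper name `isIPv4Host` is hypothetical and this is not how go-fasttld itself is implemented.

```go
package main

import (
	"fmt"
	"net/netip"
	"net/url"
)

// isIPv4Host is a hypothetical helper (not part of go-fasttld).
// It reports whether the host part of rawURL is a plain IPv4 address.
func isIPv4Host(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	// Hostname() strips any port, e.g. "127.0.0.1:8080" becomes "127.0.0.1".
	addr, err := netip.ParseAddr(u.Hostname())
	return err == nil && addr.Is4()
}

func main() {
	fmt.Println(isIPv4Host("https://127.0.0.1"))           // true
	fmt.Println(isIPv4Host("https://127.0.0.1:8080/path")) // true
	fmt.Println(isIPv4Host("https://maps.google.com"))     // false
}
```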
Splitting on `.` and taking the last element only works for simple TLDs like `.com`, but not for more complex ones like `oseto.nagasaki.jp`.
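To illustrate the difference, here is a minimal sketch that compares the naive split with a lookup in a label-wise suffix trie. It is a plain (uncompressed) trie with a hand-picked set of suffixes, it ignores the wildcard and exception rules of the Public Suffix List, and the names `node`, `insert`, `longestSuffix` and `naiveTLD` are hypothetical. It is not go-fasttld's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// node is one level of a label-wise suffix trie (hypothetical type).
type node struct {
	children map[string]*node
	end      bool // true if the labels from the root to here form a complete suffix
}

// insert adds a suffix such as "nagasaki.jp" to the trie, right-most label first.
func (n *node) insert(suffix string) {
	labels := strings.Split(suffix, ".")
	cur := n
	for i := len(labels) - 1; i >= 0; i-- {
		if cur.children == nil {
			cur.children = map[string]*node{}
		}
		next, ok := cur.children[labels[i]]
		if !ok {
			next = &node{}
			cur.children[labels[i]] = next
		}
		cur = next
	}
	cur.end = true
}

// longestSuffix returns the longest stored suffix that host ends with, or "" if none.
func (n *node) longestSuffix(host string) string {
	labels := strings.Split(host, ".")
	cur, best := n, len(labels) // best is the index where the suffix starts
	for i := len(labels) - 1; i >= 0; i-- {
		next, ok := cur.children[labels[i]]
		if !ok {
			break
		}
		cur = next
		if cur.end {
			best = i
		}
	}
	return strings.Join(labels[best:], ".")
}

// naiveTLD splits on "." and returns the last element.
func naiveTLD(host string) string {
	labels := strings.Split(host, ".")
	return labels[len(labels)-1]
}

func main() {
	root := &node{}
	for _, s := range []string{"com", "jp", "nagasaki.jp", "oseto.nagasaki.jp"} {
		root.insert(s)
	}
	host := "example.oseto.nagasaki.jp"
	fmt.Println(naiveTLD(host))           // jp
	fmt.Println(root.longestSuffix(host)) // oseto.nagasaki.jp
}
```

Walking the labels from right to left means the lookup cost grows with the number of labels in the hostname rather than with the size of the suffix list.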
```sh
go get github.com/elliotwutingfeng/go-fasttld
```
Full demo available in the examples folder.

```go
// Initialise fasttld extractor
extractor, _ := fasttld.New(fasttld.SuffixListParams{})

// Extract URL subcomponents
url := "https://some-user@a.long.subdomain.ox.ac.uk:5000/a/b/c/d/e/f/g/h/i?id=42"
res := extractor.Extract(fasttld.URLParams{URL: url})

// Display results
fmt.Println(res.Scheme)           // https://
fmt.Println(res.UserInfo)         // some-user
fmt.Println(res.SubDomain)        // a.long.subdomain
fmt.Println(res.Domain)           // ox
fmt.Println(res.Suffix)           // ac.uk
fmt.Println(res.RegisteredDomain) // ox.ac.uk
fmt.Println(res.Port)             // 5000
fmt.Println(res.Path)             // a/b/c/d/e/f/g/h/i?id=42
```
You can use a custom public suffix list file by setting `CacheFilePath` in `fasttld.SuffixListParams{}` to its absolute path.

```go
cacheFilePath := "/absolute/path/to/file.dat"
extractor, _ := fasttld.New(fasttld.SuffixListParams{CacheFilePath: cacheFilePath})
```
Whenever `fasttld.New` is called without specifying `CacheFilePath` in `fasttld.SuffixListParams{}`, the local cache of the default Public Suffix List is updated automatically if it is more than 3 days old. You can also manually update the cache by using `Update()`.

```go
// Automatic update performed if `CacheFilePath` is not specified
// and local cache is more than 3 days old
extractor, _ := fasttld.New(fasttld.SuffixListParams{})

// Manually update local cache
if err := extractor.Update(); err != nil {
	log.Println(err)
}
```
According to the Mozilla.org wiki, the Mozilla Public Suffix List contains private domains like `blogspot.com` and `sinaapp.com`.

By default, go-fasttld excludes these private domains (i.e. `IncludePrivateSuffix = false`).

```go
extractor, _ := fasttld.New(fasttld.SuffixListParams{})

url := "https://google.blogspot.com"
res := extractor.Extract(fasttld.URLParams{URL: url})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = google
// res.Domain = blogspot
// res.Suffix = com
// res.RegisteredDomain = blogspot.com
// res.Port = <no output>
// res.Path = <no output>
```
You can include private domains by setting `IncludePrivateSuffix = true`.

```go
extractor, _ := fasttld.New(fasttld.SuffixListParams{IncludePrivateSuffix: true})

url := "https://google.blogspot.com"
res := extractor.Extract(fasttld.URLParams{URL: url})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = <no output>
// res.Domain = google
// res.Suffix = blogspot.com
// res.RegisteredDomain = google.blogspot.com
// res.Port = <no output>
// res.Path = <no output>
```
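The effect of the flag can be pictured as choosing which suffix entries are eligible for the longest match. The sketch below uses a two-entry map instead of the real Public Suffix List. The names `suffixes` and `split` are hypothetical and the code is an illustration only, not go-fasttld's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// suffixes maps a few example suffixes to whether they are private (true)
// or ICANN-managed (false). Hypothetical stand-in for the Public Suffix List.
var suffixes = map[string]bool{
	"com":          false,
	"blogspot.com": true,
}

// split is a hypothetical helper that returns the subdomain, domain and suffix
// of host, skipping private suffixes unless includePrivate is true.
func split(host string, includePrivate bool) (sub, domain, suffix string) {
	labels := strings.Split(host, ".")
	// Try candidate suffixes from longest to shortest and take the first match.
	for i := 0; i < len(labels); i++ {
		candidate := strings.Join(labels[i:], ".")
		private, ok := suffixes[candidate]
		if !ok || (private && !includePrivate) {
			continue
		}
		suffix = candidate
		if i > 0 {
			domain = labels[i-1]
			sub = strings.Join(labels[:i-1], ".")
		}
		return sub, domain, suffix
	}
	return "", "", ""
}

func main() {
	sub, domain, suffix := split("google.blogspot.com", false)
	fmt.Printf("%q %q %q\n", sub, domain, suffix) // "google" "blogspot" "com"

	sub, domain, suffix = split("google.blogspot.com", true)
	fmt.Printf("%q %q %q\n", sub, domain, suffix) // "" "google" "blogspot.com"
}
```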
You can ignore subdomains by setting `IgnoreSubDomains = true`. By default, subdomains are extracted.

```go
extractor, _ := fasttld.New(fasttld.SuffixListParams{})

url := "https://maps.google.com"
res := extractor.Extract(fasttld.URLParams{URL: url, IgnoreSubDomains: true})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = <no output>
// res.Domain = google
// res.Suffix = com
// res.RegisteredDomain = google.com
// res.Port = <no output>
// res.Path = <no output>
```
Convert internationalised URLs to punycode before extraction by setting `ConvertURLToPunyCode = true`. By default, URLs are not converted to punycode.

```go
extractor, _ := fasttld.New(fasttld.SuffixListParams{})

url := "https://hello.世界.com"

res := extractor.Extract(fasttld.URLParams{URL: url, ConvertURLToPunyCode: true})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = hello
// res.Domain = xn--rhqv96g
// res.Suffix = com
// res.RegisteredDomain = xn--rhqv96g.com
// res.Port = <no output>
// res.Path = <no output>

res = extractor.Extract(fasttld.URLParams{URL: url, ConvertURLToPunyCode: false})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = hello
// res.Domain = 世界
// res.Suffix = com
// res.RegisteredDomain = 世界.com
// res.Port = <no output>
// res.Path = <no output>
```
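```sh
# Run the tests and write an HTML coverage report
go test -v -coverprofile=test_coverage.out && go tool cover -html=test_coverage.out -o test_coverage.html

# Run the benchmarks on a single CPU, reporting memory allocations
go test -bench=. -benchmem -cpu 1
```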
Benchmark Name | Source |
---|---|
BenchmarkGoFastTld | go-fasttld (this module) |
BenchmarkJPilloraGoTld | github.com/jpillora/go-tld |
BenchmarkJoeGuoTldExtract | github.com/joeguo/tldextract |
BenchmarkMjd2021USATldExtract | github.com/mjd2021usa/tldextract |
BenchmarkM507Tlde | github.com/M507/tlde |
Benchmarks performed on AMD Ryzen 7 5800X, Manjaro Linux.
go-fasttld performs especially well on longer URLs.
Benchmark Name | Iterations | ns/op | B/op | allocs/op | Fastest |
---|---|---|---|---|---|
BenchmarkGoFastTld | 2540830 | 477.3 ns/op | 224 B/op | 5 allocs/op | |
BenchmarkJPilloraGoTld | 2569042 | 455.9 ns/op | 224 B/op | 2 allocs/op | ✔️ |
BenchmarkJoeGuoTldExtract | 2276013 | 535.6 ns/op | 160 B/op | 5 allocs/op | |
BenchmarkMjd2021USATldExtract | 1367376 | 877.6 ns/op | 208 B/op | 7 allocs/op | |
BenchmarkM507Tlde | 2322066 | 516.6 ns/op | 160 B/op | 5 allocs/op | |
https://iupac.org/iupac-announces-the-2021-top-ten-emerging-technologies-in-chemistry/
Benchmark Name | Iterations | ns/op | B/op | allocs/op | Fastest |
---|---|---|---|---|---|
BenchmarkGoFastTld | 2366121 | 497.7 ns/op | 336 B/op | 5 allocs/op | ✔️ |
BenchmarkJPilloraGoTld | 1792764 | 667.8 ns/op | 224 B/op | 2 allocs/op | |
BenchmarkJoeGuoTldExtract | 2041777 | 589.1 ns/op | 272 B/op | 5 allocs/op | |
BenchmarkMjd2021USATldExtract | 1490863 | 803.2 ns/op | 288 B/op | 6 allocs/op | |
BenchmarkM507Tlde | 2065656 | 561.2 ns/op | 272 B/op | 5 allocs/op | |
Benchmark Name | Iterations | ns/op | B/op | allocs/op | Fastest |
---|---|---|---|---|---|
BenchmarkGoFastTld | 1663136 | 713.7 ns/op | 832 B/op | 5 allocs/op | ✔️ |
BenchmarkJPilloraGoTld | 445546 | 2600 ns/op | 928 B/op | 4 allocs/op | |
BenchmarkJoeGuoTldExtract | 807241 | 1368 ns/op | 1120 B/op | 6 allocs/op | |
BenchmarkMjd2021USATldExtract | 858139 | 1327 ns/op | 1120 B/op | 6 allocs/op | |
BenchmarkM507Tlde | 747086 | 1373 ns/op | 1120 B/op | 6 allocs/op | |