
go-fasttld is a high performance top level domains (TLD) extraction module implemented with compressed tries.

go-fasttld


This module is a port of the Python fasttld module, with additional modifications to support extraction of subcomponents from full URLs and IPv4 addresses.

Background

go-fasttld extracts subcomponents like top level domains (TLDs), subdomains and hostnames from URLs efficiently by using the regularly-updated Mozilla Public Suffix List and the compressed trie data structure.

For example, given https://maps.google.com:8080/a/long/path/?query=42, it extracts the TLD com, the subdomain maps, and the domain google.
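To illustrate the idea of suffix lookup with a trie, here is a minimal sketch. It is not the module's implementation: it uses a plain (uncompressed) trie keyed by domain label, ignores the wildcard and exception rules of the Public Suffix List, and the names node, insert and split are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// node is one level of a plain suffix trie keyed by domain label.
type node struct {
	children map[string]*node
	end      bool // true if the path from the root to here is a complete suffix
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert adds a suffix such as "ac.uk", walking its labels right to left.
func (n *node) insert(suffix string) {
	labels := strings.Split(suffix, ".")
	cur := n
	for i := len(labels) - 1; i >= 0; i-- {
		next, ok := cur.children[labels[i]]
		if !ok {
			next = newNode()
			cur.children[labels[i]] = next
		}
		cur = next
	}
	cur.end = true
}

// split returns the subdomain, domain and longest matching suffix of host.
func (n *node) split(host string) (sub, domain, suffix string) {
	labels := strings.Split(host, ".")
	cur := n
	cut := len(labels) // index of the first label that belongs to the suffix
	for i := len(labels) - 1; i >= 0; i-- {
		next, ok := cur.children[labels[i]]
		if !ok {
			break
		}
		cur = next
		if cur.end {
			cut = i
		}
	}
	suffix = strings.Join(labels[cut:], ".")
	if cut > 0 {
		domain = labels[cut-1]
		sub = strings.Join(labels[:cut-1], ".")
	}
	return sub, domain, suffix
}

func main() {
	root := newNode()
	for _, s := range []string{"com", "uk", "ac.uk"} {
		root.insert(s)
	}
	fmt.Println(root.split("maps.google.com"))           // maps google com
	fmt.Println(root.split("a.long.subdomain.ox.ac.uk")) // a.long.subdomain ox ac.uk
}
```

Walking the labels from right to left means each lookup touches at most as many trie nodes as the hostname has labels, regardless of how many suffixes are loaded.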

go-fasttld also supports extraction of private domains listed in the Mozilla Public Suffix List like 'blogspot.co.uk' and 'sinaapp.com', and extraction of IPv4 addresses (e.g. https://127.0.0.1).
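An IPv4 host has no public suffix, so it has to be recognised separately from domain names. How go-fasttld does this internally is not covered here; the sketch below shows one way to make the check with the standard library (net/netip, Go 1.18 or later). The helper isIPv4 is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/netip"
)

// isIPv4 reports whether host parses as a dotted-decimal IPv4 address.
func isIPv4(host string) bool {
	addr, err := netip.ParseAddr(host)
	return err == nil && addr.Is4()
}

func main() {
	for _, h := range []string{"127.0.0.1", "maps.google.com", "::1"} {
		fmt.Printf("%s -> %t\n", h, isIPv4(h))
	}
}
```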

Why not split on "." and take the last element instead?

Splitting on "." and taking the last element only works for single-label TLDs like .com. It fails for multi-label suffixes like oseto.nagasaki.jp, where it would return only jp.
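The failure is easy to reproduce with the standard library alone. In this small sketch, lastLabel is a hypothetical helper that implements the naive approach:

```go
package main

import (
	"fmt"
	"strings"
)

// lastLabel returns whatever follows the final "." in host.
func lastLabel(host string) string {
	return host[strings.LastIndex(host, ".")+1:]
}

func main() {
	fmt.Println(lastLabel("maps.google.com"))       // com
	fmt.Println(lastLabel("www.oseto.nagasaki.jp")) // jp, although the public suffix is oseto.nagasaki.jp
}
```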

Installation

go get github.com/elliotwutingfeng/go-fasttld

Quick Start

A full demo is available in the examples folder.

// Initialise fasttld extractor
extractor, err := fasttld.New(fasttld.SuffixListParams{})
if err != nil {
    log.Fatal(err)
}

// Extract URL subcomponents
url := "https://some-user@a.long.subdomain.ox.ac.uk:5000/a/b/c/d/e/f/g/h/i?id=42"
res := extractor.Extract(fasttld.URLParams{URL: url})

// Display results
fmt.Println(res.Scheme)           // https://
fmt.Println(res.UserInfo)         // some-user
fmt.Println(res.SubDomain)        // a.long.subdomain
fmt.Println(res.Domain)           // ox
fmt.Println(res.Suffix)           // ac.uk
fmt.Println(res.RegisteredDomain) // ox.ac.uk
fmt.Println(res.Port)             // 5000
fmt.Println(res.Path)             // a/b/c/d/e/f/g/h/i?id=42

Public Suffix List options

Specify custom public suffix list file

You can use a custom public suffix list file by setting CacheFilePath in fasttld.SuffixListParams{} to its absolute path.

cacheFilePath := "/absolute/path/to/file.dat"
extractor, _ := fasttld.New(fasttld.SuffixListParams{CacheFilePath: cacheFilePath})

Updating the default Public Suffix List cache

Whenever fasttld.New is called without specifying CacheFilePath in fasttld.SuffixListParams{}, the local cache of the default Public Suffix List is updated automatically if it is more than 3 days old. You can also manually update the cache by using Update().

// Automatic update performed if `CacheFilePath` is not specified
// and local cache is more than 3 days old
extractor, _ := fasttld.New(fasttld.SuffixListParams{})

// Manually update local cache
if err := extractor.Update(); err != nil {
    log.Println(err)
}

Private domains

According to the Mozilla.org wiki, the Mozilla Public Suffix List contains private domains like blogspot.com and sinaapp.com.

By default, go-fasttld does not treat these private domains as suffixes (IncludePrivateSuffix = false).

extractor, _ := fasttld.New(fasttld.SuffixListParams{})

url := "https://google.blogspot.com"
res := extractor.Extract(fasttld.URLParams{URL: url})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = google
// res.Domain = blogspot
// res.Suffix = com
// res.RegisteredDomain = blogspot.com
// res.Port = <no output>
// res.Path = <no output>

You can include private domains by setting IncludePrivateSuffix = true.

extractor, _ := fasttld.New(fasttld.SuffixListParams{IncludePrivateSuffix: true})

url := "https://google.blogspot.com"
res := extractor.Extract(fasttld.URLParams{URL: url})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = <no output>
// res.Domain = google
// res.Suffix = blogspot.com
// res.RegisteredDomain = google.blogspot.com
// res.Port = <no output>
// res.Path = <no output>

Extraction options

Ignore Subdomains

You can ignore subdomains by setting IgnoreSubDomains = true. By default, subdomains are extracted.

extractor, _ := fasttld.New(fasttld.SuffixListParams{})

url := "https://maps.google.com"
res := extractor.Extract(fasttld.URLParams{URL: url, IgnoreSubDomains: true})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = <no output>
// res.Domain = google
// res.Suffix = com
// res.RegisteredDomain = google.com
// res.Port = <no output>
// res.Path = <no output>

Punycode

Convert internationalised URLs to punycode before extraction by setting ConvertURLToPunyCode = true. By default, URLs are not converted to punycode.

extractor, _ := fasttld.New(fasttld.SuffixListParams{})

url := "https://hello.世界.com"
res := extractor.Extract(fasttld.URLParams{URL: url, ConvertURLToPunyCode: true})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = hello
// res.Domain = xn--rhqv96g
// res.Suffix = com
// res.RegisteredDomain = xn--rhqv96g.com
// res.Port = <no output>
// res.Path = <no output>

res = extractor.Extract(fasttld.URLParams{URL: url, ConvertURLToPunyCode: false})

// res.Scheme = https://
// res.UserInfo = <no output>
// res.SubDomain = hello
// res.Domain = 世界
// res.Suffix = com
// res.RegisteredDomain = 世界.com
// res.Port = <no output>
// res.Path = <no output>

Testing

go test -v -coverprofile=test_coverage.out && go tool cover -html=test_coverage.out -o test_coverage.html

Benchmarks

go test -bench=. -benchmem -cpu 1

Modules used

| Benchmark Name | Source |
| --- | --- |
| BenchmarkGoFastTld | go-fasttld (this module) |
| BenchmarkJPilloraGoTld | github.com/jpillora/go-tld |
| BenchmarkJoeGuoTldExtract | github.com/joeguo/tldextract |
| BenchmarkMjd2021USATldExtract | github.com/mjd2021usa/tldextract |
| BenchmarkM507Tlde | github.com/M507/tlde |

Results

Benchmarks performed on AMD Ryzen 7 5800X, Manjaro Linux.

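go-fasttld performs especially well on longer URLs. On the longest URL (#3) it takes 713.7 ns/op against 1327 ns/op for the next-fastest module, while on the shortest URL (#1) it is marginally slower than jpillora/go-tld (477.3 vs 455.9 ns/op).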


#1

https://news.google.com

| Benchmark Name | Iterations | ns/op | B/op | allocs/op | Fastest |
| --- | --- | --- | --- | --- | --- |
| BenchmarkGoFastTld | 2540830 | 477.3 | 224 | 5 | |
| BenchmarkJPilloraGoTld | 2569042 | 455.9 | 224 | 2 | ✔️ |
| BenchmarkJoeGuoTldExtract | 2276013 | 535.6 | 160 | 5 | |
| BenchmarkMjd2021USATldExtract | 1367376 | 877.6 | 208 | 7 | |
| BenchmarkM507Tlde | 2322066 | 516.6 | 160 | 5 | |

#2

https://iupac.org/iupac-announces-the-2021-top-ten-emerging-technologies-in-chemistry/

| Benchmark Name | Iterations | ns/op | B/op | allocs/op | Fastest |
| --- | --- | --- | --- | --- | --- |
| BenchmarkGoFastTld | 2366121 | 497.7 | 336 | 5 | ✔️ |
| BenchmarkJPilloraGoTld | 1792764 | 667.8 | 224 | 2 | |
| BenchmarkJoeGuoTldExtract | 2041777 | 589.1 | 272 | 5 | |
| BenchmarkMjd2021USATldExtract | 1490863 | 803.2 | 288 | 6 | |
| BenchmarkM507Tlde | 2065656 | 561.2 | 272 | 5 | |

#3

https://www.google.com/maps/dir/Parliament+Place,+Parliament+House+Of+Singapore,+Singapore/Parliament+St,+London,+UK/@25.2440033,33.6721455,4z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x31da19a0abd4d71d:0xeda26636dc4ea1dc!2m2!1d103.8504863!2d1.2891543!1m5!1m1!1s0x487604c5aaa7da5b:0xf13a2197d7e7dd26!2m2!1d-0.1260826!2d51.5017061!3e4

| Benchmark Name | Iterations | ns/op | B/op | allocs/op | Fastest |
| --- | --- | --- | --- | --- | --- |
| BenchmarkGoFastTld | 1663136 | 713.7 | 832 | 5 | ✔️ |
| BenchmarkJPilloraGoTld | 445546 | 2600 | 928 | 4 | |
| BenchmarkJoeGuoTldExtract | 807241 | 1368 | 1120 | 6 | |
| BenchmarkMjd2021USATldExtract | 858139 | 1327 | 1120 | 6 | |
| BenchmarkM507Tlde | 747086 | 1373 | 1120 | 6 | |

Acknowledgements
