GitHub - k0ality/ip-addr-counter: HyperLogLog IPv4 counter

How I approached the task

Re-read info on IPv4 representation (32-bit integer) and limits (2^32) to estimate maximum unique count. Took note to convert each line to minimize space: ~15 bytes (IP + newline) to 4 bytes (uint32).
Downloaded test file and checked the number of lines (wc -l) = 8B (8,000,000,000)
Assumed constant for all solutions:
- I/O time = file size / SSD throughput = 120 GB / 500 MB/s = 240 seconds = 4 minutes
- Additional memory for Go runtime, file buffer, stack/heap = ~16 MB
Did back of the envelope calculations for space needed for exact counting (Hash Set):
- Memory = records * uint32 size * hash table overhead (~2) = 8B * 4 * 2 = 64 GB
Did back of the envelope calculations for bitmap approach:
- Possible IPv4 count = 2^32
- Memory = records * 1 bit / 8 (to convert to bytes) = 2^32 * 1 / 8 = 536,870,912 bytes = 512 MB
- CPU usage (single-threaded) Parsing a string, calculating the bit position, and inserting into a map are simple operations that will take nanoseconds. Let's assume we will be able to process 30M records per second. CPU Time = 8B / 30M = 267 sec = ~ 5 min with realistic overhead.
  Feasible solution. But I felt like there should be a better trade-off.
Dug deeper. First thought about Bloom filters (not applicable here), from there headed off to linear counting algorithm -> LogLog -> HyperLogLog. To calculate HLL memory usage, I had to take the precision parameter into account. Precision determines the tradeoff between accuracy and memory.
b < 4 useless estimates since the error is too high
b > 18 diminishing returns, increased cache misses - better to switch to a bitmap solution, since we arrive at 512 MB

Precision (b) Registers (m) Memory Standard Error

4 16 16 B ~26%

10 1,024 1 KB ~3.2%

12 4,096 4 KB ~1.6%

14 16,384 16 KB ~0.81%

16 65,536 64 KB ~0.40%

18 262,144 256 KB ~0.20%
- CPU usage (single-threaded) There are more operations than in the bitmap solution, and they are more complex, specifically hashing. Let's assume we will be able to process only 10M records per second. CPU Time = 8B / 10M = 800 sec = ~ 14 min with realistic overhead
Optimized for multi-threaded processing:

Mode	Workers	Time (estimate)	Memory	Speedup
Single-threaded	1	~14 min	16 KB	1x
Parallel	4	~4-5 min	64 KB	3-3.5x
Parallel	8	~2-3 min	128 KB	5-6x

P.S. Streaming HLL & HLL++ optimizations seem to be out of scope for the given example

Usage

make help

Local

go build -o ip-counter main.go
./ip-counter input.txt 14

Docker

docker build -t ip-counter .
docker run -v $(PWD)/data:/data ip-counter /data/ip_addresses 14 --parallel --workers 8

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
Makefile		Makefile
README.md		README.md
docker-compose.yml		docker-compose.yml
file_processor.go		file_processor.go
generate_test_data.go		generate_test_data.go
go.mod		go.mod
hash.go		hash.go
hll.go		hll.go
ipv4.go		ipv4.go
main.go		main.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

How I approached the task

Usage

Local

Docker

About

Uh oh!

Releases

Packages

Languages

Precision (b)	Registers (m)	Memory	Standard Error
4	16	16 B	~26%
10	1,024	1 KB	~3.2%
12	4,096	4 KB	~1.6%
14	16,384	16 KB	~0.81%
16	65,536	64 KB	~0.40%
18	262,144	256 KB	~0.20%

k0ality/ip-addr-counter

Folders and files

Latest commit

History

Repository files navigation

How I approached the task

Usage

Local

Docker

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages