stream-scan

This tool demonstrates file handling in Perl, focusing on streaming discipline for GB-scale files without memory exhaustion.

Problem It Solves

Large log files and data exports often don't fit in memory. Traditional tools typically fall short in one or more ways:

  • They load the entire file, crashing on files larger than RAM
  • They provide no progress feedback
  • They handle interruption poorly

How It Works

stream-scan reads line-by-line with constant memory usage:

┌────────────────────────────────────────────┐
│ File: 50GB                                 │
│ ██████████████████░░░░░░░░░░░░░░░░░░░░░░░░ │
│ [42.5%] 1.2M lines, 847 matches @ 125MB/s  │
└────────────────────────────────────────────┘

Memory footprint:

  • Single line buffer (~KB)
  • Context buffer if -B/-A used (bounded)
  • Match storage (optional via callbacks)
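
A minimal sketch of this streaming discipline (illustrative only, not the module's internals): read one line at a time, so memory is bounded by the longest line rather than the file size.

use strict;
use warnings;

my $path = shift @ARGV;
open my $fh, '<', $path or die "Cannot open $path: $!";

my ($lines, $matches) = (0, 0);
while (my $line = <$fh>) {
    $lines++;                              # only the current line is in memory
    $matches++ if $line =~ /ERROR|FATAL/;
}
close $fh;
print "$matches matches in $lines lines\n";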

Installation

cd stream-scan
perl -Ilib bin/stream-scan --help

Usage

Basic Search

# Search for errors
stream-scan 'ERROR|FATAL' /var/log/huge.log

# Case-insensitive
stream-scan -i 'timeout' access.log

# Invert match (lines NOT containing pattern)
stream-scan -v 'DEBUG' app.log

Progress & Large Files

# Show progress with throughput stats
stream-scan -P 'exception' 50gb-logfile.log

# Scan compressed files via pipe
zcat huge.log.gz | stream-scan -P 'error'

# Monitor growing log
tail -f /var/log/app.log | stream-scan 'critical'

Context Lines

# 3 lines before and after (like grep -C)
stream-scan -C 3 'NullPointer' app.log

# 5 lines before only
stream-scan -B 5 'FATAL' error.log

# 2 lines after only
stream-scan -A 2 'Started' boot.log
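
Before-context only ever needs a buffer of -B lines, so memory stays bounded. A rough sketch of the idea (names and pattern are illustrative, not the tool's source):

use strict;
use warnings;

my $before = 3;       # as with -B 3
my @context;          # never holds more than $before lines

while (my $line = <STDIN>) {
    if ($line =~ /NullPointer/) {
        print @context;                          # bounded before-context
        print $line;
        @context = ();
    }
    else {
        push @context, $line;
        shift @context if @context > $before;    # drop the oldest line
    }
}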

Counting & Files

# Count matches only
stream-scan -c 'failed' auth.log

# List files with matches
stream-scan -l 'TODO' src/*.pl

# Quiet mode (exit status only)
stream-scan -q 'secret' config.txt && echo "Found!"

Line Ranges

# Start at line 1000
stream-scan --start-line 1000 'error' huge.log

# Lines 1000-2000 only
stream-scan --start-line 1000 --end-line 2000 'error' huge.log

# First 100 matches only
stream-scan -m 100 'warning' verbose.log

Field Extraction

# Extract fields 2 and 4 from CSV matches
stream-scan -F ',' -f 2,4 'ERROR' data.csv

# Tab-separated
stream-scan -F '\t' -f 1,3 'fail' data.tsv
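
Conceptually, each matching line is split on the separator and the requested 1-indexed fields are re-joined. A sketch of that behavior (not the tool's actual implementation):

use strict;
use warnings;

my $sep  = ',';        # as with -F ','
my @want = (2, 4);     # 1-indexed, as with -f 2,4

while (my $line = <STDIN>) {
    next unless $line =~ /ERROR/;
    chomp $line;
    my @cols = split /\Q$sep\E/, $line;
    print join($sep, @cols[ map { $_ - 1 } @want ]), "\n";
}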

Perl API

use StreamScan;

# Basic usage
my $scanner = StreamScan->new(
    pattern   => qr/ERROR|FATAL/,
    progress  => 1,
);
my $result = $scanner->scan_file('/var/log/app.log');

# With callbacks (constant memory)
my $scanner = StreamScan->new(
    pattern  => qr/ERROR/,
    on_match => sub {
        my $match = shift;
        print "$match->{line_num}: $match->{line}\n";
    },
    on_progress => sub {
        my $info = shift;
        printf "\r%d lines, %d matches",
            $info->{lines_read}, $info->{matches};
    },
);
$scanner->scan_file($path);

# Custom predicate
my $scanner = StreamScan->new(
    predicate => sub {
        my $line = shift;
        return length($line) > 1000;  # Lines over 1KB
    },
);
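
The pieces above can be combined. For example, a custom predicate plus an on_match callback keeps memory constant while still reporting every hit; this reuses only the constructor arguments and match fields shown above:

use StreamScan;

my $scanner = StreamScan->new(
    predicate => sub { length(shift) > 1000 },   # lines over 1KB
    on_match  => sub {
        my $match = shift;
        printf "long line %d (%d bytes)\n",
            $match->{line_num}, length $match->{line};
    },
);
$scanner->scan_file('/var/log/app.log');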

Options

Option       Description
-e PATTERN   Regex pattern to match
-i           Case-insensitive
-v           Invert match
-c           Count only
-l           List files with matches
-n           Show line numbers (default: on)
-B NUM       Lines before match
-A NUM       Lines after match
-C NUM       Context lines (before + after)
-m NUM       Stop after NUM matches
-P           Show progress indicator
-q           Quiet mode
-F SEP       Field separator
-f LIST      Fields to extract (1-indexed)

Exit Codes

Code   Meaning
0      Matches found
1      No matches
2      Error
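
From a Perl script, the exit code can be read back with the usual system() idiom (a generic Perl pattern, not specific to this tool); the code values follow the table above:

my $status = system('stream-scan', '-q', 'secret', 'config.txt');
my $code   = $status == -1 ? 2 : ($status >> 8);

if    ($code == 0) { print "Matches found\n" }
elsif ($code == 1) { print "No matches\n" }
else               { warn "stream-scan failed or reported an error\n" }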

Performance

Tested with production log files:

File Size   Lines   Time   Memory
1 GB        8M      12s    4 MB
10 GB       80M     2m     4 MB
50 GB       400M    10m    4 MB

Memory stays constant regardless of file size.

Synthetic Data Generator

Included is generate-test-data for creating test files with controlled match rates:

# Generate 100MB log with 1% ERROR lines
bin/generate-test-data -s 100M -t log -r 0.01 -m ERROR -o test.log

# Generate 1GB Apache log with progress
bin/generate-test-data -s 1G -t apache -P -o access.log

# Test directly via pipe
bin/generate-test-data -s 50M -r 0.005 -m FATAL | bin/stream-scan -P FATAL

# Reproducible output
bin/generate-test-data -s 10M --seed 42 -o deterministic.log

Data Types

Type       Description
log        Application log format (default)
apache     Apache access log format
syslog     Syslog format
json       JSON lines
csv        CSV with header
simple     Basic text lines
encoding   Mixed valid/invalid UTF-8 (for utf8-doctor testing)

Match Rate Control

The -r option controls what fraction of lines contain the match pattern:

# 0.1% match rate (1 in 1000 lines)
generate-test-data -s 100M -r 0.001 -m CRITICAL -o sparse.log

# 50% match rate (stress test)
generate-test-data -s 10M -r 0.5 -m WARNING -o dense.log
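
One straightforward way to realize a fractional match rate is a per-line random draw. A sketch of the idea (assumed behavior, not generate-test-data's actual implementation):

use strict;
use warnings;

my $rate  = 0.01;      # as with -r 0.01: roughly 1% of lines
my $token = 'ERROR';

for my $n (1 .. 100_000) {
    my $level = rand() < $rate ? $token : 'INFO';
    print "$level synthetic line $n\n";
}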

Running Tests

prove -l t/

Tests cover:

  • Pattern matching
  • Inverted matches
  • Line ranges
  • Max matches
  • Context lines (before/after)
  • Custom predicates
  • Count mode
  • Large file streaming
  • Progress callbacks
  • Throughput stats

Design Decisions

  1. Line-by-line reading: Never slurp entire file
  2. Bounded context buffer: O(context_lines), not O(file_size)
  3. Optional match storage: Use callbacks for true constant memory
  4. Signal handling: Clean Ctrl+C exit preserves partial results
  5. Progress as callback: Customizable, testable, not hardcoded
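
For example, the clean interruption in decision 4 can be achieved by trapping SIGINT, finishing the current line, and then emitting the partial summary. A sketch of that pattern (not necessarily how StreamScan implements it):

use strict;
use warnings;

my $path    = shift @ARGV;
my $pattern = qr/ERROR/;

my $interrupted = 0;
$SIG{INT} = sub { $interrupted = 1 };    # set a flag; don't die mid-line

open my $fh, '<', $path or die "Cannot open $path: $!";
my ($lines, $matches) = (0, 0);
while (my $line = <$fh>) {
    last if $interrupted;                # stop cleanly at a line boundary
    $lines++;
    $matches++ if $line =~ $pattern;
}
close $fh;
printf "%s%d lines scanned, %d matches\n",
    $interrupted ? 'interrupted: ' : '', $lines, $matches;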

When to Use Over grep

Scenario            stream-scan   grep
File > RAM          Yes           May fail
Progress needed     Yes           No
Clean interrupt     Yes           Partial
Custom predicates   Yes           No
Field extraction    Built-in      Needs cut/awk

See Also

Author

Ed Bates — TECHBLIP LLC

License

Licensed under the Apache License, Version 2.0.
