This tool demonstrates file handling in Perl, focusing on streaming discipline: scanning GB-scale files without exhausting memory.

Large log files and data exports don't fit in memory. Traditional tools typically:

- Load the entire file (crashing on large inputs)
- Provide no progress feedback
- Handle interruption poorly

stream-scan reads line-by-line with constant memory usage:

```
┌──────────────────────────────────────────┐
│ File: 50GB                               │
│ ████████████████████░░░░░░░░░░░░░░░░░░░  │
│ [42.5%] 1.2M lines, 847 matches @ 125MB/s│
└──────────────────────────────────────────┘
```
Memory footprint:

- Single line buffer (~KB)
- Context buffer when -B/-A is used (bounded by the context depth)
- Match storage (optional; avoided entirely by using callbacks)
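
The core loop is easy to picture. Below is a minimal, hypothetical sketch of the technique, not the module's actual code: one line in memory at a time, plus a bounded buffer of preceding lines for -B-style context. The hard-coded pattern and context depth are illustrative only.

```perl
use strict;
use warnings;

my $pattern = qr/ERROR|FATAL/;   # illustrative pattern
my $before  = 3;                 # keep at most 3 lines of before-context
my @context;                     # bounded buffer of preceding lines

open my $fh, '<', $ARGV[0] or die "open $ARGV[0]: $!";
while (my $line = <$fh>) {
    if ($line =~ $pattern) {
        print @context, $line;   # emit buffered context, then the match
        @context = ();           # reset so context is never printed twice
    }
    else {
        push @context, $line;
        shift @context if @context > $before;   # never grows past $before
    }
}
close $fh;
```

Memory use is one line plus at most $before buffered lines, no matter how large the file is.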

Quick start:

```sh
cd stream-scan
perl -Ilib bin/stream-scan --help
```

```sh
# Search for errors
stream-scan 'ERROR|FATAL' /var/log/huge.log
# Case-insensitive
stream-scan -i 'timeout' access.log
# Invert match (lines NOT containing pattern)
stream-scan -v 'DEBUG' app.log
```

```sh
# Show progress with throughput stats
stream-scan -P 'exception' 50gb-logfile.log
# Scan compressed files via pipe
zcat huge.log.gz | stream-scan -P 'error'
# Monitor growing log
tail -f /var/log/app.log | stream-scan 'critical'
```

```sh
# 3 lines before and after (like grep -C)
stream-scan -C 3 'NullPointer' app.log
# 5 lines before only
stream-scan -B 5 'FATAL' error.log
# 2 lines after only
stream-scan -A 2 'Started' boot.log
```

```sh
# Count matches only
stream-scan -c 'failed' auth.log
# List files with matches
stream-scan -l 'TODO' src/*.pl
# Quiet mode (exit status only)
stream-scan -q 'secret' config.txt && echo "Found!"
```

```sh
# Start at line 1000
stream-scan --start-line 1000 'error' huge.log
# Lines 1000-2000 only
stream-scan --start-line 1000 --end-line 2000 'error' huge.log
# First 100 matches only
stream-scan -m 100 'warning' verbose.log
```

```sh
# Extract fields 2 and 4 from CSV matches
stream-scan -F ',' -f 2,4 'ERROR' data.csv
# Tab-separated
stream-scan -F '\t' -f 1,3 'fail' data.tsv
```

The scanner can also be used directly as a Perl library:

```perl
use StreamScan;

# Basic usage
my $scanner = StreamScan->new(
    pattern  => qr/ERROR|FATAL/,
    progress => 1,
);
my $result = $scanner->scan_file('/var/log/app.log');
```

```perl
# With callbacks (constant memory)
my $scanner = StreamScan->new(
    pattern     => qr/ERROR/,
    on_match    => sub {
        my $match = shift;
        print "$match->{line_num}: $match->{line}\n";
    },
    on_progress => sub {
        my $info = shift;
        printf "\r%d lines, %d matches",
            $info->{lines_read}, $info->{matches};
    },
);
$scanner->scan_file($path);
```
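
The percentage and MB/s figures in the progress display can be computed in a custom on_progress callback. The sketch below is an assumption about how that might look: it presumes a hypothetical bytes_read field in the progress info (only lines_read and matches appear in the example above) and takes the total size from Perl's -s file test.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);
use StreamScan;

# Hypothetical throughput display. Assumes $info->{bytes_read} exists;
# only lines_read and matches are shown in the documented callback.
my $path  = '/var/log/app.log';
my $size  = -s $path;
my $start = time();

my $scanner = StreamScan->new(
    pattern     => qr/ERROR/,
    on_progress => sub {
        my $info    = shift;
        my $elapsed = time() - $start || 1;
        printf "\r[%.1f%%] %d lines, %d matches @ %.0fMB/s",
            100 * $info->{bytes_read} / $size,
            $info->{lines_read},
            $info->{matches},
            $info->{bytes_read} / $elapsed / 1e6;
    },
);
$scanner->scan_file($path);
```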

```perl
# Custom predicate
my $scanner = StreamScan->new(
    predicate => sub {
        my $line = shift;
        return length($line) > 1000;   # Lines over 1KB
    },
);
```

Options:

| Option | Description |
|---|---|
| -e PATTERN | Regex pattern to match |
| -i | Case-insensitive |
| -v | Invert match |
| -c | Count only |
| -l | List files with matches |
| -n | Show line numbers (default: on) |
| -B NUM | Lines before match |
| -A NUM | Lines after match |
| -C NUM | Context lines (before + after) |
| -m NUM | Stop after NUM matches |
| -P | Show progress indicator |
| -q | Quiet mode |
| -F SEP | Field separator |
| -f LIST | Fields to extract (1-indexed) |

Exit codes:

| Code | Meaning |
|---|---|
| 0 | Matches found |
| 1 | No matches |
| 2 | Error |
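
Because the exit codes follow the grep convention, scripts can branch on all three outcomes. A small Perl usage sketch (assuming stream-scan is on PATH):

```perl
use strict;
use warnings;

# Run quietly and inspect the exit code; it lives in the high byte of $?.
system('stream-scan', '-q', 'ERROR', '/var/log/app.log');
my $code = $? >> 8;

if    ($code == 0) { print "matches found\n" }
elsif ($code == 1) { print "no matches\n" }
else               { die "stream-scan failed (exit $code)\n" }
```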

Tested with production log files:

| File Size | Lines | Time | Memory |
|---|---|---|---|
| 1 GB | 8M | 12s | 4 MB |
| 10 GB | 80M | 2m | 4 MB |
| 50 GB | 400M | 10m | 4 MB |
Memory stays constant regardless of file size.

The included generate-test-data utility creates test files with controlled match rates:

```sh
# Generate 100MB log with 1% ERROR lines
bin/generate-test-data -s 100M -t log -r 0.01 -m ERROR -o test.log
# Generate 1GB Apache log with progress
bin/generate-test-data -s 1G -t apache -P -o access.log
# Test directly via pipe
bin/generate-test-data -s 50M -r 0.005 -m FATAL | bin/stream-scan -P FATAL
# Reproducible output
bin/generate-test-data -s 10M --seed 42 -o deterministic.log
```

Supported output types:

| Type | Description |
|---|---|
| log | Application log format (default) |
| apache | Apache access log format |
| syslog | Syslog format |
| json | JSON lines |
| csv | CSV with header |
| simple | Basic text lines |
| encoding | Mixed valid/invalid UTF-8 (for utf8-doctor testing) |

The -r option controls what fraction of lines contain the match pattern:

```sh
# 0.1% match rate (1 in 1000 lines)
generate-test-data -s 100M -r 0.001 -m CRITICAL -o sparse.log
# 50% match rate (stress test)
generate-test-data -s 10M -r 0.5 -m WARNING -o dense.log
```

Run the test suite with prove:

```sh
prove -l t/
```

Tests cover:

- Pattern matching
- Inverted matches
- Line ranges
- Max matches
- Context lines (before/after)
- Custom predicates
- Count mode
- Large file streaming
- Progress callbacks
- Throughput stats

Design notes:

- Line-by-line reading: Never slurp the entire file
- Bounded context buffer: O(context_lines), not O(file_size)
- Optional match storage: Use callbacks for true constant memory
- Signal handling: Clean Ctrl+C exit preserves partial results (see the sketch after this list)
- Progress as callback: Customizable, testable, not hardcoded
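
The signal handling follows a common flag-and-check pattern: the SIGINT handler only sets a variable, and the read loop tests it once per line, so an interrupt always lands on a line boundary with the counters intact. A minimal sketch of that pattern (illustrative, not the module's exact handler):

```perl
use strict;
use warnings;

# Handler sets a flag; the loop checks it once per line, so Ctrl+C
# stops at a line boundary and the partial counts remain valid.
my $interrupted = 0;
$SIG{INT} = sub { $interrupted = 1 };

my ($lines, $matches) = (0, 0);
open my $fh, '<', $ARGV[0] or die "open $ARGV[0]: $!";
while (my $line = <$fh>) {
    last if $interrupted;
    $lines++;
    $matches++ if $line =~ /ERROR/;   # illustrative pattern
}
close $fh;

printf "%s: %d lines scanned, %d matches\n",
    $interrupted ? 'interrupted' : 'done', $lines, $matches;
```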

How stream-scan compares with grep:

| Scenario | stream-scan | grep |
|---|---|---|
| File > RAM | Yes | May fail |
| Progress needed | Yes | No |
| Clean interrupt | Yes | Partial |
| Custom predicates | Yes | No |
| Field extraction | Built-in | Needs cut/awk |
Ed Bates — TECHBLIP LLC

Licensed under the Apache License, Version 2.0.