stream-scan

This tool demonstrates file handling in Perl, focusing on streaming discipline for GB-scale files without memory exhaustion.

Problem It Solves

Large log files and data exports often don't fit in memory. Traditional tools typically fall short in one or more ways:

  • They load the entire file, crashing on files larger than RAM
  • They provide no progress feedback
  • They handle interruption poorly

How It Works

stream-scan reads line-by-line with constant memory usage:

┌────────────────────────────────────────────┐
│ File: 50GB                                 │
│ ██████████████████░░░░░░░░░░░░░░░░░░░░░░░░ │
│ [42.5%] 1.2M lines, 847 matches @ 125MB/s  │
└────────────────────────────────────────────┘

Memory footprint:

  • Single line buffer (~KB)
  • Context buffer if -B/-A used (bounded)
  • Match storage (optional via callbacks)
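
A minimal sketch of this streaming discipline (illustrative only, not the module's internals): read one line at a time, so memory is bounded by the longest line rather than the file size.

use strict;
use warnings;

my $path = shift @ARGV;
open my $fh, '<', $path or die "Cannot open $path: $!";

my ($lines, $matches) = (0, 0);
while (my $line = <$fh>) {
    $lines++;                              # only the current line is in memory
    $matches++ if $line =~ /ERROR|FATAL/;
}
close $fh;
print "$matches matches in $lines lines\n";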

Installation

cd stream-scan
perl -Ilib bin/stream-scan --help

Usage

Basic Search

# Search for errors
stream-scan 'ERROR|FATAL' /var/log/huge.log

# Case-insensitive
stream-scan -i 'timeout' access.log

# Invert match (lines NOT containing pattern)
stream-scan -v 'DEBUG' app.log

Progress & Large Files

# Show progress with throughput stats
stream-scan -P 'exception' 50gb-logfile.log

# Scan compressed files via pipe
zcat huge.log.gz | stream-scan -P 'error'

# Monitor growing log
tail -f /var/log/app.log | stream-scan 'critical'

Context Lines

# 3 lines before and after (like grep -C)
stream-scan -C 3 'NullPointer' app.log

# 5 lines before only
stream-scan -B 5 'FATAL' error.log

# 2 lines after only
stream-scan -A 2 'Started' boot.log
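
Before-context only ever needs a buffer of -B lines, so memory stays bounded. A rough sketch of the idea (names and pattern are illustrative, not the tool's source):

use strict;
use warnings;

my $before = 3;       # as with -B 3
my @context;          # never holds more than $before lines

while (my $line = <STDIN>) {
    if ($line =~ /NullPointer/) {
        print @context;                          # bounded before-context
        print $line;
        @context = ();
    }
    else {
        push @context, $line;
        shift @context if @context > $before;    # drop the oldest line
    }
}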

Counting & Files

# Count matches only
stream-scan -c 'failed' auth.log

# List files with matches
stream-scan -l 'TODO' src/*.pl

# Quiet mode (exit status only)
stream-scan -q 'secret' config.txt && echo "Found!"

Line Ranges

# Start at line 1000
stream-scan --start-line 1000 'error' huge.log

# Lines 1000-2000 only
stream-scan --start-line 1000 --end-line 2000 'error' huge.log

# First 100 matches only
stream-scan -m 100 'warning' verbose.log

Field Extraction

# Extract fields 2 and 4 from CSV matches
stream-scan -F ',' -f 2,4 'ERROR' data.csv

# Tab-separated
stream-scan -F '\t' -f 1,3 'fail' data.tsv
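
Conceptually, each matching line is split on the separator and the requested 1-indexed fields are re-joined. A sketch of that behavior (not the tool's actual implementation):

use strict;
use warnings;

my $sep  = ',';        # as with -F ','
my @want = (2, 4);     # 1-indexed, as with -f 2,4

while (my $line = <STDIN>) {
    next unless $line =~ /ERROR/;
    chomp $line;
    my @cols = split /\Q$sep\E/, $line;
    print join($sep, @cols[ map { $_ - 1 } @want ]), "\n";
}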

Perl API

use StreamScan;

# Basic usage
my $scanner = StreamScan->new(
    pattern   => qr/ERROR|FATAL/,
    progress  => 1,
);
my $result = $scanner->scan_file('/var/log/app.log');

# With callbacks (constant memory)
my $scanner = StreamScan->new(
    pattern  => qr/ERROR/,
    on_match => sub {
        my $match = shift;
        print "$match->{line_num}: $match->{line}\n";
    },
    on_progress => sub {
        my $info = shift;
        printf "\r%d lines, %d matches",
            $info->{lines_read}, $info->{matches};
    },
);
$scanner->scan_file($path);

# Custom predicate
my $scanner = StreamScan->new(
    predicate => sub {
        my $line = shift;
        return length($line) > 1000;  # Lines over 1KB
    },
);
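
The pieces above can be combined. For example, a custom predicate plus an on_match callback keeps memory constant while still reporting every hit; this reuses only the constructor arguments and match fields shown above:

use StreamScan;

my $scanner = StreamScan->new(
    predicate => sub { length(shift) > 1000 },   # lines over 1KB
    on_match  => sub {
        my $match = shift;
        printf "long line %d (%d bytes)\n",
            $match->{line_num}, length $match->{line};
    },
);
$scanner->scan_file('/var/log/app.log');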

Options

Option       Description
-e PATTERN   Regex pattern to match
-i           Case-insensitive
-v           Invert match
-c           Count only
-l           List files with matches
-n           Show line numbers (default: on)
-B NUM       Lines before match
-A NUM       Lines after match
-C NUM       Context lines (before + after)
-m NUM       Stop after NUM matches
-P           Show progress indicator
-q           Quiet mode
-F SEP       Field separator
-f LIST      Fields to extract (1-indexed)

Exit Codes

Code   Meaning
0      Matches found
1      No matches
2      Error
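
From a Perl script, the exit code can be read back with the usual system() idiom (a generic Perl pattern, not specific to this tool); the code values follow the table above:

my $status = system('stream-scan', '-q', 'secret', 'config.txt');
my $code   = $status == -1 ? 2 : ($status >> 8);

if    ($code == 0) { print "Matches found\n" }
elsif ($code == 1) { print "No matches\n" }
else               { warn "stream-scan failed or reported an error\n" }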

Performance

Tested with production log files:

File Size   Lines   Time   Memory
1 GB        8M      12s    4 MB
10 GB       80M     2m     4 MB
50 GB       400M    10m    4 MB

Memory stays constant regardless of file size.

Synthetic Data Generator

Included is generate-test-data for creating test files with controlled match rates:

# Generate 100MB log with 1% ERROR lines
bin/generate-test-data -s 100M -t log -r 0.01 -m ERROR -o test.log

# Generate 1GB Apache log with progress
bin/generate-test-data -s 1G -t apache -P -o access.log

# Test directly via pipe
bin/generate-test-data -s 50M -r 0.005 -m FATAL | bin/stream-scan -P FATAL

# Reproducible output
bin/generate-test-data -s 10M --seed 42 -o deterministic.log

Data Types

Type       Description
log        Application log format (default)
apache     Apache access log format
syslog     Syslog format
json       JSON lines
csv        CSV with header
simple     Basic text lines
encoding   Mixed valid/invalid UTF-8 (for utf8-doctor testing)

Match Rate Control

The -r option controls what fraction of lines contain the match pattern:

# 0.1% match rate (1 in 1000 lines)
generate-test-data -s 100M -r 0.001 -m CRITICAL -o sparse.log

# 50% match rate (stress test)
generate-test-data -s 10M -r 0.5 -m WARNING -o dense.log
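
One straightforward way to realize a fractional match rate is a per-line random draw. A sketch of the idea (assumed behavior, not generate-test-data's actual implementation):

use strict;
use warnings;

my $rate  = 0.01;      # as with -r 0.01: roughly 1% of lines
my $token = 'ERROR';

for my $n (1 .. 100_000) {
    my $level = rand() < $rate ? $token : 'INFO';
    print "$level synthetic line $n\n";
}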

Running Tests

prove -l t/

Tests cover:

  • Pattern matching
  • Inverted matches
  • Line ranges
  • Max matches
  • Context lines (before/after)
  • Custom predicates
  • Count mode
  • Large file streaming
  • Progress callbacks
  • Throughput stats

Design Decisions

  1. Line-by-line reading: Never slurp entire file
  2. Bounded context buffer: O(context_lines), not O(file_size)
  3. Optional match storage: Use callbacks for true constant memory
  4. Signal handling: Clean Ctrl+C exit preserves partial results
  5. Progress as callback: Customizable, testable, not hardcoded
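
For example, the clean interruption in decision 4 can be achieved by trapping SIGINT, finishing the current line, and then emitting the partial summary. A sketch of that pattern (not necessarily how StreamScan implements it):

use strict;
use warnings;

my $path    = shift @ARGV;
my $pattern = qr/ERROR/;

my $interrupted = 0;
$SIG{INT} = sub { $interrupted = 1 };    # set a flag; don't die mid-line

open my $fh, '<', $path or die "Cannot open $path: $!";
my ($lines, $matches) = (0, 0);
while (my $line = <$fh>) {
    last if $interrupted;                # stop cleanly at a line boundary
    $lines++;
    $matches++ if $line =~ $pattern;
}
close $fh;
printf "%s%d lines scanned, %d matches\n",
    $interrupted ? 'interrupted: ' : '', $lines, $matches;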

When to Use Over grep

Scenario            stream-scan   grep
File > RAM          Yes           May fail
Progress needed     Yes           No
Clean interrupt     Yes           Partial
Custom predicates   Yes           No
Field extraction    Built-in      Needs cut/awk

See Also

Author

Ed Bates — TECHBLIP LLC

License

Licensed under the Apache License, Version 2.0.
