Open
Conversation
Two-worker pcntl_fork architecture that splits CSV at newline-aligned midpoint. Each worker uses 1MB buffered fread with explode line splitting and O(1) comma detection via strlen-26 (exploiting fixed 25-char timestamp suffix). Results merged via igbinary/serialize IPC through temp files. Inner dates sorted with ksort; outer key order preserved (insertion order). gc_disable eliminates unnecessary cyclic GC overhead.
|
Benchmarks finished with success. Workflow run: https://github.com/tempestphp/100-million-row-challenge/actions/runs/22422666557 |
…trpos scanning, binary IPC, and manual JSON output
Replace associative-array architecture with flat integer indexing for the hot
path. A discover phase scans the first 1 MB to map paths and dates to integer
IDs in first-encounter order. processSegment now uses in-buffer strpos()
scanning instead of explode("\n"), and counts into a flat array indexed by
pathId * dateCount + dateId. Workers communicate via pack('V*') binary IPC
(~1.9 MB per child vs ~50-140 MB with serialize). Manual JSON streaming
output matches json_encode(JSON_PRETTY_PRINT) byte-for-byte. Prefers
/dev/shm for temp files with sys_get_temp_dir() fallback. Parallelism
increased from 2 to 4 workers. Local macOS benchmark: 8.55s for 100M rows.
Author
|
Benchmarks finished with failure. Workflow run: https://github.com/tempestphp/100-million-row-challenge/actions/runs/22424378208 |
1 MB scan (~13.6K lines) has ~5.6% chance of missing any given date out of ~1818 unique dates, statistically losing ~1 date and ~55K rows. Increased to 8 MB (~109K lines) where miss probability is ~0.
Member
|
Benchmarking complete! Mean execution time: 12.58167775836s |
Member
|
You've improved your result! Have a cookie: 🍪 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Approach
Flat-integer-array architecture with discover phase, 4 parallel workers,
in-buffer strpos() scanning, and binary IPC.
Key optimizations
to sequential integer IDs in first-encounter order
counts[pathId * dateCount + dateId]replacesnested associative arrays — eliminates hash-table overhead in the hot path
pcntl_fork()with newline-aligned file segment splittingread buffer without
explode("\n")intermediate string arraysvs ~50-140 MB with serialize) via temp files
json_encode(JSON_PRETTY_PRINT)byte-for-byte, including
\/escapingsys_get_temp_dir()fallbackAdditional optimizations
strlen($line) - 26exploits the fixed 25-charISO 8601 timestamp suffix
lineLen > 45skips malformed/empty linesCorrectness
/in path keysksort()— preserves first-encounter insertion orderksort()json_encode(JSON_PRETTY_PRINT)byte-for-bytedata:validatepasses