Another method + improvements #1

Swiddis · 2025-04-05T19:04:12Z

Well, now you got me curious too 😄

I refactored everything to only benchmark the actual differences, so IO is out of the benchmark now. This lets us do more iterations with more accurate measurement since the IO turns out to be expensive and noisy.

I did a few simplifications of the existing methods (mostly just using for-each loops so JVM can do more magic). I figured out how to both optimize the Regex (correct and ~3x as fast, beating the old manual loop) and sped up the manual loop as well.

I also added another lookup-table-based solution which beats out all the others by a few ms, but probably is too fancy to try and implement in the original context :)

Results:

Benchmarking SearchByContains...
Average time: 280.856 ms
Count: 229266

Benchmarking SearchByManualLoop...
Average time: 318.544 ms
Count: 229266

Benchmarking SearchByRegex...
Average time: 516.603 ms
Count: 229266

Benchmarking SearchByArray...
Average time: 211.126 ms
Count: 229266

Answering my original question (ref): it does look like the additional contains checks add a ~45% cost. Only checking the non-whitespace we get:

Benchmarking SearchBySmallContains...
Average time: 192.722 ms
Count: 229266

Of course, this cost should be negligible in context.

It actually surprises me that the manual loop still isn't winning. I found a production-grade C++ CSV serializer and saw that they were doing that approach.

for (auto ch : in) {
    if (ch == Quote || ch == Delim || ch == '\r' || ch == '\n') {
        quote_escape = true;
        break;
    }
}

JVM things, I guess. 🤷

seankao-az · 2025-04-05T21:04:39Z

too lazy to make pr so i'll just post my finding here

int count = (int) data.stream() // or parallelStream()
        .filter(line -> line.indexOf(',') >= 0 || line.indexOf('\n') >= 0 || 
                        line.indexOf('\r') >= 0 || line.indexOf('"') >= 0)
        .count();
return count;

measured for 50 runs

sh run_benchmark.sh
        This runs a simple benchmark.
        You need javac and java.
 Compiling...
 Starting benchmark!
---------------------
Benchmarking SearchByContains...
Average time: 591.065 ms
Count: 229174

Benchmarking SearchByContainsStream...
Average time: 689.763 ms
Count: 229174

Benchmarking SearchByContainsParallelStream...
Average time: 95.364 ms
Count: 229174

Benchmarking SearchByIndexOf...
Average time: 584.348 ms
Count: 229174

Benchmarking SearchByIndexOfStream...
Average time: 315.658 ms
Count: 229174

Benchmarking SearchByIndexOfParallelStream...
Average time: 56.206 ms
Count: 229174

Benchmarking SearchByManualLoop...
Average time: 825.350 ms
Count: 229174

Benchmarking SearchByRegex...
Average time: 343.118 ms
Count: 229174

Benchmarking SearchByArray...
Average time: 455.885 ms
Count: 229174

Michael-S · 2025-04-05T23:28:31Z

Thank you Swiddis! Much cleaner, and you fixed my regex error! My run:

---------------------
Benchmarking SearchByContains...
Average time: 160.932 ms
Count: 229179

Benchmarking SearchByManualLoop...
Average time: 463.657 ms
Count: 229179

Benchmarking SearchByRegex...
Average time: 399.498 ms
Count: 229179

Benchmarking SearchByArray...
Average time: 315.619 ms
Count: 229179

I guess my laptop's i7-11800H is relatively quick, though the fans make it sound like it's trying to be a helicopter when I run this and my battery life was awful from day one.

Thank you also seankao-az, though now I'm going to try to add the stream stuff too. 😄

Swiddis added 4 commits April 5, 2025 11:34

Move everything to benchmarks + fix regex

0966a64

Clean up benchmark script

229ff1e

Manual loop: use char and foreach instead of codepoints

eb6d66e

New method: search by Lookup Table

1350279

Michael-S merged commit bea94d8 into Michael-S:main Apr 5, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Another method + improvements #1

Another method + improvements #1

Uh oh!

Swiddis commented Apr 5, 2025 •

edited

Loading

Uh oh!

seankao-az commented Apr 5, 2025 •

edited

Loading

Uh oh!

Michael-S commented Apr 5, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Another method + improvements #1

Another method + improvements #1

Uh oh!

Conversation

Swiddis commented Apr 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seankao-az commented Apr 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Michael-S commented Apr 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Swiddis commented Apr 5, 2025 •

edited

Loading

seankao-az commented Apr 5, 2025 •

edited

Loading

Michael-S commented Apr 5, 2025 •

edited

Loading