Skip to content

Conversation

@Swiddis
Copy link
Contributor

@Swiddis Swiddis commented Apr 5, 2025

Well, now you got me curious too 😄

I refactored everything to only benchmark the actual differences, so IO is out of the benchmark now. This lets us do more iterations with more accurate measurement since the IO turns out to be expensive and noisy.

I did a few simplifications of the existing methods (mostly just using for-each loops so JVM can do more magic). I figured out how to both optimize the Regex (correct and ~3x as fast, beating the old manual loop) and sped up the manual loop as well.

I also added another lookup-table-based solution which beats out all the others by a few ms, but probably is too fancy to try and implement in the original context :)

Results:

Benchmarking SearchByContains...
Average time: 280.856 ms
Count: 229266

Benchmarking SearchByManualLoop...
Average time: 318.544 ms
Count: 229266

Benchmarking SearchByRegex...
Average time: 516.603 ms
Count: 229266

Benchmarking SearchByArray...
Average time: 211.126 ms
Count: 229266

Answering my original question (ref): it does look like the additional contains checks add a ~45% cost. Only checking the non-whitespace we get:

Benchmarking SearchBySmallContains...
Average time: 192.722 ms
Count: 229266

Of course, this cost should be negligible in context.


It actually surprises me that the manual loop still isn't winning. I found a production-grade C++ CSV serializer and saw that they were doing that approach.

for (auto ch : in) {
    if (ch == Quote || ch == Delim || ch == '\r' || ch == '\n') {
        quote_escape = true;
        break;
    }
}

JVM things, I guess. 🤷

@seankao-az
Copy link

seankao-az commented Apr 5, 2025

too lazy to make pr so i'll just post my finding here

int count = (int) data.stream() // or parallelStream()
        .filter(line -> line.indexOf(',') >= 0 || line.indexOf('\n') >= 0 || 
                        line.indexOf('\r') >= 0 || line.indexOf('"') >= 0)
        .count();
return count;

measured for 50 runs

sh run_benchmark.sh
        This runs a simple benchmark.
        You need javac and java.
 Compiling...
 Starting benchmark!
---------------------
Benchmarking SearchByContains...
Average time: 591.065 ms
Count: 229174

Benchmarking SearchByContainsStream...
Average time: 689.763 ms
Count: 229174

Benchmarking SearchByContainsParallelStream...
Average time: 95.364 ms
Count: 229174

Benchmarking SearchByIndexOf...
Average time: 584.348 ms
Count: 229174

Benchmarking SearchByIndexOfStream...
Average time: 315.658 ms
Count: 229174

Benchmarking SearchByIndexOfParallelStream...
Average time: 56.206 ms
Count: 229174

Benchmarking SearchByManualLoop...
Average time: 825.350 ms
Count: 229174

Benchmarking SearchByRegex...
Average time: 343.118 ms
Count: 229174

Benchmarking SearchByArray...
Average time: 455.885 ms
Count: 229174

@Michael-S Michael-S merged commit bea94d8 into Michael-S:main Apr 5, 2025
@Michael-S
Copy link
Owner

Michael-S commented Apr 5, 2025

Thank you Swiddis! Much cleaner, and you fixed my regex error! My run:

---------------------
Benchmarking SearchByContains...
Average time: 160.932 ms
Count: 229179

Benchmarking SearchByManualLoop...
Average time: 463.657 ms
Count: 229179

Benchmarking SearchByRegex...
Average time: 399.498 ms
Count: 229179

Benchmarking SearchByArray...
Average time: 315.619 ms
Count: 229179

I guess my laptop's i7-11800H is relatively quick, though the fans make it sound like it's trying to be a helicopter when I run this and my battery life was awful from day one.

Thank you also seankao-az, though now I'm going to try to add the stream stuff too. 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants