Improve text file reading performance #961

zoziha · 2023-09-01T08:07:25Z

Description

use a smaller buffer size in getline;
update read_lines to use binary reading;
fix CRLF handling.

Use smaller buffer size in getline

I'm trying to improve the efficiency of reading text files:

removing the number_of_rows routine;
using a smaller buffer size;
using advance='yes' reads.

Local measurements show that all three improve read efficiency to some extent, but none of them gives an order-of-magnitude improvement.
Of the three, using a smaller buffer size requires the smallest change to the fpm code. I tested on Windows and on Ubuntu Linux, and the trends are basically the same in both environments. The timing results for both environments are given below:

Time taken to read one 177-line *.f90 file 1000 times:
Compared with a buffer of 32768, a smaller line-length buffer such as 1024 (toml-f uses 4096) is a better match for the files fpm typically reads, and it gives a 26%~52% improvement in read performance.

(Win: Windows OS; GFortran: GCC Fortran; IFX: Intel oneAPI ifx)
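To make the buffer-size discussion concrete, here is a minimal sketch of a chunked line reader. It is not fpm's actual getline; the constant BUFFER_SIZE, the program name, and the scratch-file contents are assumptions made for illustration. Each line is assembled from non-advancing reads of at most BUFFER_SIZE characters, so a smaller buffer means less work per read for the short lines typical of source files, while long lines still work through repeated reads.

```fortran
program getline_demo
    implicit none
    ! Assumed chunk size, taken from the value discussed above
    integer, parameter :: BUFFER_SIZE = 1024
    character(len=:), allocatable :: line
    integer :: unit, iostat, nlines, longest

    ! Build a small scratch file; one line is longer than the buffer
    open (newunit=unit, status='scratch', action='readwrite')
    write (unit, '(a)') 'program hello'
    write (unit, '(a)') repeat('x', 3000)
    write (unit, '(a)') 'end program hello'
    rewind (unit)

    nlines = 0
    longest = 0
    do
        call getline(unit, line, iostat)
        if (iostat /= 0) exit
        nlines = nlines + 1
        longest = max(longest, len(line))
    end do
    close (unit)

    print '(a, i0, a, i0)', 'lines: ', nlines, ', longest: ', longest
    if (nlines /= 3) error stop 'expected 3 lines'
    if (longest /= 3000) error stop 'long line was truncated'

contains

    !> Read one line of arbitrary length in chunks of BUFFER_SIZE characters
    subroutine getline(unit, line, iostat)
        integer, intent(in) :: unit
        character(len=:), allocatable, intent(out) :: line
        integer, intent(out) :: iostat
        character(len=BUFFER_SIZE) :: buffer
        integer :: nread

        line = ''
        do
            nread = 0
            read (unit, '(a)', advance='no', iostat=iostat, size=nread) buffer
            if (iostat > 0) return              ! genuine I/O error
            line = line // buffer(:nread)
            if (iostat < 0) exit                ! end of record or end of file
        end do
        ! End of record just means the line is complete
        if (is_iostat_eor(iostat)) iostat = 0
    end subroutine getline

end program getline_demo
```

The buffer is a fixed-length local variable, so its size sets the cost of each read regardless of how short the line is.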

Pseudocode

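use fpm_filesystem, only: read_lines
...
! tmr is a timer object with tic()/toc() methods (declarations elided)
open (1, file='src/readfile.f90', status='old', action='read')
call tmr%tic()
do i = 1, 1000
    rewind (1)               ! re-read the same file from the start
    lines = read_lines(1)
end do
print *, 'Elapsed time: ', tmr%toc(), 's'
close (1)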

Also see this repo.

Update read_lines using binary reading

I tried reading text files in C and found it much faster than in Fortran. Taking a cue from @Euler-37, I switched to reading text files in binary mode, which looks like the ideal approach; similar code can be found in fortran-lang/http-client.

Binary reading skips the formatted-I/O processing. The original fpm-0.9.0 took 0.7970s to read the file, while the current solution takes only 0.062s, an order-of-magnitude improvement. When I run the command time fpm build --show-model in my local fpm repository:

fpm-0.9.0: time consumed 0:01.24 s;
this PR: time consumed 0:00.86 s.

That's a 30.65% reduction in run time, which I think is worth celebrating.
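The binary read described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the function name read_whole_file and the temporary file name are hypothetical. The file is opened with stream access in unformatted mode, its size is queried with inquire, and the whole content is read into one character variable in a single read statement.

```fortran
program read_whole_file_demo
    implicit none
    ! Hypothetical temporary file name used only for this demo
    character(len=*), parameter :: path = 'read_whole_file_demo.tmp'
    character(len=:), allocatable :: content
    integer :: unit

    ! Write two LF-terminated lines as raw bytes (8 + 1 + 8 + 1 = 18 bytes)
    open (newunit=unit, file=path, status='replace', action='write', &
          access='stream', form='unformatted')
    write (unit) 'line one'//new_line('a')//'line two'//new_line('a')
    close (unit)

    content = read_whole_file(path)
    print '(a, i0, a)', 'read ', len(content), ' bytes'
    if (len(content) /= 18) error stop 'unexpected file size'

    ! Clean up the temporary file
    open (newunit=unit, file=path, status='old')
    close (unit, status='delete')

contains

    !> Read the entire file into one string with a single unformatted read
    function read_whole_file(filename) result(string)
        character(len=*), intent(in) :: filename
        character(len=:), allocatable :: string
        integer :: fh, length

        open (newunit=fh, file=filename, status='old', action='read', &
              access='stream', form='unformatted')
        inquire (unit=fh, size=length)       ! file size in bytes
        allocate (character(len=length) :: string)
        if (length > 0) read (fh) string     ! no per-line formatting work
        close (fh)
    end function read_whole_file

end program read_whole_file_demo
```

Because the bytes arrive unmodified, line endings (LF or CRLF) remain in the string and have to be handled when the content is split into lines.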

zoziha · 2023-09-05T09:03:20Z

Ensure thread safety

For thread safety, local allocatable arrays are now used to record the start and end indexes of the lines. This costs a little performance, but it may lay the groundwork for parallel binary reads later.
On Windows, fpm build --show-model shows an 18.81% performance improvement.
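A rough sketch of the idea of recording line boundaries in local arrays, with CRLF handled along the way, is given below. The subroutine name line_bounds and its details are assumptions for illustration and may differ from the split_first_last routine added in this PR. Because first and last are allocated per call rather than kept in saved module state, concurrent callers do not share them.

```fortran
program line_bounds_demo
    implicit none
    character(len=*), parameter :: CR = achar(13), LF = achar(10)
    character(len=:), allocatable :: text
    integer, allocatable :: first(:), last(:)
    integer :: i

    ! Mixed line endings: CRLF, LF, an empty line, and no trailing newline
    text = 'alpha'//CR//LF//'beta'//LF//LF//'gamma'
    call line_bounds(text, first, last)

    do i = 1, size(first)
        print '(i0, a, a, a)', i, ': [', text(first(i):last(i)), ']'
    end do
    if (size(first) /= 4) error stop 'expected 4 lines'
    if (text(first(1):last(1)) /= 'alpha') error stop 'CR was not stripped'

contains

    !> Record the start and end index of every line in string.
    !> A CR directly before an LF is excluded from the line.
    subroutine line_bounds(string, first, last)
        character(len=*), intent(in) :: string
        integer, allocatable, intent(out) :: first(:), last(:)
        integer :: pos, eol, n, stop_at, nlf, k

        ! Upper bound on the number of lines: number of LF characters plus one
        nlf = count([(string(k:k) == LF, k=1, len(string))])
        allocate (first(nlf + 1), last(nlf + 1))

        n = 0
        pos = 1
        do while (pos <= len(string))
            eol = index(string(pos:), LF)
            if (eol == 0) then
                stop_at = len(string)        ! last line has no newline
            else
                stop_at = pos + eol - 2      ! character just before the LF
            end if
            n = n + 1
            first(n) = pos
            last(n) = stop_at
            if (stop_at >= pos) then
                if (string(stop_at:stop_at) == CR) last(n) = stop_at - 1
            end if
            if (eol == 0) exit
            pos = pos + eol
        end do

        ! Trim to the number of lines actually found
        first = first(:n)
        last = last(:n)
    end subroutine line_bounds

end program line_bounds_demo
```

An empty line is represented by last(i) = first(i) - 1, which yields a zero-length substring.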

By the way, here is a hotspot diagram of a run (fpm-debug build --show-model) taken with Intel VTune on Windows:

src/fpm_filesystem.F90

perazz

LGTM, thanks @zoziha.

zoziha · 2023-12-19T17:53:30Z

This PR changes the way fpm reads text files, from reading characters line by line to reading all bytes at once in binary mode. This may reduce the time it takes to read files and changes little else in fpm's behavior:

Reduce the buffer length in getline to suit fpm's use case;
Add read_text_file, which reads the content of a text file in binary mode.

There is nothing left to update in this PR. If the change in how files are read is considered beneficial, this PR can be merged.

henilp105 · 2024-03-29T05:57:23Z

@zoziha Is this PR ready to merge? I have resolved the conflicts.

henilp105

Thanks @zoziha, looks good to me.

zoziha · 2024-03-29T06:30:46Z

Thanks for reviewing, @henilp105. Okay, nothing more to add, let's merge it.

reduce the buffer size in getline

c049115

zoziha requested review from awvwgk and urbanjost September 1, 2023 08:07

improve read_lines: use binary reading

a6da02b

zoziha marked this pull request as draft September 1, 2023 12:07

zoziha added 2 commits September 1, 2023 20:17

fix read_lines in list_files

ea84821

read_lines uses the same static array idx

1feebe6

zoziha force-pushed the buffer-1 branch from 3a0abd0 to 1feebe6 Compare September 1, 2023 15:45

zoziha added 3 commits September 5, 2023 16:38

fix CRLF

c0b8643

add split_first_last

92b6e50

fix read_lines

a16f4b5

zoziha marked this pull request as ready for review September 5, 2023 09:03

zoziha changed the title ~~Reduce the buffer size in getline~~ Improve text file reading performance Sep 5, 2023

perazz reviewed Dec 19, 2023

perazz approved these changes Dec 19, 2023

add read_text_file

067cc3c

Merge branch 'main' into buffer-1

4cf8c21

henilp105 approved these changes Mar 29, 2024

henilp105 merged commit d3dd5d4 into fortran-lang:main Mar 29, 2024
23 checks passed