Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve text file reading performance #961

Merged
merged 9 commits into from
Mar 29, 2024
Merged

Conversation

zoziha
Copy link
Contributor

@zoziha zoziha commented Sep 1, 2023

Description

  • using smaller buffer size in getline;
  • update read_lines using binary reading;
  • fix CRLF.

Use smaller buffer size in getline

I'm trying to improve the efficiency of reading text files:

  1. removing number_of_rows routine;
  2. using smaller buffer size;
  3. using advance='yes' read.

Local data proves that all three of them can improve read efficiency to some extent. However, they fail to have an order of magnitude improvement effect.
Among them, using a smaller buffer size is the least change to the fpm code, I tested in Windows OS and Ubuntu Linux environment, the two trends are basically the same, the following gives the time-consuming evaluation image under Windows OS and Ubuntu Linux environment:

Time consumed to read a certain 177-line *.f90 file 1000 times:
Compared to 32768, using a smaller line length buffer, such as 1024 (toml-f using 4096), is more in line with fpm's common file read scenarios, and at the same time we can get a 26%~52% read performance improvement.
image
(Win: Windows OS; GFortran: GCC Fortran; IFX: Intel oneAPI ifx)

Pseudocode
use fpm_filesystem, only: read_lines
...
open (1, file='src/readfile.f90', status='old', action='read')
call tmr%tic()
do i = 1, 1000
    rewind (1)
    lines = read_lines(1)
end do
print *, 'Elapsed time: ', tmr%toc(), 's'

Also see this repo.

Related links

@zoziha
Copy link
Contributor Author

zoziha commented Sep 1, 2023

Update read_lines using binary reading

I tried to read text files in C and found it much faster than Fortran. Taking a cue from @Euler-37 , I used the binary way of reading text files, which is the ideal reader, and you can see similar code in fortran-lang/http-client.

Using binary reading ditches the encoding formatting process, and while the original fpm-0.9.0 took 0.7970s to read the file, the current solution only takes 0.062s, an order of magnitude improvement. When I run the command time fpm build --show-model in my local fpm repository:

  • fpm-0.9.0: time consumed 0:01.24 s;
  • this PR: time consumed 0:00.86 s.

That's a 30.65% speedup, which I think is worth celebrating.

@zoziha zoziha marked this pull request as draft September 1, 2023 12:07
@zoziha
Copy link
Contributor Author

zoziha commented Sep 5, 2023

Ensure thread safety

For thread-safety, local allocatable arrays are used to record the start and end indexes of the lines, which reduces performance a bit, but may be able to lay the groundwork for subsequent parallel binary reads.
On Windows, fpm build --show-model has an 18.81% performance improvement.

By the way, I'm posting here a running hotspot diagram (fpm-debug build ---show-model) using Intel Vtune for Windows:
image

@zoziha zoziha marked this pull request as ready for review September 5, 2023 09:03
@zoziha zoziha changed the title Reduce the buffer size in getline Improve text file reading performance Sep 5, 2023
src/fpm_filesystem.F90 Outdated Show resolved Hide resolved
Copy link
Contributor

@perazz perazz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @zoziha.

@zoziha
Copy link
Contributor Author

zoziha commented Dec 19, 2023

This PR changes the way fpm reads text files from reading characters by line to reading all binary bytes at once, which may reduce the time it takes to read files, and doesn't change much of fpm's other behavior:

  1. Reduced the cache length in getline to adapt to the fpm scenario;
  2. Add read_text_file binary mode to read the content of the text file.

There is nothing left to update in this PR, and if the change in the way the file is read is considered beneficial, then this PR is passable.

@henilp105
Copy link
Member

@zoziha Is this PR ready to merge ? , I have resolved the conflicts.

Copy link
Member

@henilp105 henilp105 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zoziha , Looks good to me.

@zoziha
Copy link
Contributor Author

zoziha commented Mar 29, 2024

Thanks for reviewing, @henilp105 . Okay, nothing more to add, let's merge it.

@henilp105 henilp105 merged commit d3dd5d4 into fortran-lang:main Mar 29, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants