Description
Usually when processing a batch of raw images, I'm either IO limited, or else the external exiftool process is CPU bound. In these circumstances on my machine I can read metadata at 10-25 files per second.
I have some raw files with the "QuickTime:JpgFromRaw" tag, which contains a roughly 3MB embedded jpeg version of the raw image. When I process this with the "-b" flag enabled, my throughput drops to less than one file per second. Process Monitor shows that my python process is CPU bound.
Profiling the code shows that the vast majority of time is spent in _read_fd_endswith. I suspected the line output += os.read(fd, block_size) so I tried factoring it out into a separate function so the profiler could measure it, and indeed, it was the culprit.
For small amounts of data this isn't a problem, but repeatedly concatenating like this is "accidentally quadratic"--every time you add to the buffer, you have to copy the previous contents of the buffer.
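To make the cost concrete, here is a minimal sketch contrasting the two accumulation patterns (the function names are mine, for illustration). CPython has no in-place fast path for `bytes` concatenation, so the `+=` version copies the whole accumulated buffer on every iteration, while the list-and-join version copies each byte only once:

```python
def concat_quadratic(chunks):
    # Each += allocates a new bytes object and copies everything
    # accumulated so far: O(n^2) bytes moved in total.
    buf = b""
    for c in chunks:
        buf += c
    return buf

def concat_linear(chunks):
    # Collect references only, then copy each byte exactly once: O(n).
    return b"".join(list(chunks))
```

A `bytearray` with `extend()` would be another linear-time option, since it grows in place with amortized reallocation.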
I did a quick and dirty test of maintaining a list of buffers output_list=[b''], appending new data with output_list.append(os.read(fd, block_size)), and joining them at the end with b"".join(output_list). This did fix the slowdown. The catch is that I only checked the most recent chunk for the termination string. That happens to work because exiftool apparently flushes its buffers at the end of the write before printing the end sentinel. But that relies on undocumented behavior of exiftool, so I don't love it as a solution.
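For reference, the quick-and-dirty version looks roughly like this (a sketch, not the actual patch; the sentinel check deliberately looks at only the latest chunk, which is the weakness described above):

```python
import os

def read_fd_endswith_fast(fd, b_endswith, block_size=4096):
    # Accumulate chunks in a list and join once at the end, avoiding
    # quadratic copying. Caveat: the sentinel is only matched against
    # the most recent chunk, so this relies on exiftool flushing its
    # output before writing the sentinel in a final, separate write.
    output_list = [b""]
    while not output_list[-1].endswith(b_endswith):
        output_list.append(os.read(fd, block_size))
    return b"".join(output_list)
```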
For my own current purposes, it's good enough. I'd like to contribute back, though. If I implement a version with robust logic for matching "b_endswith", would you be interested in a pull request?
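A robust version is possible without giving up the linear-time behavior: keep a rolling tail of the last len(b_endswith) - 1 bytes from previous reads and match the sentinel against that tail plus the newest chunk, so a sentinel split across two reads is still detected. A sketch of the idea (my own naming, not the library's API):

```python
import os

def read_fd_endswith_robust(fd, b_endswith, block_size=4096):
    # Chunks go into a list (joined once at the end, so O(n) total),
    # while the sentinel check runs only on a small rolling tail.
    output_list = []
    tail = b""
    while not tail.endswith(b_endswith):
        chunk = os.read(fd, block_size)
        if not chunk:  # EOF before the sentinel arrived
            break
        output_list.append(chunk)
        # Only the last len(b_endswith) - 1 old bytes can take part in
        # a match that straddles the chunk boundary.
        if len(b_endswith) > 1:
            tail = tail[-(len(b_endswith) - 1):] + chunk
        else:
            tail = chunk
    return b"".join(output_list)
```

This keeps per-iteration work bounded by block_size plus the sentinel length, regardless of how much output has accumulated.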