Description
Usually when processing a batch of raw images, I'm either IO limited, or else the external exiftool process is CPU bound. In these circumstances on my machine I can read metadata at 10-25 files per second.
I have some raw files with the "QuickTime:JpgFromRaw" tag, which contains a roughly 3MB embedded jpeg version of the raw image. When I process this with the "-b" flag enabled, my throughput drops to less than one file per second. Process Monitor shows that my python process is CPU bound.
Profiling the code shows that the vast majority of time is spent in _read_fd_endswith. I suspected the line output += os.read(fd, block_size) so I tried factoring it out into a separate function so the profiler could measure it, and indeed, it was the culprit.
For small amounts of data this isn't a problem, but repeatedly concatenating like this is "accidentally quadratic"--every time you add to the buffer, you have to copy the previous contents of the buffer.
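To make the cost concrete, here is a minimal sketch contrasting the two accumulation patterns (the function names are mine, for illustration). CPython has no in-place fast path for `bytes` concatenation, so the `+=` version copies the whole accumulated buffer on every iteration, while the list-and-join version copies each byte only once:

```python
def concat_quadratic(chunks):
    # Each += allocates a new bytes object and copies everything
    # accumulated so far: O(n^2) bytes moved in total.
    buf = b""
    for c in chunks:
        buf += c
    return buf

def concat_linear(chunks):
    # Collect references only, then copy each byte exactly once: O(n).
    return b"".join(list(chunks))
```

A `bytearray` with `extend()` would be another linear-time option, since it grows in place with amortized reallocation.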
I did a quick and dirty test of maintaining a list of buffers output_list=[b''], appending new data with output_list.append(os.read(fd, block_size)), and joining them at the end with b"".join(output_list). This did fix the slowdown. The catch is that I only checked the most recent chunk for the termination string. That happens to work because exiftool apparently flushes its buffers at the end of the write before printing the end sentinel. But that relies on undocumented behavior of exiftool, so I don't love it as a solution.
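For reference, the quick-and-dirty version looks roughly like this (a sketch, not the actual patch; the sentinel check deliberately looks at only the latest chunk, which is the weakness described above):

```python
import os

def read_fd_endswith_fast(fd, b_endswith, block_size=4096):
    # Accumulate chunks in a list and join once at the end, avoiding
    # quadratic copying. Caveat: the sentinel is only matched against
    # the most recent chunk, so this relies on exiftool flushing its
    # output before writing the sentinel in a final, separate write.
    output_list = [b""]
    while not output_list[-1].endswith(b_endswith):
        output_list.append(os.read(fd, block_size))
    return b"".join(output_list)
```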
For my own current purposes, it's good enough. I'd like to contribute back, though. If I implement a version with robust logic for matching "b_endswith", would you be interested in a pull request?
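A robust version is possible without giving up the linear-time behavior: keep a rolling tail of the last len(b_endswith) - 1 bytes from previous reads and match the sentinel against that tail plus the newest chunk, so a sentinel split across two reads is still detected. A sketch of the idea (my own naming, not the library's API):

```python
import os

def read_fd_endswith_robust(fd, b_endswith, block_size=4096):
    # Chunks go into a list (joined once at the end, so O(n) total),
    # while the sentinel check runs only on a small rolling tail.
    output_list = []
    tail = b""
    while not tail.endswith(b_endswith):
        chunk = os.read(fd, block_size)
        if not chunk:  # EOF before the sentinel arrived
            break
        output_list.append(chunk)
        # Only the last len(b_endswith) - 1 old bytes can take part in
        # a match that straddles the chunk boundary.
        if len(b_endswith) > 1:
            tail = tail[-(len(b_endswith) - 1):] + chunk
        else:
            tail = chunk
    return b"".join(output_list)
```

This keeps per-iteration work bounded by block_size plus the sentinel length, regardless of how much output has accumulated.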