
Fix async image load #2006


Merged
24 commits merged on Mar 15, 2022

Conversation

kroymann (Contributor)

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

This addresses the issue outlined in #1997 where Image.LoadAsync(stream) would incorrectly use synchronous IO if the stream is seekable. The fix is simply to remove the optimization that avoids prebuffering the stream into an in-memory buffer if the stream is seekable, and instead always prebuffer the stream. This ensures that async IO is always used to read the stream.
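For illustration, a minimal sketch of the prebuffering idea (hypothetical names only; this is not the actual ImageSharp implementation): the source stream is always copied into an in-memory buffer with async IO before decoding, so the decoder never issues synchronous reads against the caller's stream.

using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class PrebufferSketch
{
    // Hypothetical helper: copy the caller's stream into memory asynchronously,
    // then let the decoder read from the in-memory copy.
    public static async Task<MemoryStream> PrebufferAsync(Stream source, CancellationToken cancellationToken = default)
    {
        var buffer = new MemoryStream();

        // CopyToAsync uses async IO regardless of whether 'source' is seekable.
        await source.CopyToAsync(buffer, 81920, cancellationToken).ConfigureAwait(false);

        buffer.Position = 0;
        return buffer;
    }
}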

kroymann and others added 5 commits February 14, 2022 16:01
@JimBobSquarePants (Member)

Everything looks good so far from an implementation perspective. We'll need to figure out a strategy now to compare the new async performance with main.

@JimBobSquarePants (Member)

@antonfirsov I just pushed a change to ChunkedMemoryStream that allows the size of the pooled buffer chunks to scale with the total allocation.

Allocation block sizes follow this scale:

LargeChunks = Min(4MB, MemoryAllocator.BufferCapacityInBytes())
LargeChunkThreshold = Min(1MB, LargeChunks / 4)
SmallChunks = Min(128KB, LargeChunks / 32)

Once the total allocation within the stream hits our threshold we immediately switch to allocating larger chunk sizes.

I chose these values for the following reasons:

  • Small @ 128KB because that's the default chunk size RecyclableMemoryStream uses.
  • Threshold @ 1MB because that's the threshold of the ArrayPool. I was tempted to go for 512KB here.
  • Large @ 4MB because that matches the best-performing pool size for our current allocator.

I'm not precious about any of these values, so if you feel they should change then please suggest alternatives.
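For illustration, a minimal C# sketch of those scaling rules (GetChunkSizes and the bufferCapacityInBytes parameter are illustrative names, not the actual ChunkedMemoryStream members; the parameter stands in for MemoryAllocator.BufferCapacityInBytes()):

using System;

public static class ChunkScaleSketch
{
    public static (int SmallChunks, int LargeChunks, int LargeChunkThreshold) GetChunkSizes(int bufferCapacityInBytes)
    {
        // LargeChunks = Min(4MB, MemoryAllocator.BufferCapacityInBytes())
        int largeChunks = Math.Min(4 * 1024 * 1024, bufferCapacityInBytes);

        // LargeChunkThreshold = Min(1MB, LargeChunks / 4)
        int largeChunkThreshold = Math.Min(1024 * 1024, largeChunks / 4);

        // SmallChunks = Min(128KB, LargeChunks / 32)
        int smallChunks = Math.Min(128 * 1024, largeChunks / 32);

        return (smallChunks, largeChunks, largeChunkThreshold);
    }
}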

@JimBobSquarePants (Member) left a comment


I'm approving this but want a second opinion before merging.

@antonfirsov (Member)

antonfirsov commented Feb 26, 2022

I want to take a look next week. Quite distracted from everything programming-related thanks to the ongoing Fourth Reich experiment on my continent, right in the neighborhood.

@JimBobSquarePants (Member)

Of course mate. Absolute horror show over there!

@antonfirsov (Member)

antonfirsov commented Mar 13, 2022

@kroymann can you please give me permission to push to your branch? (I thought this was always allowed for PR branches from contributors.)
I want to push my load test changes:
antonfirsov@63e897e

As an alternative, feel free to cherry pick it.

@antonfirsov (Member)

antonfirsov commented Mar 13, 2022

I did before/after runs of asynchronous load testing:

  • It looks like I can repro the thread pool starvation on main: the synchronous parallel run finishes in about 6.3 seconds on my 10-core machine, while the async version needs ~9.2 seconds.
  • The PR seems to fix this nicely, bringing the async execution time down close to the synchronous time.
  • Unfortunately, there is a significant ~25% memory regression.

We should try to address it. I'll take a quick look at the code changes now.

Before: [benchmark screenshot]

After: [benchmark screenshot]
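For reference, a hedged sketch of the kind of parallel load test being compared here (this is not the actual test in antonfirsov@63e897e; the type and method names are illustrative):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using SixLabors.ImageSharp;

public static class LoadTestSketch
{
    // Decode every file 'iterations' times in parallel with Image.LoadAsync
    // and return the total elapsed time.
    public static async Task<TimeSpan> MeasureAsyncLoadAsync(string[] files, int iterations)
    {
        var stopwatch = Stopwatch.StartNew();
        var tasks = new List<Task>();

        for (int i = 0; i < iterations; i++)
        {
            foreach (string file in files)
            {
                tasks.Add(Task.Run(async () =>
                {
                    using FileStream stream = File.OpenRead(file);
                    using Image image = await Image.LoadAsync(stream);
                }));
            }
        }

        await Task.WhenAll(tasks);
        return stopwatch.Elapsed;
    }
}

A synchronous counterpart doing the same work with Image.Load inside Parallel.ForEach would give the baseline these numbers are compared against.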

@antonfirsov (Member) left a comment


OK, I have no immediate ideas for improving this.

If you guys both agree that 25% extra memory footprint is an acceptable price for correct async, I'm fine merging this. We can open an issue to track the sync/async memory gap.

Just please remember to merge my load test addition!

@antonfirsov (Member)

Btw @kroymann thanks for discovering the problem, and pushing through the fix, really great job!

@JimBobSquarePants (Member)

JimBobSquarePants commented Mar 14, 2022

@antonfirsov Your permissions issues were likely due to Git LFS issues with GitHub.

There's a workaround that should allow you to push to the fork.

rm .git/hooks/pre-push

I'm assuming the memory overheads come from my buffer allocation? Would a drop to 2MB segments be useful?

@kroymann (Contributor, Author)

@antonfirsov I think @JimBobSquarePants is right about the GitHub permissions, but I went ahead and cherry-picked your commit just to expedite things. Thank you for adding those tests!

@antonfirsov (Member)

antonfirsov commented Mar 14, 2022

> I'm assuming the memory overheads come from my buffer allocation? Would a drop to 2MB segments be useful?

2MB logical allocations would still lead to 4MB actual allocations, since the pool's buffers are uniform in size.

Actually, there might be a way to create a smarter buffer growing strategy.

This is how subsequent chunk sizes are allocated currently, if I understand it right:

128K -> 128K -> 128K -> 128K -> 128K -> 128K -> 128K -> 128K ->
4MB -> 4MB -> 4MB  -> ...

This stresses the 128K ArrayPool bucket unnecessarily and often leads to a wasteful allocation at the end.

Instead, I would try to do something like:

128K -> 128K -> 128K -> 128K ->
256K -> 256K -> 256K -> 256K ->
512K -> 512K -> 512K -> 512K ->
1MB -> 1MB -> 1MB -> 1MB ->
[skip 2MB]
4MB -> 4MB -> 4MB -> 4MB -> 4MB -> 4MB -> 4MB -> 4MB -> ...

This would keep most of the buffers rented from ArrayPool.Shared without stressing individual buckets too much and avoid wasting space. The unmanaged pool wouldn't be touched for images smaller than 7 MB.

@JimBobSquarePants (Member)

Yeah, that's how it's allocated currently. Shouldn't be too hard to update it to follow your pattern.

@antonfirsov (Member)

antonfirsov commented Mar 14, 2022

Here is a closed formula to get the i-th chunk size according to my logic, assuming you increment i each time you allocate a chunk:

private static int GetChunkSize(int i)
{
    const int _128K = 1 << 17; // 128 KiB
    const int _4M = 1 << 22;   // 4 MiB

    // Double the chunk size after every 4 chunks (128K, 256K, 512K, 1MB),
    // then jump straight to 4MB, skipping 2MB.
    return i < 16 ? _128K * (1 << (i / 4)) : _4M;
}
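As a quick sanity check (illustrative only, assuming it is run from within the same class), enumerating the first values of this formula reproduces the proposed schedule:

for (int i = 0; i < 20; i++)
{
    // Prints 128K x4, 256K x4, 512K x4, 1M x4, then 4M from i = 16 onward.
    Console.WriteLine($"chunk {i}: {GetChunkSize(i) / 1024} KiB");
}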

Tomorrow is the last day I have access to my benchmark machine to redo the benchmarks; I will be away for two weeks afterwards.
PS: I just realized that tomorrow has already started in Australia 😆

@JimBobSquarePants (Member)

It's always tomorrow in Aus!

That's the changes pushed. I added a check to ensure we always use the minimum of the allocator capacity and the calculated size, in case we change things in the future or someone implements their own allocator.

@antonfirsov (Member)

antonfirsov commented Mar 14, 2022

Looks like there is a slight improvement. The total memory gap seems to be around 20% now.

[benchmark screenshot]

@antonfirsov (Member) left a comment


LGTM, thanks for the patience!

Co-authored-by: Anton Firszov <antonfir@gmail.com>
Labels
breaking Signifies a binary breaking change.