Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISSCUSSION] the memory tracking approach that better handles shared buffers #6439

Open
haohuaijin opened this issue Sep 23, 2024 · 7 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@haohuaijin
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

See this comment #6363 (comment),
when the Buffer's data is shared between many Buffer, the capacity() method will always return the total buffer memory usage, which causes the issue #6363, we need discussion the better memory track for shared buffers.

pub fn capacity(&self) -> usize {
self.data.capacity()
}

Describe the solution you'd like

The easy way to do this is to return the length and the unused capacity of the Buffer like pr #6438,

    pub fn capacity(&self) -> usize {
        self.length + self.data.capacity() - self.data.len()
    }
          This is changing what these methods report to something else that I'm not sure is correct. Whilst this formulation may benefit your use-case, there are common situations where it will now under report... Perhaps you might file a ticket to discuss a memory tracking approach that better handles shared buffers, as that is the actual issue here. This is not a bug

Originally posted by @tustvold in #6438 (comment)

But as @tustvold said, we need more discussion.

Describe alternatives you've considered

Additional context

@haohuaijin haohuaijin added the enhancement Any new improvement worthy of a entry in the changelog label Sep 23, 2024
@tustvold
Copy link
Contributor

#3960 is likely related.

Tagging @jhorstmann @waynexia @alamb

@waynexia
Copy link
Member

I personally prefer the allocator approach which gives the most accurate stats in theory. Given we in some aspects encourage sharing underlying buffer and arrays, only changing the behavior of .capacity() would still cause problems for end users when they want to track how many physical memories are used somewhere (e.g., in one query session).

@alamb
Copy link
Contributor

alamb commented Sep 24, 2024

I personally prefer the allocator approach which gives the most accurate stats in theory.

I agree with @waynexia -- if the usecase is "accurately track the total memory used across some number of Arrays which (potentially) share underlying Buffers", I think this needs a hook into the underlying allocator. There is proposed PR #6336 which might be worth a look (I haven't had a chance to really understand it yet myself)

@kszlim
Copy link
Contributor

kszlim commented Oct 19, 2024

I don't know if this is the right place to post this, but if this is being reworked, it would also be very useful to have a total memory usage method available within builders, especially for nested stuff. I'm writing a parser which outputs record batches at a static number of rows, but that doesn't work very well when the goal is to keep memory usage roughly constant. It would be nice to be able to get a roughly accurate measure of memory usage in builders in addition to arrays/buffers.

@tustvold
Copy link
Contributor

tustvold commented Oct 19, 2024

Builders are an interesting point, but come with two potential challenges to be aware of:

  • Builders tend to over-allocate to avoid repeatedly reallocating
  • Updating memory accounting per element is likely prohibitively expensive

This means any "eager" memory computation is likely to work poorly, however, the lazy approach as currently implemented for arrays could actually work quite well.

I could see a world where we remove lazy memory tracking for arrats in favour of eager allocation tracking, and add lazy memory tracking for builders

@kszlim
Copy link
Contributor

kszlim commented Oct 19, 2024

That would be super useful for my use case

@tustvold
Copy link
Contributor

tustvold commented Oct 19, 2024

As this appears to have stalled slightly I have created a POC in #6590 for what I think eager memory tracking for arrays could look like.

I don't really have the capacity to drive this initiative, but I think the approach in the PR should be relatively straightforward for someone to pick up and run with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

5 participants