-
Notifications
You must be signed in to change notification settings - Fork 831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
impl std::fmt::Write for StringViewBuilder / BinaryViewBuilder #6777
base: main
Are you sure you want to change the base?
Conversation
I think this will perform identically to just writing to a temporary I wonder if we could do something a bit more sophisticated, but then again perhaps it won't make a difference. I like the Drop based flush, although it does introduce a bit of API inconsistency... |
@@ -88,6 +86,9 @@ pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> { | |||
/// map `<string hash> -> <index to the views>` | |||
string_tracker: Option<(HashTable<usize>, ahash::RandomState)>, | |||
phantom: PhantomData<T>, | |||
/// temporary space used by [`ByteViewFormatter`] | |||
/// to avoid extra allocations | |||
temp_buffer: Option<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation allocates a single String which is used for all write!
calls to the same StringViewBuilder, which should keep allocations to a mininum.
However, this implementation will copy the strings twice:
- once into the temp buffer
String
- once into the final view
For short strings (less than 12 bytes) this copy is likely required anyways (unless we are manipulating the view buffer directly)
For longer strings (greater than 12 bytes), we could avoid the extra copy into the String
buffer by writing directly into the underlying buffer.
I actually tried to avoid the second copy, but found it was quite complex due to:
- Having to respect the string_tracker (deduplication)
- handle the block size re-allocation correctly
Thus I went with the simple approach here which should get most of the benefits, and figured we can optimize further if needed.
/// - String buffer count exceeds `u32::MAX` | ||
/// - String length exceeds `u32::MAX` | ||
#[inline] | ||
pub fn append_str(&mut self, value: &str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be a fancier way to write the generics to avoid introdicing a append_str
and append_bytes
API, but I couldn't find it.
/// | ||
/// Safety: caller must ensure value is a valid native type (e.g. valid | ||
/// UTF-8 for `StringViewType`). | ||
unsafe fn append_bytes(&mut self, v: &[u8]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could potentially mark this as pub
too but I figured I would keep the proposed API changes small
Yes that is my understanding -- this API just attaches the temporary
I tried doing something more sophisticated (specifically building up the view incrementally, and writing to the buffer) but I gave up due to #6777 (comment) (which I forgot to submit yesterday)
Yeah, it is different than the StringBuilder API. I could add the same |
/// | ||
/// Panics if | ||
/// - String buffer count exceeds `u32::MAX` | ||
/// - String length exceeds `u32::MAX` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You would likely run out of stack space first :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe heap space 🤔
Co-authored-by: Bruce Ritchie <bruce.ritchie@veeva.com>
I believe it is not clear we want to pursue this particular pattern, so marking it as draft for nw |
Which issue does this PR close?
Closes #6373
Rationale for this change
@tlm365 is trying to implement cast from various types to StringView efficiently in
Int
,UInt
, etc) toUtf8View
#6719Also, DataFusion had several places where it would like to write to a StringViewBuilder using standard rust formatting such as
write!
What changes are included in this PR?
Open questions / follow on PRs:
Are there any user-facing changes?
Yes, new APIs + docs