Reading data chunk by chunk #987
Thanks for reaching out and for the explanation, very interesting! You understood it absolutely right: I am assuming that the files are a custom format (i.e. not parquet) and that you know how to read them. In that case, the most performant way to create a primitive array is to read into a `Vec<T>`, e.g.:

```rust
fn read_uncompressed_buffer<T: NativeType, R: Read>(
    reader: &mut R,
    length: usize,
    is_little_endian: bool,
) -> Result<Vec<T>> {
    let bytes = length * std::mem::size_of::<T>();
    // it is undefined behavior to call read_exact on un-initialized memory,
    // https://doc.rust-lang.org/std/io/trait.Read.html#tymethod.read
    // see also https://github.com/MaikKlein/ash/issues/354#issue-781730580
    let mut buffer = vec![T::default(); length];
    if is_native_little_endian() == is_little_endian {
        // fast case where we can just copy the contents as-is
        let slice = bytemuck::cast_slice_mut(&mut buffer);
        reader.read_exact(slice)?;
    } else {
        read_swapped(reader, length, &mut buffer, is_little_endian)?;
    }
    Ok(buffer)
}
fn read_swapped<T: NativeType, R: Read>(
    reader: &mut R,
    length: usize,
    buffer: &mut Vec<T>,
    is_little_endian: bool,
) -> Result<()> {
    // slow case where we must byte-swap every value
    let mut slice = vec![0u8; length * std::mem::size_of::<T>()];
    reader.read_exact(&mut slice)?;

    let chunks = slice.chunks_exact(std::mem::size_of::<T>());
    if !is_little_endian {
        // machine is little endian, file is big endian
        buffer
            .as_mut_slice()
            .iter_mut()
            .zip(chunks)
            .try_for_each(|(slot, chunk)| {
                let a: T::Bytes = match chunk.try_into() {
                    Ok(a) => a,
                    Err(_) => unreachable!(),
                };
                *slot = T::from_be_bytes(a);
                Result::Ok(())
            })?;
    } else {
        // machine is big endian, file is little endian:
        // the mirror of the branch above, using from_le_bytes
        buffer
            .as_mut_slice()
            .iter_mut()
            .zip(chunks)
            .try_for_each(|(slot, chunk)| {
                let a: T::Bytes = match chunk.try_into() {
                    Ok(a) => a,
                    Err(_) => unreachable!(),
                };
                *slot = T::from_le_bytes(a);
                Result::Ok(())
            })?;
    }
    Ok(())
}
```
In other words, read the data into a `Vec<T>` and then convert it into a `PrimitiveArray` at no cost. Does it make sense? For `Utf8Array` (say, a string column) the idea is similar: read the values and offsets into buffers and build the array from them.
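For the primitive case, the conversion is essentially the following (a minimal sketch, assuming arrow2's `PrimitiveArray::from_vec`, which takes ownership of the `Vec` without copying):

```rust
use arrow2::array::PrimitiveArray;

// O(1): the Vec's allocation becomes the array's values buffer
fn to_array(values: Vec<f64>) -> PrimitiveArray<f64> {
    PrimitiveArray::from_vec(values)
}
```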
Thanks for your answer :)
Based on your advice I will focus on reading into `Vec`, which can easily be turned into a `PrimitiveArray` for 1D arrays. I will have to study the API further to then convert into complex and n-dimensional arrays.
A useful way of thinking about arrow2 is that every array is backed by contiguous buffers, one per column. That is why to build a `PrimitiveArray` we can just use a `Vec<T>`.

Your format is row/record-based. In that case we always need to perform a transposition of the data, as you are showing, which means that we can't load multiple values in one go as I was describing. One way we do transpositions in this case is to declare "mutable columns" and push to them item by item per record. This allows us to traverse all records only once. We do something similar for CSV, see here. That is essentially the pattern sketched below (`Row` and its fields are hypothetical stand-ins for your record type):
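```rust
use arrow2::array::{MutablePrimitiveArray, MutableUtf8Array, PrimitiveArray, Utf8Array};

// hypothetical record type of a row-based format
struct Row {
    id: i64,
    name: String,
}

fn rows_to_columns(rows: &[Row]) -> (PrimitiveArray<i64>, Utf8Array<i32>) {
    // one mutable column per field
    let mut ids = MutablePrimitiveArray::<i64>::with_capacity(rows.len());
    let mut names = MutableUtf8Array::<i32>::with_capacity(rows.len());
    // single pass over the records: transpose rows -> columns
    for row in rows {
        ids.push(Some(row.id));
        names.push(Some(row.name.as_str()));
    }
    // freezing a mutable array into an immutable one does not copy
    (ids.into(), names.into())
}
```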
Thanks. Also, the file I am reading can contain ndarrays, changing with timestamp. I looked at tensor in apache/arrow and several issue tickets. It seems it is not well supported or tested yet, so I suppose you are not implementing it yet. But could you advise which type is most suitable for compatibility with Pandas or Polars? I have seen FixedSizeBinary in Jira, but it seems weird to use binary if the type is already known.
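One option for fixed-shape ndarrays is a `FixedSizeListArray` over a flat primitive buffer; a minimal sketch, assuming a recent arrow2 where arrays are constructed with `new` (older releases used `from_data`):

```rust
use arrow2::array::{FixedSizeListArray, PrimitiveArray};
use arrow2::datatypes::{DataType, Field};

// each logical value is a fixed-shape vector of `size` f64s,
// stored back-to-back in one flat values buffer
fn vectors_to_array(flat: Vec<f64>, size: usize) -> FixedSizeListArray {
    let values = PrimitiveArray::<f64>::from_vec(flat);
    let field = Box::new(Field::new("item", DataType::Float64, true));
    let data_type = DataType::FixedSizeList(field, size);
    FixedSizeListArray::new(data_type, Box::new(values), None)
}
```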
While trying to use MutableArray, I am facing issues with missing trait implementations, mainly PartialEq and Default for MutableFixedSizeListArray.
Nice, that is interesting! I am also trying to improve the API for COW semantics atm! wrt Default and PartialEq - no reason, just forgotten - let's add them! Would you like to do it, or would you like me to?
Hi, I am currently integrating arrow2 further into mdfr.
I have a working crate that extracts data from a file into ndarray, later exposed with PyO3 for big data, but I would like to migrate it to arrow instead to benefit from its performance and ecosystem.
The data blocks read are row-based, and I currently read several columns in parallel with rayon into already-allocated ndarrays from a chunk of data in memory (not the whole data block, in order to avoid consuming too much memory). Types are mainly primitives, but also utf8, complex, and multi-dimensional arrays of primitives or complex. File bytes can be little- or big-endian.
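A minimal sketch of that setup, assuming a hypothetical fixed-width record layout (each record holds `n_columns` consecutive little-endian f64 values):

```rust
use rayon::prelude::*;

// decode one column from a chunk of whole records
fn decode_column(chunk: &[u8], n_columns: usize, col: usize) -> Vec<f64> {
    chunk
        .chunks_exact(n_columns * 8)
        .map(|record| {
            let start = col * 8;
            let bytes: [u8; 8] = record[start..start + 8].try_into().unwrap();
            f64::from_le_bytes(bytes)
        })
        .collect()
}

// decode all columns of the chunk in parallel
fn decode_chunk(chunk: &[u8], n_columns: usize) -> Vec<Vec<f64>> {
    (0..n_columns)
        .into_par_iter()
        .map(|col| decode_column(chunk, n_columns, col))
        .collect()
}
```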
Can I have your advice on the best method to read this data with arrow2? I have already considered a few thoughts and possible directions.