Description
Describe the bug
The list / [list_with_offset](https://docs.rs/object_store/0.12.0/object_store/trait.ObjectStore.html#method.list_with_offset) functions on LocalFileSystem stores do not (reliably) return results in sorted order.
The documentation for this interface states:
Note: the order of returned ObjectMeta is not guaranteed
so the current behavior is consistent with the docs, but I believe it is not useful:
- offset filtering/listing works by skipping any input objects that are (lexicographically) smaller than the offset
- in order to use offsets to list all the objects over multiple iterations, each iteration must return a slice of the overall sorted results
- if I have such a slice, its last element is the correct offset to use to get the next slice/batch
- if results are in a random order overall, offset listing is impossible to use. I can never find any useful offset value out of the ones I received
- unless I list everything and sort it then. But that defeats the purpose
- object stores typically list results in sorted order specifically to allow for offset listing
- Having a usable offset listing feature is key for object_store to achieve its design goal of a stateless API; the only alternative for users is to use a stateful iterator.
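The reasoning above can be illustrated with a small in-memory model. This is only a sketch: `list_with_offset` and `paginate` here are hypothetical stand-ins written for this example, not the object_store API. The one assumption carried over is that an offset listing skips every entry that is not lexicographically greater than the offset.

```rust
/// Hypothetical model of offset listing: yield the entries strictly greater
/// than `offset`, in whatever order the backing store happens to produce them.
fn list_with_offset<'a>(
    store: &'a [&'a str],
    offset: Option<&'a str>,
) -> impl Iterator<Item = &'a str> + 'a {
    store
        .iter()
        .copied()
        .filter(move |p| offset.map_or(true, |o| *p > o))
}

/// Stateless pagination: read `batch` entries, then restart the listing
/// using the last entry seen as the next offset.
fn paginate(store: &[&str], batch: usize) -> Vec<String> {
    assert!(batch > 0, "batch size must be positive");
    let mut seen = Vec::new();
    let mut offset: Option<String> = None;
    loop {
        let page: Vec<String> = list_with_offset(store, offset.as_deref())
            .take(batch)
            .map(str::to_owned)
            .collect();
        // A short page means the listing is exhausted.
        let done = page.len() < batch;
        if let Some(last) = page.last() {
            offset = Some(last.clone());
        }
        seen.extend(page);
        if done {
            break;
        }
    }
    seen
}

fn main() {
    // A store that lists in sorted order: pagination visits every entry.
    let sorted = ["a", "b", "c", "d", "e", "f", "g"];
    println!("sorted store:   {} of 7 entries seen", paginate(&sorted, 3).len());

    // The same entries in arbitrary order: the first page ends at "g",
    // so the second page is empty and four entries are never returned.
    let shuffled = ["e", "a", "g", "c", "b", "f", "d"];
    println!("shuffled store: {:?}", paginate(&shuffled, 3));
}
```

With the sorted store all seven entries are visited. With the shuffled store the first page is `e, a, g`, the next offset becomes `g`, and nothing else is ever returned, which is the failure mode described in the list above.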
To Reproduce
generate a set of files:
mkdir -p /tmp/manyfiles/
cd /tmp/manyfiles/
for i in $(seq 1 5000); do
echo "hello world" > "hello world.txt.$i"
done
Run this code to show the (random) ordering and show that offset listing isn't usable:
    use futures::StreamExt;
    use object_store::local::LocalFileSystem;
    use object_store::path::Path;
    use object_store::{ObjectStore, Result};

    #[tokio::main]
    async fn main() -> Result<()> {
        let store = LocalFileSystem::new_with_prefix("/tmp/manyfiles")?;
        // list() returns a Stream of Result<ObjectMeta>
        let mut list_stream = store.list(None);
        // pull each ObjectMeta out of the stream
        while let Some(result) = list_stream.next().await {
            let meta = result?;
            // print its path and modification time
            println!("-> {} - {}", meta.location, meta.last_modified);
        }
        println!("\nListing files in batches of 10 using list_with_offset...");
        let mut offset: Option<String> = None;
        loop {
            // Choose list or list_with_offset based on whether we have an offset
            let mut batch_stream = if let Some(ref off) = offset {
                store.list_with_offset(None, &Path::from(off.as_str()))
            } else {
                store.list(None)
            };
            let mut count = 0;
            let mut last_path: Option<String> = None;
            while let Some(result) = batch_stream.next().await {
                let meta = result?;
                println!("-> {} - {}", meta.location, meta.last_modified);
                count += 1;
                last_path = Some(meta.location.to_string());
                if count >= 10 {
                    break;
                }
            }
            // Stop if fewer than 10 items were returned
            if count < 10 {
                break;
            }
            // Update offset for the next batch
            offset = last_path;
        }
        Ok(())
    }
Here is an example of the results:
cargo run
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.26s
Running `target/debug/objectstore-list-bug`
-> hello world.txt.822 - 2025-05-29 19:55:29.630128009 UTC
-> hello world.txt.3678 - 2025-05-29 19:55:29.979447372 UTC
-> hello world.txt.1913 - 2025-05-29 19:55:29.765314276 UTC
-> hello world.txt.2598 - 2025-05-29 19:55:29.844496358 UTC
-> hello world.txt.1779 - 2025-05-29 19:55:29.749811871 UTC
-> hello world.txt.3812 - 2025-05-29 19:55:29.994586274 UTC
...
Expected behavior
I should be able to use offset listing with local files just like it works with S3 and other stores.
Additional context
I found this behavior on macOS and Red Hat Linux.
A POC for a possible fix is here: https://github.com/apache/arrow-rs-object-store/pull/389/files