LocalFileSystem: offset for list_with_offset can't be identified / List results *must* be sorted #388

@timwaizenegger

Describe the bug

The list / [list_with_offset](https://docs.rs/object_store/0.12.0/object_store/trait.ObjectStore.html#method.list_with_offset) functions on LocalFileSystem stores do not reliably return results in sorted order.
The docs for this interface state:

Note: the order of returned ObjectMeta is not guaranteed

so the behavior is consistent with the docs, but I believe it is not useful:

  • offset filtering/listing works by skipping any input objects that are (lexicographically) smaller than the offset
  • in order to use offsets to list all objects over multiple iterations, each iteration must return a slice of the sorted overall results
  • if I have such a slice, its last element is the correct offset for getting the next slice/batch
  • if results are in a random order overall, offset listing is impossible to use: none of the values I received is a usable offset
    • unless I list everything and sort it myself, but that defeats the purpose
  • object stores typically list results in sorted order specifically to allow for offset listing
  • having a usable offset listing feature is key for object_store to achieve its design goal of a stateless API; the only alternative for users is a stateful iterator
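To make the argument above concrete, here is a minimal std-only sketch of offset paging. It does not use object_store; `list_with_offset` and `page_all` below are hypothetical stand-ins, and the sketch assumes the offset filter keeps only entries strictly greater than the offset and yields them in whatever order the underlying listing has.

```rust
/// Hypothetical stand-in for a store listing (not the object_store API):
/// yields entries in the order given, keeping only those strictly
/// greater than `offset`, and stops after `limit` entries.
fn list_with_offset<'a>(
    entries: &'a [&'a str],
    offset: Option<&str>,
    limit: usize,
) -> Vec<&'a str> {
    entries
        .iter()
        .copied()
        .filter(|e| offset.map_or(true, |o| *e > o))
        .take(limit)
        .collect()
}

/// Pages through the listing, using the last entry of each batch
/// as the offset for the next one.
fn page_all<'a>(entries: &'a [&'a str], batch: usize) -> Vec<&'a str> {
    let mut seen = Vec::new();
    let mut offset: Option<&'a str> = None;
    loop {
        let page = list_with_offset(entries, offset, batch);
        seen.extend(page.iter().copied());
        // A short (or empty) page means the listing is exhausted.
        if page.is_empty() || page.len() < batch {
            break;
        }
        offset = page.last().copied();
    }
    seen
}
```

With a sorted listing such as `["a", "b", "c", "d", "e"]` and a batch size of 2, `page_all` visits every entry exactly once. With the same entries in the order `["d", "a", "e", "b", "c"]`, the first batch ends on `"a"`, so the second batch starts over at `"d"`, and `"b"` and `"c"` are never returned.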

To Reproduce

Generate a set of files:

mkdir -p /tmp/manyfiles/
cd /tmp/manyfiles/
for i in $(seq 1 5000); do
    echo "hello world" > "hello world.txt.$i"
done

Run this code to show the (random) ordering and show that offset listing isn't usable:

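use futures::StreamExt;
use object_store::{local::LocalFileSystem, path::Path, ObjectStore, Result};

#[tokio::main]
async fn main() -> Result<()> {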
    let store = LocalFileSystem::new_with_prefix("/tmp/manyfiles")?;

    // list() returns a Stream of Result<ObjectMeta>
    let mut list_stream = store.list(None);

    // pull each ObjectMeta out of the stream
    while let Some(result) = list_stream.next().await {
        let meta = result?;
        // print its path
        println!("-> {} - {}", meta.location.to_string(), meta.last_modified.to_string());
    }



    println!("\nListing files in batches of 10 using list_with_offset...");
    let mut offset: Option<String> = None;
    loop {
        // Choose list or list_with_offset based on whether we have an offset
        let mut batch_stream = if let Some(ref off) = offset {
            store.list_with_offset(None, &Path::from(off.as_str()))
        } else {
            store.list(None)
        };

        let mut count = 0;
        let mut last_path: Option<String> = None;
        while let Some(result) = batch_stream.next().await {
            let meta = result?;
            println!("-> {} - {}", meta.location.to_string(), meta.last_modified.to_string());
            count += 1;
            last_path = Some(meta.location.to_string());
            if count >= 10 {
                break;
            }
        }
        // Stop if fewer than 10 items were returned
        if count < 10 {
            break;
        }
        // Update offset for the next batch
        offset = last_path;
    }

    Ok(())
}

Here is an example of the results:

cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.26s
     Running `target/debug/objectstore-list-bug`
-> hello world.txt.822 - 2025-05-29 19:55:29.630128009 UTC
-> hello world.txt.3678 - 2025-05-29 19:55:29.979447372 UTC
-> hello world.txt.1913 - 2025-05-29 19:55:29.765314276 UTC
-> hello world.txt.2598 - 2025-05-29 19:55:29.844496358 UTC
-> hello world.txt.1779 - 2025-05-29 19:55:29.749811871 UTC
-> hello world.txt.3812 - 2025-05-29 19:55:29.994586274 UTC
...

Expected behavior

I should be able to use offset listing with local files just like it works with S3 and other stores.
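As an illustration of the expected behavior, the following std-only sketch sorts a snapshot of the listing before applying the offset, which makes each batch a slice of the sorted overall results. `sorted_page` is a hypothetical helper, not part of object_store and not necessarily what the linked POC does; it also sorts the full listing on every call, which is exactly the cost a caller-side workaround would have to pay.

```rust
/// Hypothetical helper (not the object_store API): sorts a snapshot of
/// the listing lexicographically, then keeps up to `limit` entries that
/// are strictly greater than `offset`.
fn sorted_page(entries: &[String], offset: Option<&str>, limit: usize) -> Vec<String> {
    let mut sorted: Vec<&String> = entries.iter().collect();
    sorted.sort();
    sorted
        .into_iter()
        .filter(|e| offset.map_or(true, |o| e.as_str() > o))
        .take(limit)
        .cloned()
        .collect()
}
```

Given the entries `d, a, e, b, c` and a limit of 2, successive calls with offsets `None`, `"b"`, and `"d"` return `[a, b]`, `[c, d]`, and `[e]`, so the last element of each batch is always a valid offset for the next one. Note that the order is lexicographic, so a name ending in `.10` sorts before one ending in `.2`.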

Additional context

I found this behavior on macOS and Red Hat Linux.

A POC for a possible fix is here: https://github.com/apache/arrow-rs-object-store/pull/389/files

Labels: enhancement