Fast and flexible iteration and filtering of files in `BlobTree`s

I am working on a dataset with many files (over half a million) and with an intricate file structure. I don't need to use all the files at once. In fact, in different phases of the project, I will need to select different subsets of files. For now, I am using information encoded in the file names or in the file path (like the subdirectory under which a file lives) to decide which files to use, but, ideally, I am looking forward to fast and flexible ways to interact with DataSets.jl datasets.

I have created the following toy TOML-embedded dataset to illustrate the case:
```toml
data_config_version = 1

[[datasets]]
description = "Letters data tree embedded in a TOML file"
name = "embedded_letters"
uuid = "b941d775-105a-4606-868b-f81bd02adbe0"
# TOML.print(Dict("datasets" => [Dict("storage" => Dict("data" => Dict("letters" => Dict("a" => Dict("aa" => Dict("aa.file" => "aa file")),"b" => Dict("b.file" => "b file","bb" => Dict("bbb" => Dict("bbb.file" => "bbb file","bbb.filo" => "bbb filo",),),),),),),),],),)

    [datasets.storage]
    driver = "TomlDataStorage"
    type = "BlobTree"

        [datasets.storage.data.letters.b]
        "b.file" = "b file"

            [datasets.storage.data.letters.b.bb.bbb]
            "bbb.file" = "bbb file"
            "bbb.filo" = "bbb filo"

        [datasets.storage.data.letters.a.aa]
        "aa.file" = "aa file"
```
Imagine I need the files such that the string "b.file" is in the file path.

I can take advantage of the abstract tree interface and iterate over [Leaves](https://juliacollections.github.io/AbstractTrees.jl/stable/iteration/#AbstractTrees.Leaves), but this turns out to be very slow.
```julia
using DataSets
using AbstractTrees

ds = open(dataset("embedded_letters"))
for file in Leaves(ds)
    if occursin("b.fil", string(file.path))
        println(file.path)
    end
end
> letters/b/b.file
> letters/b/bb/bbb/bbb.file
> letters/b/bb/bbb/bbb.filo
```

I thought of an alternative approach using the [FileTrees.jl](http://shashi.biz/FileTrees.jl) package. Assuming that I can reproduce the file structure of `embedded_letters` as a FileTrees.jl tree, then I find that filtering files is much faster. (Here, I create the tree manually.)
```julia
using FileTrees
using FilePathsBase

ft = maketree(
    "letters" => [
        "a" => ["aa" => ["aa.file"]],
        "b" => ["b.file", "bb" => ["bbb" => ["bbb.file", "bbb.filo"]]],
    ],
)
ft_filtered = filter(x -> occursin("b.fil", x.name), ft, dirs = false);
for file in files(ft_filtered)
    println(Path(file))
end
> letters/b/b.file
> letters/b/bb/bbb/bbb.file
> letters/b/bb/bbb/bbb.filo
```

I am considering to do the filtering of files using FileTrees.jl, prepare a list of paths as in the example, and finally visit the selected paths in the DataSets.jl dataset. Of course, it would be much better to improve on the DataSets.jl side so that I can avoid the detour.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fast and flexible iteration and filtering of files in `BlobTree`s #69

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Fast and flexible iteration and filtering of files in BlobTrees #69

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Fast and flexible iteration and filtering of files in `BlobTree`s #69