Data Layers #17

Closed
wants to merge 2 commits into from

Conversation

@c42f (Contributor) commented Jun 1, 2021

Data layers allow data of different formats to be mapped into a program through a decoder and presented with a uniform API such that the main program logic can avoid dealing with data format decoding. Instead, the data format can be defined in the Data.toml.

A challenge here is dealing with the world age issues which come up from dynamically `require`ing Julia packages. For now, we include a bit of judicious Base.invokelatest to make things "just work" in the REPL, but also warn the user that they should add a top-level import.
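
For anyone unfamiliar with the problem, here's a minimal sketch of the dynamic-loading-plus-invokelatest pattern. This is an illustration only, with the CSV package name and UUID taken from the warnings below; it is not the actual DataSets internals:

using UUIDs

# Dynamically load a decoder package by name and UUID (the CSV.jl values
# from the warnings below).
pkgid = Base.PkgId(UUID("336ed68f-0bac-5ca0-87d4-7b16caf5d00b"), "CSV")
Base.require(pkgid)              # new methods land in a newer world age
CSV = Base.root_module(pkgid)

# The caller's world age predates the load, so calls into the freshly
# loaded module must go through invokelatest:
table = Base.invokelatest(CSV.File, "data/people.tsv"; delim='\t')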

With this patch and the Data.toml from the tests, we can open several tabular data formats, without the user needing to know much about the data storage.

Here's an example of loading data in .tsv, gzipped .csv, and .arrow formats (any of which could then be converted to a DataFrame thanks to the Tables.jl interface):

julia> @! open(dataset("table_tsv"))
┌ Warning: The package CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b] is required to load your dataset. DataSets will import this module for you, but this may not always work as
│ expected.
│ 
│ To silence this message, add import CSV at the top of your code somewhere.
└                                                            @ DataSets /home/chris/.julia/dev/DataSets/src/layers.jl:32
2-element CSV.File{false}:
 CSV.Row: (Name = "Aaron", Age = 23)
 CSV.Row: (Name = "Harry", Age = 42)

julia> @! open(dataset("table_gzip"))
┌ Warning: The package CodecZlib [944b1d66-785c-5afd-91f1-9de20f533193] is required to load your dataset. DataSets will import this module for you, but this may not always
│ work as expected.
│ 
│ To silence this message, add import CodecZlib at the top of your code somewhere.
└                                                            @ DataSets /home/chris/.julia/dev/DataSets/src/layers.jl:32
2-element CSV.File{false}:
 CSV.Row: (Name = "Aaron", Age = 23)
 CSV.Row: (Name = "Harry", Age = 42)

julia> @! open(dataset("table_arrow"))
┌ Warning: The package Arrow [69666777-d1a9-59fb-9406-91d4454c9d45] is required to load your dataset. DataSets will import this module for you, but this may not always work
│ as expected.
│ 
│ To silence this message, add import Arrow at the top of your code somewhere.
└                                                            @ DataSets /home/chris/.julia/dev/DataSets/src/layers.jl:32
Arrow.Table: (Name = ["Aaron", "Harry"], Age = [23, 42])

Excerpt from Data.toml, showing the configuration required for the system to understand these various formats:

[[datasets]]
description="Simple TSV example"
name="table_tsv"
uuid="efde65c3-a898-4ba9-97c1-45dba64b8d46"

    [datasets.storage]
    driver="FileSystem"
    type="Blob"
    path="@__DIR__/data/people.tsv"

    [[datasets.layers]]
    type = "csv"
    [datasets.layers.parameters]
        delim="\t"

[[datasets]]
description="Gzipped CSV example"
name="table_gzip"
uuid="2d126588-5f76-4e53-8245-87dc91625bf4"

    [datasets.storage]
    driver="FileSystem"
    type="Blob"
    path="@__DIR__/data/people.csv.gz"

    [[datasets.layers]]
    type = "gzip"

    [[datasets.layers]]
    type = "csv"

[[datasets]]
description="Arrow example"
name="table_arrow"
uuid="e964d100-fef2-45c4-85de-9d8e142f4084"

    [datasets.storage]
    driver="FileSystem"
    type="Blob"
    path="@__DIR__/data/people.arrow"

    [[datasets.layers]]
    type = "arrow"

More generally than tabular data, here are some further examples of data which comes encoded in many forms but which we'd like to treat through the same data loader API:

Byte streams:

  • raw
  • gzip
  • xz
  • zstd
  • ...

Images

  • png
  • jpeg
  • tiff
  • ...

Data trees

  • directories
  • zip
  • hdf5
  • ...

@StefanKarpinski (Member) commented:

I thought we'd discussed not using @! and making the context explicit instead.

@c42f (Contributor Author) commented Jun 2, 2021

I thought we'd discussed not using @! and making the context explicit instead.

Yes, but then we decided to use finalizers instead, where possible, and not expose the context to users at all. That's what was implemented in #12 for Blob and BlobTree (which needed to become mutable as a result).

You'll note that #12 contains no mention of ResourceContexts.jl in the documentation update.

Also, the use of @! above is purely optional; explicit context passing is fine too:

ctx = ResourceContext()

data = open(ctx, dataset("table_tsv"))

@c42f (Contributor Author) commented Jun 2, 2021

That's what was implemented in #12 for Blob and BlobTree (which needed to become mutable as a result).

Of course, the issue with the finalizer approach is that it doesn't work with some third-party types such as CSV.File, which are immutable and can't have finalizers attached. Ideas?
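
For context, finalizer itself is the constraint here: it can only be attached to mutable values. A toy illustration, nothing DataSets-specific:

mutable struct MutableHandle
    io::IOBuffer
end

struct ImmutableHandle
    io::IOBuffer
end

finalizer(x -> close(x.io), MutableHandle(IOBuffer()))    # fine: mutable struct
finalizer(x -> close(x.io), ImmutableHandle(IOBuffer()))  # errors: immutable objects cannot have finalizers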

c42f added 2 commits June 2, 2021 10:57
Data layers allow data of different formats to be mapped into a program
through a decoder and presented with a uniform API such that the main
program logic can avoid dealing with data format decoding. Instead, the
data format can be defined in the Data.toml.

A challenge here is dealing with world age issues which come up from
dynamically `require`ing Julia packages. For now, we include a bit of
judicious Base.invokelatest to make things "just work" in the REPL, but
also warn the user that they should add a top-level import.

Here are some examples of data which comes encoded in many forms but which
we'd like to treat through the same API:

Tabular data:

* csv
* gzip.csv
* tsv
* arrow
* parquet
* ...

Byte streams:

* raw
* gzip
* xz
* zstd
* ...

Images

* png
* jpeg
* tiff
* ...

Data trees

* directories
* zip
* hdf5
@StefanKarpinski (Member) commented:

Ideas?

Return a mutable wrapper object, perhaps? Either that or if the object is immutable, throw an error and require the caller to use the explicit context form (or the @! shorthand).

@c42f (Contributor Author) commented Jun 9, 2021

Return a mutable wrapper object, perhaps? Either that or if the object is immutable, throw an error and require the caller to use the explicit context form (or the @! shorthand).

Thanks, I think these are the options. I've been mulling it over but haven't come up with anything else yet.

With wrappers, there seem to be two alternatives:

  • Return something very generic like Ref{T}.
    • Pro: Works for all types.
    • Con: Doesn't have a useful API; it must be unwrapped to do anything. Quite clumsy, and unlike the API for types which happen to be mutable and don't need wrapping.
    • Con: After unwrapping, users will want to drop the wrapper, at which point their resources will be closed out from under them.
  • Return a wrapper with the right API, for example a hypothetical WrappedTable for tabular data.
    • Pro: User friendly.
    • Con: Lots of wrappers to implement; doesn't easily scale to many disparate packages.
    • Con: The correct API for a wrapper may be unclear. In the extreme, it is just an exact duplicate of the wrapped object's.

Altogether, wrappers don't seem very appealing. I'm inclined to just error and direct the user to the explicit context-based API for the generic code path.

As a hybrid, we could implement a few wrappers for APIs which are relatively well defined and commonly used, e.g., tables.
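
For concreteness, the generic-wrapper option might look something like this; ClosableResult and its cleanup field are hypothetical names, not anything that's implemented:

# A mutable shell that carries the finalizer which the wrapped (possibly
# immutable) object cannot carry itself.
mutable struct ClosableResult{T}
    value::T
    cleanup::Function        # whatever releases the underlying resources
    function ClosableResult(value, cleanup)
        r = new{typeof(value)}(value, cleanup)
        finalizer(x -> x.cleanup(), r)
        return r
    end
end

Base.getindex(r::ClosableResult) = r.value   # Ref-style unwrapping: r[]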

@StefanKarpinski (Member) commented:

Honestly it seems most appealing to me to just always require the context object. Once people learn to do this it will always work.

@layne-sadler commented Aug 16, 2021

hmm. so I could have a file that is encrypted + compressed, and layers would allow the program to peel this back to handle that on the fly? what other types of preprocessing could be layers? user-defined layers?

@c42f (Contributor Author) commented Aug 17, 2021

so I could have a file that is encrypted + compressed, and layers would allow the program to peel this back to handle that on the fly?

Yes, this should be possible. I think the interesting/tricky thing here is having a way to provide parameters to layers. In particular, how would we inject the decryption keys in a secure way? I suppose these are logically a property of the DataSet, but you also don't want to leave keys lying around in memory.
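
For illustration only, an encryption stage would presumably slot in as just another layer entry in Data.toml. The "encryption" layer type and its parameters here are hypothetical, and the key itself would have to be injected at runtime rather than stored in the file:

    [[datasets.layers]]
    type = "encryption"
    [datasets.layers.parameters]
        cipher = "aes-256-gcm"

    [[datasets.layers]]
    type = "gzip"

    [[datasets.layers]]
    type = "csv"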

what other types of preprocessing could be layers?

Anything that represents a linear pipeline of decoding stages could be expressed this way. (Conversely, more general DAGs cannot be expressed as cleanly; the whole DAG would have to be represented as a single non-composable layer.)

user-defined layers?

Yes, in this PR the user should be able to define their own layer by calling DataSets.register_layer! in their third-party module (probably as part of the module's __init__ function) and defining a method with the signature open(layer::DataLayer{:users_custom_tag}, blob::Blob).
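
Roughly, a third-party package might then look like the following. The :users_custom_tag layer, decode_custom, and the exact arguments to register_layer! are placeholders; only the open signature above is from this PR:

module UsersCustomLayer

using DataSets

# Decoder method with the signature described above; decode_custom stands in
# for the package's own format decoder.
function Base.open(layer::DataSets.DataLayer{:users_custom_tag}, blob::DataSets.Blob)
    open(IO, blob) do io
        decode_custom(io)
    end
end

function __init__()
    # Exact arguments are assumed here; they're not spelled out in this thread.
    DataSets.register_layer!(:users_custom_tag)
end

end # module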

@mortenpi (Member) commented:

I'll go ahead and close this PR, since I don't think we'll merge it. But the branch and discussion will stay around for future reference.

@mortenpi closed this Nov 30, 2023