Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redesign ReadStatMeta and add ReadStatColMeta for DataAPI.jl v1.13 #6

Merged
merged 7 commits into from
Nov 17, 2022

Conversation

nalimilan
Copy link
Contributor

@nalimilan nalimilan commented May 27, 2022

Use the metadata function that is going to be added to DataAPI (JuliaData/DataFrames.jl#3055) to expose the contents of ReadStatMeta. This will allow e.g. DataFrame to import the metadata.

The implementation is not super efficient as a new Dict is allocated for each column. If this turns out to be a problem it would be possible to create a special AbstractDict type that would expose the data without allocating, but this may be overkill. Also, if users mutate the returned metadata Dicts, the changes will have no effect on ReadStatMeta, except if they mutate the Dicts holding value labels. We could copy these Dicts for safety, as mutating them for one variable will affect all other variables that use the same value labels.

Cc: @bkamins

@bkamins
Copy link

bkamins commented May 27, 2022

@nalimilan Regarding performance, as you have mentioned this several times. Do you think anyone will notice such delay if one has 10,000 column data frame to work with:

julia> @benchmark Dict(i => Dict{String, Any}() for i in 1:10_000)
BenchmarkTools.Trial: 4141 samples with 1 evaluation.
 Range (min … max):  921.400 μs …   7.387 ms  ┊ GC (min … max):  0.00% … 69.63%
 Time  (median):     979.800 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.204 ms ± 610.899 μs  ┊ GC (mean ± σ):  17.04% ± 20.13%

  ▇█▆▄▂▂                    ▂                      ▁            ▁
  ████████▇▅▆▅▅▄▅▃▄▃▅▁▁▃▁▁▃███▇▇██▇▇▅▅▅▅▅▃▅▃▃▅▃▆▇█████▇▆▆▅▅▆▅▅▆ █
  921 μs        Histogram: log(frequency) by time       3.54 ms <

 Memory estimate: 5.15 MiB, allocs estimate: 40007.

@bkamins
Copy link

bkamins commented May 27, 2022

(note - we do not materialize it in DataFrames.jl unless needed so normally even this cost is not paid)

@nalimilan
Copy link
Contributor Author

An overhead of about 1ms sounds reasonable (and 10,000 columns is relatively large). But it would be worth benchmarking the PR to check this.

@junyuan-chen
Copy link
Owner

Thanks for the great work on metadata! With the v1.12 release of DataAPI.jl and v1.4 release of DataFrames.jl, now it is the time to adopt the new interface.

Since the changes made here are already outdated, I am starting to rewrite the part on metadata. There is no good reason to keep the old design with ReadStatMeta. So, I will simply make some breaking changes that will lead to v0.2.

@bkamins
Copy link

bkamins commented Nov 6, 2022

@junyuan-chen thank you! Just please note that in JuliaData/DataAPI.jl#56 some minimal extensions to the metadata API are added (nothing breaking, just small changes that make using metadata more convenient; they should be merged probably in the coming week). Thank you!

@junyuan-chen
Copy link
Owner

@bkamins Thanks for mentioning that! I will keep that in mind.

Use the `metadata` function recently added to DataAPI to expose
the contents of `ReadStatMeta`. This will allow e.g. `DataFrame` to import
the metadata.

The implementation is not super efficient as a new `Dict` is allocated for
each column. If this turns out to be a problem it would be possible to
create a special `AbstractDict` type that would expose the data without
allocating, but this may be overkill. Also, if users mutate the returned
metadata `Dict`s, the changes will have no effect on `ReadStatMeta`,
except if they mutate the `Dicts` holding value labels.
Project.toml Outdated
@@ -13,7 +13,7 @@ Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"

[compat]
CategoricalArrays = "0.10"
DataAPI = "1.6"
DataAPI = "1.11"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prepare this PR against DataAPI.jl 1.13 (it will be released this week)

src/table.jl Outdated
@@ -123,6 +123,53 @@ Retrieve the metadata parsed from a data file.
"""
getmeta(tb::ReadStatTable) = getfield(tb, :meta)

hasmetadata(tb::ReadStatTable) = getmeta(tb) !== nothing
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be metadatasupport and it takes type as an argument.

src/table.jl Outdated
hasmetadata(tb::ReadStatTable) = getmeta(tb) !== nothing
hasmetadata(tb::ReadStatTable, col::Symbol) = getmeta(tb) !== nothing && haskey(tb, col)

function metadata(tb::ReadStatTable)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also style kwarg should be supported.

src/table.jl Outdated
return ret
end

function metadata(tb::ReadStatTable, col::Symbol)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be colmetadata

@bkamins
Copy link

bkamins commented Nov 7, 2022

@nalimilan - do you want me to update this PR after DataAPI.jl 1.13 is out, or you want to do it (to practice the API 😄)?

@junyuan-chen junyuan-chen marked this pull request as draft November 7, 2022 15:53
@junyuan-chen
Copy link
Owner

@nalimilan - do you want me to update this PR after DataAPI.jl 1.13 is out, or you want to do it (to practice the API 😄)?

@bkamins Thanks for looking into that! But, I forgot to mention that the one I pushed yesterday is still the old one written back in May. I have started rewriting it for the DataAPI.jl v1.13 and turned this PR to draft.

@codecov
Copy link

codecov bot commented Nov 9, 2022

Codecov Report

Merging #6 (d47bc11) into main (6b93b77) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main        #6   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            5         5           
  Lines          237       300   +63     
=========================================
+ Hits           237       300   +63     
Impacted Files Coverage Δ
src/ReadStatTables.jl 100.00% <ø> (ø)
src/readstat.jl 100.00% <100.00%> (ø)
src/table.jl 100.00% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@junyuan-chen junyuan-chen marked this pull request as ready for review November 9, 2022 03:13
@junyuan-chen
Copy link
Owner

junyuan-chen commented Nov 9, 2022

My attempt on the rewrite is now complete (except for documentation pages).

Here is a summary of main changes:

  1. A ReadStatTable always contains metadata with a fixed set of keys and value types that are processed by ReadStat.
  2. ReadStatMeta is split into ReadStatMeta and ReadStatColMeta for file-level and variable-level metadata respectively. They are the types for objects returned by metadata and colmetadata and can be iterated as key-value pairs.
  3. ReadStatColMeta is immutable. However, with StructVector, the metadata for each column can still be modified in-place as the underlying data are still stored in arrays.
  4. Considering the changes to be included in v1.13 of DataAPI.jl, additional iterators MetaStyleView and ColMetaIterator are defined for lazy behavior that does not require allocating a Dict. The former is for the case when metadata style is needed; the latter is for iterating metadata by column.
  5. Additionally, values of column-specific metadata for a specific key across different columns can be iterated using the array returned by colmetavalues.
  6. It is possible to specify the styles of metadata (to be anything other than :default). But, for column-specific metadata, the style for the metadata with the same key has to be the same across all columns as the style returned for metadata is only determined by the key but not the column. (This design is for simplicity and should suffice for typical usage.)

@nalimilan @bkamins

@junyuan-chen junyuan-chen changed the title Expose metadata via DataAPI.metadata Redesign ReadStatMeta and add ReadStatColMeta for DataAPI.jl v1.13 Nov 17, 2022
@junyuan-chen junyuan-chen added the breaking Breaking changes are involved label Nov 17, 2022
@junyuan-chen junyuan-chen merged commit a3cc59b into junyuan-chen:main Nov 17, 2022
@nalimilan nalimilan deleted the nl/metadata branch December 4, 2022 22:16
@nalimilan
Copy link
Contributor Author

Sorry, I had missed this. Thanks for finishing it!

It would make sense to mark labels with the :note style by default IMO, as these are the typical usage for :note.

@bkamins
Copy link

bkamins commented Dec 4, 2022

This is what RData.jl also does - right?

@junyuan-chen
Copy link
Owner

It would make sense to mark labels with the :note style by default IMO, as these are the typical usage for :note.

@nalimilan @bkamins Sure. Let me do this change for the next release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking changes are involved enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants