Skip to content

Latest commit

 

History

History
83 lines (64 loc) · 3.52 KB

README.md

File metadata and controls

83 lines (64 loc) · 3.52 KB

TeaFiles

Stable Dev Build Status Coverage Code Style: Blue

Implement the TeaFile format. It is row-oriented, binary, and primarily intended for time-series data.

The primary API is compatible with the Tables.jl interface.

Example

We create the following toy dataset:

using Dates
using DataFrames
x = DataFrame(t=[DateTime(2000), DateTime(2001), DateTime(2002)], a=[1, 2, 3], b=[10.0, 20.0, 30.0])

This produces the following table:

3×3 DataFrame
 Row │ t                    a      b       
     │ DateTime             Int64  Float64 
─────┼─────────────────────────────────────
   1 │ 2000-01-01T00:00:00      1     10.0
   2 │ 2001-01-01T00:00:00      2     20.0
   3 │ 2002-01-01T00:00:00      3     30.0

To write this to disk, we use TeaFiles.write. A tea file contains a header with various metadata, including column names and types which are automatically inferred from the table's schema. Other supported metadata can be specified with optional arguments to TeaFiles.write.

Note that the first column of DateTime type, if present, is used as the primary index for the tea file. As such the values therein must be non-decreasing in order to comply with the specification.

using TeaFiles
TeaFiles.write("moo.tea", x)

The data can be read back with TeaFiles.read, which returns a Tables-compatible object. We can pipe this into the DataFrame constructor to get an object that is equal to the origianl.

TeaFiles.read("moo.tea") |> DataFrame

Reading a sub-interval

If there is a time column, it is guaranteed that its values will be non-decreasing. We can therefore efficiently read a small time interval in a large file by performing a binary search to find the start point. One can specify this interval as an argument to TeaFiles.read, for example:

y = TeaFiles.read("moo.tea"; lower=DateTime(2001)) |> DataFrame
println(y)

gives:

2×3 DataFrame
 Row │ t                    a      b       
     │ DateTime             Int64  Float64 
─────┼─────────────────────────────────────
   1 │ 2001-01-01T00:00:00      2     20.0
   2 │ 2002-01-01T00:00:00      3     30.0

Notes

  • We define the epoch relative to 0001-01-01. The specification states that the reference is 0000-01-01, however this seems to be an error. The example given within the specification, and Python & .NET implementations by DiscreteLogics, are consistent with a reference of 0001-01-01.

  • The specification makes no mention of time zones, and therefore we work with time-zone naive DateTime objects in Julia. Users are recommended to store times in UTC to avoid ambiguities around DST changepoints.

  • We do not plan to support the .NET decimal type (type code 0x200 in the standard).