Skip to content

[EPIC] Proposal for Date/Time enhancement #3100

Closed
@waitingkuo

Description

@waitingkuo

!!! Please correct me if i'm wrong !!!

Intro

Design Principle

  1. align with postgresql in most cases
  2. use nanoseconds for the default timestamp resolution (unlike postgresql and pyspark's microseconds)
  3. use utc+0 as default timezone for timestamp with time zone

Let's Begin with Postgresql's Date/Time

image

Let's Start to Compare


Timestamp

Postgresql

  • uses 8 bytes for both timestamp and timestamp with time zone. (note that time zone is included in these 8 bytes)
  • uses microsecond as resolution, which is the number of microseconds from 1970-01-01T00:00:00
  • has 7 kinds of floating point precision, from timestamp(0) to timestamp(6)
    • timestamp(0) rounds to seconds
    • timestamp(3) rounds to milliseconds
    • timestamp(6) rounds to microseconds
  • timestamp 'xxx' output timestamp
  1. if xxx does't contain time zone info, it just works as what you think
willy=# select timestamp '2000-01-01T00:00:00';
      timestamp      
---------------------
 2000-01-01 00:00:00
(1 row)
  1. if xxx contains time zone info, time zone is just ignored. (i believe that this is a surprise for some people) e.g.
willy=# select timestamp '2000-01-01T00:00:00+08:00';
      timestamp      
---------------------
 2000-01-01 00:00:00
(1 row)
  • timestamp with time zone 'xxx' output timestamp with time
    1 if xxx contains no time zone, it assume it's local time
willy=# select timestamp with time zone '2000-01-01T00:00:00';
      timestamptz       
------------------------
 2000-01-01 00:00:00+08
(1 row)

2 if xxx contains time zone, it'll be converted to your local time zone

willy=# select timestamp with time zone '2000-01-01T00:00:00+02:00';
      timestamptz       
------------------------
 2000-01-01 06:00:00+08
(1 row)

Datafusion

  • Timestamp(TimeUnit, Option<String>)
    • we have
      • TimeUnit::Second
      • TimeUnit::MilliSecond
      • TImeUnit::MicroSecond
      • TimeUnit::NanoSecond
    • which store number of seconds/millis/micros/nanos from 1970-01-01T00:00:00
    • most of the timestamp related functions output Timestamp(TimeUnit::NanoSecond, None)
  • We only have timestamp literal but no timestamp with time zone
  • timestamp xxx outputs Timestamp(TimeUnit::NanoSecond, None)
  1. if xxx contains no time zone, it automatically applies local time, parse it, convert it to utc time zone, and then drop the time zone Should Cast(UTF-8 AS Timestamp) apply local time zone? #3080
select cast('2000-01-01T00:00:00' as timestamp);
+------------------------------------------------------------------+
| CAST(Utf8("2000-01-01T00:00:00") AS Timestamp(Nanosecond, None)) |
+------------------------------------------------------------------+
| 1999-12-31 16:00:00                                              |
+------------------------------------------------------------------+
1 row in set. Query took 0.004 seconds.
  1. if xxx contains time zone, it's parsed correctly, then converted to utc time zone, and then drop the time zone
❯ select timestamp '2000-01-01T00:00:00+02:00';
+------------------------------------------------------------------------+
| CAST(Utf8("2000-01-01T00:00:00+02:00") AS Timestamp(Nanosecond, None)) |
+------------------------------------------------------------------------+
| 1999-12-31 22:00:00                                                    |
+------------------------------------------------------------------------+
1 row in set. Query took 0.000 seconds.

Proposal


Date

postgresql image

Posgresql

  • supported these formats
    image

Datafusion

  • among all the format above, we only support the first 1999-01-08

Proposal

  • I don't think there're any issues for Date
  • We could consider add another 2 ISO 8601 formats (i.e. 19990108 and 990108) Chrono strictly follows ISO 8601. I think supporting all 8601 date formats makes sense.

Time

Postgresql

  • time xxx output time that requires 8 bytes
  • time xxx with time zone that requires 12 bytes, I have no idea why we need 4 more bytes here since timestamp with time zone only requires 8 bytes

Datafusion

Proposal


Interval

Postgresql

  • requires 16 bytes
  • resolution as microseconds
  • the outputs for different operators & inputs
    image
    image

Datafusion

reference: https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs#L237

  • Interval(IntervalUnit)
  • we have following units
    • IntervalUnit::YearMonth
      • number of months
      • stored as 32-bit integer
    • IntervalUnit::DayTime
      • stored as 2 contiguous 32-bit integers (days, millisseconds), 8 bytes in total
    • IntervalUnit::MonthDayNano
    • a triple of the number of (months, days, nanoseconds)
      • month is stored as 32-bit integers
      • day is stored as 32-bit integers
      • nanosecond is stored as 64 bit integers
      • 16 bytes in total
  • interval xxx output Interval(DayTime)
select interval '1 hour 1 second';
+------------------------------------------------+
| IntervalDayTime("3601000")                     |
+------------------------------------------------+
| 0 years 0 mons 0 days 1 hours 0 mins 1.00 secs |
+------------------------------------------------+
1 row in set. Query took 0.000 seconds.
  • interval xxx support floating number seconds
select interval '0.1 second';
+-------------------------------------------------+
| IntervalDayTime("100")                          |
+-------------------------------------------------+
| 0 years 0 mons 0 days 0 hours 0 mins 0.100 secs |
+-------------------------------------------------+
1 row in set. Query took 0.000 seconds.
  • if it's less than microsecond, it'll truncated
select interval '0.0001 second';
+------------------------------------------------+
| IntervalDayTime("0")                           |
+------------------------------------------------+
| 0 years 0 mons 0 days 0 hours 0 mins 0.00 secs |
+------------------------------------------------+
1 row in set. Query took 0.000 seconds.
  • we cannot add interval(DayTime) to Timestamp(NanoSecond, None), perhaps the reason here is the difference of resolution
select timestamp '2000-01-01Z' + interval '1 day';
Plan("'Timestamp(Nanosecond, None) + Interval(DayTime)' can't be evaluated because there isn't a common type to coerce the types to")
  • we can add interval(DayTime) to Date
select DATE '2000-01-01' + INTERVAL '1 day';
+--------------------------------------------------------------------+
| CAST(Utf8("2000-01-01") AS Date32) + IntervalDayTime("4294967296") |
+--------------------------------------------------------------------+
| 2000-01-01                                                         |
+--------------------------------------------------------------------+
1 row in set. Query took 0.000 seconds.
- it breaks while we have hour (or other smaller units) interval #3093 this is solved

Proposal

  • Consider make INTERVAL xxx outputs Interval(MonthDayNano) instead of Interval(DayTime) as it's easier to align with our Timestamp(NanoSecond, None)
  • Carefully design the outputs for operators like what postgresql has
  • we could think about whether we really need timestamp with time zone - timestamp ... While comparing Timestamp(TimeUnit, TimeZone) Timestamp(TimeUnit, None). the one with time zone will be converted to utc and drop the timezone (it simply drop the timezone internally).this is what postgresql has
willy=# select timestamp with time zone '2000-01-01T00:00:00Z' - timestamp '2000-01-01T00:00:00';
 ?column? 
----------
 08:00:00
(1 row)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions