Closed
Description
!!! Please correct me if i'm wrong !!!
Intro
- This changes breaks some original behaviors.
- UTC+8 is the local time zone in my examples.
- As our goal is to be Postgresql compatible, this proposal is mainly following
- I plan to do Data Type Formatting Function and Date/Time Functions and Operators after most of the items here are finalized
Design Principle
- align with postgresql in most cases
- use nanoseconds for the default timestamp resolution (unlike postgresql and pyspark's microseconds)
- use utc+0 as default timezone for timestamp with time zone
Let's Begin with Postgresql's Date/Time
Let's Start to Compare
Timestamp
Postgresql
- uses 8 bytes for both
timestamp
andtimestamp with time zone
. (note that time zone is included in these 8 bytes) - uses microsecond as resolution, which is the number of microseconds from
1970-01-01T00:00:00
- has 7 kinds of floating point precision, from
timestamp(0)
totimestamp(6)
timestamp(0)
rounds to secondstimestamp(3)
rounds to millisecondstimestamp(6)
rounds to microseconds
timestamp 'xxx'
outputtimestamp
- if
xxx
does't contain time zone info, it just works as what you think
willy=# select timestamp '2000-01-01T00:00:00';
timestamp
---------------------
2000-01-01 00:00:00
(1 row)
- if
xxx
contains time zone info, time zone is just ignored. (i believe that this is a surprise for some people) e.g.
willy=# select timestamp '2000-01-01T00:00:00+08:00';
timestamp
---------------------
2000-01-01 00:00:00
(1 row)
timestamp with time zone 'xxx'
outputtimestamp with time
1 ifxxx
contains no time zone, it assume it's local time
willy=# select timestamp with time zone '2000-01-01T00:00:00';
timestamptz
------------------------
2000-01-01 00:00:00+08
(1 row)
2 if xxx
contains time zone, it'll be converted to your local time zone
willy=# select timestamp with time zone '2000-01-01T00:00:00+02:00';
timestamptz
------------------------
2000-01-01 06:00:00+08
(1 row)
Datafusion
Timestamp(TimeUnit, Option<String>)
- we have
TimeUnit::Second
TimeUnit::MilliSecond
TImeUnit::MicroSecond
TimeUnit::NanoSecond
- which store number of seconds/millis/micros/nanos from
1970-01-01T00:00:00
- most of the timestamp related functions output
Timestamp(TimeUnit::NanoSecond, None)
- we have
- We only have
timestamp
literal but notimestamp with time zone
timestamp xxx
outputsTimestamp(TimeUnit::NanoSecond, None)
- if
xxx
contains no time zone, it automatically applies local time, parse it, convert it to utc time zone, and then drop the time zone ShouldCast(UTF-8 AS Timestamp)
apply local time zone? #3080
❯ select cast('2000-01-01T00:00:00' as timestamp);
+------------------------------------------------------------------+
| CAST(Utf8("2000-01-01T00:00:00") AS Timestamp(Nanosecond, None)) |
+------------------------------------------------------------------+
| 1999-12-31 16:00:00 |
+------------------------------------------------------------------+
1 row in set. Query took 0.004 seconds.
- if
xxx
contains time zone, it's parsed correctly, then converted to utc time zone, and then drop the time zone
❯ select timestamp '2000-01-01T00:00:00+02:00';
+------------------------------------------------------------------------+
| CAST(Utf8("2000-01-01T00:00:00+02:00") AS Timestamp(Nanosecond, None)) |
+------------------------------------------------------------------------+
| 1999-12-31 22:00:00 |
+------------------------------------------------------------------------+
1 row in set. Query took 0.000 seconds.
Proposal
- make
timestamp xxx
work like postgresql does - add
timestamp with time zone
, i believe there're lots of works and discussions to do: Cast Kernel Ignores Timezone arrow-rs#1936 Handle the timezone in extract week in temporal.rs arrow-rs#1380 Can not create a TimestampNanosecondArray that has a specified timezone arrow-rs#597 - make the local time zone as UTC by default (timestamp is hard, i think this could prevent some ambiguities)
- add
set time zone to xxx
to change the local time zone
Date
postgresql image
Posgresql
Datafusion
- among all the format above, we only support the first
1999-01-08
Proposal
- I don't think there're any issues for
Date
- We could consider add another 2 ISO 8601 formats (i.e.
19990108
and990108
) Chrono strictly follows ISO 8601. I think supporting all 8601 date formats makes sense.
Time
Postgresql
time xxx
outputtime
that requires 8 bytestime xxx with time zone
that requires 12 bytes, I have no idea why we need 4 more bytes here sincetimestamp with time zone
only requires 8 bytes
Datafusion
- We don't have
time
literal for now, let's wait for feat: Add support for TIME literal values #3010
Proposal
- I personally never used
time with time zone
. I have no clue when we need it.arrow-rs
'stime
datatype contains no timezone. Perhaps we need not to implement this. - let's wait for feat: Add support for TIME literal values #3010
Interval
Postgresql
Datafusion
reference: https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs#L237
Interval(IntervalUnit)
- we have following units
IntervalUnit::YearMonth
- number of months
- stored as 32-bit integer
IntervalUnit::DayTime
- stored as 2 contiguous 32-bit integers (days, millisseconds), 8 bytes in total
IntervalUnit::MonthDayNano
- a triple of the number of (months, days, nanoseconds)
- month is stored as 32-bit integers
- day is stored as 32-bit integers
- nanosecond is stored as 64 bit integers
- 16 bytes in total
interval xxx
outputInterval(DayTime)
❯ select interval '1 hour 1 second';
+------------------------------------------------+
| IntervalDayTime("3601000") |
+------------------------------------------------+
| 0 years 0 mons 0 days 1 hours 0 mins 1.00 secs |
+------------------------------------------------+
1 row in set. Query took 0.000 seconds.
interval xxx
support floating number seconds
❯ select interval '0.1 second';
+-------------------------------------------------+
| IntervalDayTime("100") |
+-------------------------------------------------+
| 0 years 0 mons 0 days 0 hours 0 mins 0.100 secs |
+-------------------------------------------------+
1 row in set. Query took 0.000 seconds.
- if it's less than microsecond, it'll truncated
❯ select interval '0.0001 second';
+------------------------------------------------+
| IntervalDayTime("0") |
+------------------------------------------------+
| 0 years 0 mons 0 days 0 hours 0 mins 0.00 secs |
+------------------------------------------------+
1 row in set. Query took 0.000 seconds.
- we cannot add
interval(DayTime)
toTimestamp(NanoSecond, None)
, perhaps the reason here is the difference of resolution
❯ select timestamp '2000-01-01Z' + interval '1 day';
Plan("'Timestamp(Nanosecond, None) + Interval(DayTime)' can't be evaluated because there isn't a common type to coerce the types to")
- we can add
interval(DayTime)
toDate
❯ select DATE '2000-01-01' + INTERVAL '1 day';
+--------------------------------------------------------------------+
| CAST(Utf8("2000-01-01") AS Date32) + IntervalDayTime("4294967296") |
+--------------------------------------------------------------------+
| 2000-01-01 |
+--------------------------------------------------------------------+
1 row in set. Query took 0.000 seconds.
- We don't have
Time
now, let's wait for feat: Add support for TIME literal values #3010
Proposal
- Consider make
INTERVAL xxx
outputs Interval(MonthDayNano) instead of Interval(DayTime) as it's easier to align with ourTimestamp(NanoSecond, None)
- Carefully design the outputs for operators like what postgresql has
we could think about whether we really needWhile comparingtimestamp with time zone - timestamp
...Timestamp(TimeUnit, TimeZone)
Timestamp(TimeUnit, None)
. the one with time zone will be converted to utc and drop the timezone (it simply drop the timezone internally).this is what postgresql has
willy=# select timestamp with time zone '2000-01-01T00:00:00Z' - timestamp '2000-01-01T00:00:00';
?column?
----------
08:00:00
(1 row)
Metadata
Metadata
Assignees
Labels
No labels