Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Jun 10, 2020

What changes were proposed in this pull request?

Fix the bug in microseconds rebasing during transitions from one standard time zone offset to another one. In the PR, I propose to change the implementation of rebaseGregorianToJulianMicros which performs rebasing via local timestamps. In the case of overlapping:

  1. Check that the original instant belongs to earlier or later instant of overlapped local timestamp.
  2. If it is an earlier instant, take zone and DST offsets from the previous day otherwise
  3. Set time zone offsets to Julian timestamp from the next day.

Note: The fix assumes that transitions cannot happen more often than once per 2 days.

Why are the changes needed?

Current implementation handles timestamps overlapping only during daylight saving time but overlapping can happen also during transition from one standard time zone to another one. For example in the case of Asia/Hong_Kong, the time zone switched from Japan Standard Time (UTC+9) to Hong Kong Time (UTC+8) on Sunday, 18 November, 1945 01:59:59 AM. The changes allow to handle the special case as well.

Does this PR introduce any user-facing change?

It might affect micros rebasing in before common era when not-optimised version of rebaseGregorianToJulianMicros() is used directly.

How was this patch tested?

  1. By existing tests in DateTimeUtilsSuite, RebaseDateTimeSuite, DateFunctionsSuite, DateExpressionsSuite and TimestampFormatterSuite.
  2. Added new test to RebaseDateTimeSuite
  3. Regenerated gregorian-julian-rebase-micros.json with the step of 30 minutes, and got the same JSON file. The JSON file isn't affected because previously it was generated with the step of 1 week. And the spike in diffs/switch points during 1 hour of timestamp overlapping wasn't detected.

@SparkQA
Copy link

SparkQA commented Jun 10, 2020

Test build #123775 has finished for PR 28787 at commit 6227ef6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Jun 11, 2020

The build #28787 (comment) failed on the assert for JDK calls:

      val ldt = LocalDateTime.of(1945, 11, 18, 1, 30, 0)
      val earlierMicros = instantToMicros(ldt.atZone(hkZid).withEarlierOffsetAtOverlap().toInstant)
      val laterMicros = instantToMicros(ldt.atZone(hkZid).withLaterOffsetAtOverlap().toInstant)
      assert(earlierMicros + MICROS_PER_HOUR === laterMicros)

because jenkins uses "old" JDK:

JENKINS_MASTER_HOSTNAME=amp-jenkins-master
JAVA_HOME=/usr/java/jdk1.8.0_191

which has an outdated time zone database, see https://bugs.openjdk.java.net/browse/JDK-8228469

     Hong Kong's 1941-06-15 spring-forward transition was at 03:00, not
     03:30. Its 1945 transition from JST to HKT was on 11-18 at 02:00,
     not 09-15 at 00:00. In 1946 its spring-forward transition was on
     04-21 at 00:00, not the previous day at 03:30. From 1946 through
     1952 its fall-back transitions occurred at 04:30, not at 03:30.
     In 1947 its fall-back transition was on 11-30, not 12-30.
     (Thanks to P Chan.)

@cloud-fan @HyukjinKwon @dongjoon-hyun @srowen Can we upgrade JDK on jenkins machines?

@MaxGekk MaxGekk changed the title [SPARK-31959][SQL] Fix Gregorian-Julian micros rebasing while switching standard time zone offset [SPARK-31959][SQL][test-java11] Fix Gregorian-Julian micros rebasing while switching standard time zone offset Jun 11, 2020
@MaxGekk
Copy link
Member Author

MaxGekk commented Jun 11, 2020

jenkins, retest this, please

@SparkQA
Copy link

SparkQA commented Jun 11, 2020

Test build #123836 has finished for PR 28787 at commit c655c8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 11, 2020

Test build #123846 has finished for PR 28787 at commit c655c8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

cloud-fan commented Jun 12, 2020

More context:

To do datetime rebase more efficiently, we put the diffs between two calendars in JSON files, to avoid calculating the diffs every time.

To generate the JSON file, we run the rebase function for all the datetime values from year 0001 to year 2037, using one week as the step. This is not accurate and we miss diffs during some short period.

The good news is: although the JSON files do not match the rebase function, they are correct. The rebase function has a bug. We have a random test to make sure the rebase function matches the generated JSON files, and that's why we found this bug.

That said, this is not a real bug as it doesn't affect end users, but it makes the test flaky.

@cloud-fan
Copy link
Contributor

cloud-fan commented Jun 12, 2020

thanks, merging to master!

@cloud-fan cloud-fan closed this in c259844 Jun 12, 2020
MaxGekk added a commit to MaxGekk/spark that referenced this pull request Jun 12, 2020
…while switching standard time zone offset

Fix the bug in microseconds rebasing during transitions from one standard time zone offset to another one. In the PR, I propose to change the implementation of `rebaseGregorianToJulianMicros` which performs rebasing via local timestamps. In the case of overlapping:
1. Check that the original instant belongs to earlier or later instant of overlapped local timestamp.
2. If it is an earlier instant, take zone and DST offsets from the previous day otherwise
3. Set time zone offsets to Julian timestamp from the next day.

Note: The fix assumes that transitions cannot happen more often than once per 2 days.

Current implementation handles timestamps overlapping only during daylight saving time but overlapping can happen also during transition from one standard time zone to another one. For example in the case of `Asia/Hong_Kong`, the time zone switched from `Japan Standard Time` (UTC+9) to `Hong Kong Time` (UTC+8) on _Sunday, 18 November, 1945 01:59:59 AM_. The changes allow to handle the special case as well.

It might affect micros rebasing in before common era when not-optimised version of `rebaseGregorianToJulianMicros()` is used directly.

1. By existing tests in `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite` and `TimestampFormatterSuite`.
2. Added new test to `RebaseDateTimeSuite`
3. Regenerated `gregorian-julian-rebase-micros.json` with the step of 30 minutes, and got the same JSON file. The JSON file isn't affected because previously it was generated with the step of 1 week. And the spike in diffs/switch points during 1 hour of timestamp overlapping wasn't detected.

Closes apache#28787 from MaxGekk/HongKong-tz-1945.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c259844)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@dongjoon-hyun
Copy link
Member

Hi, All. As I commented on the backporting PR, this causes a weird consistent failure on research-jenkins-worker-09. Could you take a look at that?

cloud-fan pushed a commit that referenced this pull request Jun 16, 2020
…s pure

### What changes were proposed in this pull request?
1. Set the given time zone as the first parameter of `RebaseDateTime`.`rebaseJulianToGregorianMicros()` and `rebaseGregorianToJulianMicros()` to Java 7 `GregorianCalendar`.
```scala
    val cal = new Calendar.Builder()
      // `gregory` is a hybrid calendar that supports both the Julian and Gregorian calendar systems
      .setCalendarType("gregory")
    ...
      .setTimeZone(tz)
      .build()
```
This makes the instance of the calendar independent from the default JVM time zone.

2. Change type of the first parameter from `ZoneId` to `TimeZone`. This allows to avoid unnecessary conversion from `TimeZone` to `ZoneId`, for example in
```scala
  def rebaseJulianToGregorianMicros(micros: Long): Long = {
    ...
      if (rebaseRecord == null || micros < rebaseRecord.switches(0)) {
        rebaseJulianToGregorianMicros(timeZone.toZoneId, micros)
```
and back to `TimeZone` inside of `rebaseJulianToGregorianMicros(zoneId: ZoneId, ...)`

3. Modify tests in `RebaseDateTimeSuite`, and set the default JVM time zone only for functions that depend on it.

### Why are the changes needed?
1. Ignoring passed parameter and using a global variable is bad practice.
2. Dependency from the global state doesn't allow to run the functions in parallel otherwise there is non-zero probability that the functions may return wrong result if the default JVM is changed during their execution.
3. This open opportunity for parallelisation of JSON files generation `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`. Currently, the tests `generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'` generate the JSON files by iterating over all time zones sequentially w/ step of 1 week. Due to the large step, we can miss some spikes in diffs between 2 calendars (Java 8 Gregorian and Java 7 hybrid calendars) as the PR #28787 fixed and #28816 should fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running existing tests from `RebaseDateTimeSuite`.

Closes #28824 from MaxGekk/pure-micros-rebasing.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon pushed a commit that referenced this pull request Jun 17, 2020
### What changes were proposed in this pull request?
1. Change the max step from 1 week to 30 minutes in the tests `RebaseDateTimeSuite`.`generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'`.
2. Parallelise JSON files generation in the function `generateRebaseJson` by using `ThreadUtils.parmap`.

### Why are the changes needed?
1. To prevent the bugs that are fixed by #28787 and #28816.
2. The parallelisation speeds up JSON file generation.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By generating the JSON file `julian-gregorian-rebase-micros.json`.

Closes #28827 from MaxGekk/rebase-30-min.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@MaxGekk MaxGekk deleted the HongKong-tz-1945 branch December 11, 2020 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants